Data Cohesion: From Similarity Comparisons to Clustering
We often want to observe the shape of our data and use clustering and data visualization methods to do so. These methods typically require that our data be described with respect to a relatively small set of variables or that we provide distances among all pairs of points. For many interesting problems, however, this initial step can be quite challenging. In such cases, we may instead wish to work from a set of responses to similarity comparisons (e.g., among x, y, and z, which one is the outlier?). I will introduce cohesion, a new measure of relative proximity that is built on this comparison framework, and show how cohesion offers a perspective on our data that is quite different from distance alone and can help address challenges that arise in high-dimensional settings. I will also share some initial progress toward the development of cohesion-based methods for clustering and data visualization.
Abstract: Understanding community responses to climate is critical for anticipating the future impacts of global change. However, despite increased research efforts in this field, models that explicitly include important biological mechanisms are lacking. Quantifying the potential impacts of climate change on species is complicated by the fact that the effects of climate variation may manifest at several points in the biological process. To this end, we formulate a dynamic mechanistic model that combines population dynamics, such as species interactions, with species redistribution by allowing climate to affect both processes. We examine their relative contributions in an application to the changing biomass of a community of eight species in the Gulf of Maine using over 30 years of fisheries data from the Northeast Fisheries Science Center. Our model suggests that the mechanisms driving biomass trends vary across space, time, and species.
Conditional power for cluster-randomized trials with interval-censored endpoints
Cluster-randomized trials (CRTs) of infectious disease progression often result in data where individuals belonging to the same contact networks and communities are more likely to be similar to one another. In addition, their infection status may be assessed only at intermittent study visits. The design, monitoring, and analysis of these CRTs must account for this data structure. I will discuss a flexible, simulation-based framework for conducting interim monitoring when outcomes are correlated and interval-censored and will show that this approach produces valid estimates of a trial’s ultimate probability of success (termed the conditional power) across a range of data-generating mechanisms and CRT design considerations. The framework also has high accuracy in classifying trials as futile based on available interim data. I will illustrate its use by applying it to the Botswana Combination Prevention Project, a cluster-randomized HIV prevention trial.
Talks in the Statistics and Data Science Colloquium are free and open to the public. They are intended to be accessible to a broad audience with some background in statistics and data science. Junior and senior statistics majors are expected to attend talks in the SDS Colloquia. Please reach out to Professor Nicholas Horton in case of conflicts.
Seeley Mudd Hall is located at the southwest corner of the first-year Quadrangle (31 Quadrangle Drive). Paid parking is available at the Amherst Town Common and Boltwood Drive (approximately an 8-minute walk). PVTA bus service is available from the Converse Hall stop.
Last updated October 5, 2023
Copyright © 2023 Amherst College. All rights reserved.