REU Research Projects

This year’s projects are subject to change depending on student interests, but here is a general idea of what to expect.

Drug safety, or pharmacovigilance, is a critical component of public health. Although clinical trials identify many potential adverse events (AEs), their limited sample size and duration make it difficult to detect rare outcomes or complex drug–drug interactions (DDIs). Post-marketing surveillance systems—such as the FDA Adverse Event Reporting System (FAERS) and the Vaccine Adverse Event Reporting System (VAERS)—are therefore essential for detecting safety signals in real-world populations. However, these spontaneous reporting systems (SRSs) are voluntary and often contain high-dimensional, sparse data subject to confounding and reporting biases, posing significant methodological and computational challenges. 

To address these issues, Tan et al. (2020) proposed a two-stage hierarchical modeling framework: first, it screens for associations between a drug and classes of AEs (based on system organ class), then it conducts targeted follow-up on specific AEs within the flagged classes. This approach integrates random-effects models with joint hypothesis testing to improve statistical power and control false positives. 
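As background for the screening stage, the simplest SRS signal-detection statistic is a disproportionality measure such as the reporting odds ratio (ROR), computed from a 2x2 contingency table of reports. The sketch below is a standard textbook calculation, not the Tan et al. (2020) method itself, and the counts are hypothetical:

```python
import math

def reporting_odds_ratio(a, b, c, d):
    """ROR with a 95% confidence interval from a 2x2 drug/AE table.

    a: reports with the drug and the AE    b: drug, without the AE
    c: other drugs with the AE             d: other drugs, without the AE
    """
    ror = (a * d) / (b * c)
    # Standard error of log(ROR) via the delta method.
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(ror) - 1.96 * se)
    hi = math.exp(math.log(ror) + 1.96 * se)
    return ror, lo, hi

# Hypothetical counts: 20 of 1000 reports for the drug mention the AE,
# versus 40 of 9000 reports for all other drugs.
ror, lo, hi = reporting_odds_ratio(20, 980, 40, 8960)
print(f"ROR = {ror:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# → ROR = 4.57, 95% CI (2.66, 7.85)
```

A drug–AE pair is conventionally flagged when the lower confidence bound exceeds 1; hierarchical approaches like the one above refine this idea by borrowing strength across related AEs within a system organ class.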

In this project, we will first evaluate the performance of this hierarchical framework against state-of-the-art machine learning methods—such as Gradient Boosting Machines (GBM) and Random Forests (RF) (Bae et al., 2021)—for safety signal detection in SRS databases. We will then extend the framework by incorporating additional data sources, such as gene expression profiles (Mohsen et al., 2020) and electronic health records (Hu et al., 2024), to enhance its ability to detect diverse and complex AEs. 

Students interested in this project should have a solid foundation in linear algebra, probability, and regression modeling, as well as programming experience in Python or R. 

In many applications a matched-pairs design is used to control variability between subjects and allow for more precise treatment comparisons. However, missing data may occur because of the inability to obtain one of the measurements. For example, in a clinical trial to compare two methods of eye laser surgery (Dubnicka et al., 2002), patients may have one eye assigned to the new method and the other eye to the current method. It may then be discovered that some patients have only one eye eligible for study. Another example is a design in which subjects are measured at two points in time with an intervention in between. Subjects may be lost to follow-up, and thus no observation at the second time point is available. In these situations, a mixture of complete and incomplete pairs of data occurs.  

One strategy to deal with such data is to simply ignore the unpaired observations and analyze only the complete pairs. However, this may introduce bias due to systematically missing data, and there may be a loss of power to detect treatment differences. Ideally, information from the incomplete pairs could be combined with that from the complete pairs. 

Parametric and nonparametric approaches have been proposed to incorporate information from incomplete pairs (Lin and Stivers, 1974; Bhoj, 1978, 1989; Dubnicka et al., 2002; Einsporn and Habtzghi, 2013; Derrick, 2020; Johnson, 2022). However, the effect of several factors on the performance of these methods is not well understood. In this project we will study one or more of the following questions:  

  • What is the effect of relative sample size and covariance structure between paired observations on the tests?  
  • Can the methods be extended to more than two treatments or groups?  
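To make the idea of combining complete and incomplete pairs concrete, one simple strategy in the spirit of the methods cited above is a weighted Stouffer-type combination: a paired statistic from the complete pairs and a two-sample statistic from the unpaired observations, pooled with weights proportional to the square root of each sample size. This is a minimal sketch under a normal approximation, not one of the published tests:

```python
import math
import statistics as st

def combined_z(pairs, x_only, y_only):
    """Combine evidence from complete and incomplete pairs (sketch).

    pairs  : list of (x, y) complete pairs
    x_only : unpaired observations from the first treatment
    y_only : unpaired observations from the second treatment
    Returns a combined z statistic; large |z| suggests a treatment difference.
    """
    # Paired component: one-sample z on the within-pair differences.
    d = [x - y for x, y in pairs]
    n1 = len(d)
    z1 = st.mean(d) / (st.stdev(d) / math.sqrt(n1))

    # Unpaired component: two-sample z on the leftover observations.
    nx, ny = len(x_only), len(y_only)
    se2 = math.sqrt(st.variance(x_only) / nx + st.variance(y_only) / ny)
    z2 = (st.mean(x_only) - st.mean(y_only)) / se2

    # Stouffer combination with sqrt(sample size) weights.
    w1, w2 = math.sqrt(n1), math.sqrt(min(nx, ny))
    return (w1 * z1 + w2 * z2) / math.sqrt(w1**2 + w2**2)
```

How the weights, the relative sample sizes, and the covariance between paired observations affect the power of such combinations is exactly the kind of question the project would investigate by simulation.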

Interested students should have completed introductory courses in both mathematical statistics and applied statistical methods and should have familiarity with a programming language. 

Artificial Intelligence (AI) is rapidly transforming technology, science, and society. At the heart of many AI systems are neural networks (NNs), which have shown remarkable performance across a range of tasks. However, neural networks are often seen as black boxes: it remains challenging to understand when, why, and how well they work. 

Recent research has revealed a deep connection between neural networks and Gaussian processes (GPs) (Lee et al., 2018). The study of GPs is a fundamental topic in probability and statistics, and GPs are primarily characterized by their covariance functions. The architecture of a neural network, including its activation functions, initialization, depth, and convolutional layers, can therefore be reflected in the covariance function of the corresponding GP.  

This REU project aims to explore and leverage this connection. By translating NN architectures into GP models, we can develop a mathematically tractable and computationally scalable approach to analyzing and designing better-performing neural networks.  
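As a small illustration of the translation, the covariance function of the GP corresponding to an infinitely wide, fully connected ReLU network can be computed layer by layer with the arc-cosine kernel recursion used in Lee et al. (2018). The sketch below assumes standard Gaussian initialization with weight variance sw2 and bias variance sb2 (both are parameters of the sketch, not fixed by the source):

```python
import math

def nngp_relu_kernel(x1, x2, depth, sw2=2.0, sb2=0.0):
    """NNGP covariance k(x1, x2) for a deep fully connected ReLU network.

    Starts from the input-layer covariance and applies the arc-cosine
    kernel recursion once per hidden layer.
    """
    d = len(x1)
    # Layer-0 covariances induced by the first linear layer.
    k11 = sb2 + sw2 * sum(a * a for a in x1) / d
    k22 = sb2 + sw2 * sum(b * b for b in x2) / d
    k12 = sb2 + sw2 * sum(a * b for a, b in zip(x1, x2)) / d
    for _ in range(depth):
        # Angle between the two inputs under the current kernel.
        c = max(-1.0, min(1.0, k12 / math.sqrt(k11 * k22)))
        theta = math.acos(c)
        k12 = sb2 + sw2 / (2 * math.pi) * math.sqrt(k11 * k22) * (
            math.sin(theta) + (math.pi - theta) * math.cos(theta))
        # Diagonal entries: E[ReLU(z)^2] = K/2 for centered Gaussian z.
        k11 = sb2 + sw2 / 2 * k11
        k22 = sb2 + sw2 / 2 * k22
    return k12
```

Changing the activation function, depth, or initialization variances changes this recursion, which is precisely how architectural choices become visible, and comparable, at the GP level.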

Dr. Huang proposes the following research directions for REU participants: 

  • Derive GP covariance functions for different NN architectures, including varying initialization, depths, and activation functions. 
  • Use the GP perspective to guide the selection of effective NN architectures, such as choices of initialization, number of layers, and activation functions. 
  • Conduct simulations to evaluate the theoretical derivations. Participants will apply the approach to well-known AI benchmark datasets, such as MNIST and CIFAR, and assess its practical effectiveness. 

These projects align closely with Dr. Huang’s ongoing research interests in stochastic processes, Markov random fields, and neural networks (Liu et al., 2025), and offer students the opportunity to contribute to an active and evolving area of work at the intersection of statistics, machine learning and artificial intelligence. 

Students who want to pursue this project should have taken a course on probability and statistics, linear regression, and should have some familiarity with Python or R.  

The analysis of data to extract insights is a crucial part of many scientific disciplines. Today, this analysis is increasingly being assisted by complex machine learning and artificial intelligence methods. Data that contains information about shape or location (which we call spatial data) motivates the use of specialized data science methods that draw on theory from topology and geometry. Such methods are designed to extract insights that might remain hidden to classical machine learning algorithms. Two important examples of spatial data are images (pixel values arranged in a grid) and geospatial datasets (where each data point has an associated location). A fast-growing area of mathematics which has shown promising results in analyzing spatial data is topological data analysis (TDA). TDA uses theory from algebraic topology to discover the “shape” of data by detecting clusters, holes, or higher-dimensional features. Dr. Weighill proposes projects in this general area for the participants in the REU to investigate. More specifically, the following topics are proposed: 

  • Creating a framework for the sensitivity analysis of topological summaries of geospatial data to mitigate measurement error and/or reveal outliers in the data. 
  • Demonstrating the effectiveness of topological data analysis methods for invariant machine learning on image data by benchmarking them against the state of the art on corrupted or deformed datasets. 
  • Increasing the robustness of topological data analysis methods on geospatial data by incorporating population-aware smoothing and diffusion operations, automatically identifying topological features that are robust against smoothing operations.  
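The cluster-detection side of TDA can be illustrated without any topology background: the 0-dimensional persistent homology of a point cloud records the scale at which clusters merge, and its death times are exactly the edge lengths of a minimum spanning tree. A minimal union-find sketch (pure Python, hypothetical point data):

```python
import math
from itertools import combinations

def h0_persistence(points):
    """Death times of the 0-dimensional features (clusters) of a point cloud.

    Kruskal-style union-find over edges sorted by Euclidean length: each
    merge kills one connected component at that distance. All components
    are born at scale 0; one component survives forever and is omitted.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the component root, with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2))
    deaths = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(dist)  # one component dies at this scale
    return deaths

# Two well-separated triples: four small deaths, then one large one,
# so the persistence diagram reveals two clusters.
deaths = h0_persistence([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)])
```

A large gap between the small death times and the final one is the topological signature of well-separated clusters; the proposed projects study how stable such signatures remain under measurement error, smoothing, and deformation.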

These projects fit into Dr. Weighill’s broader program of using geometric and topological methods for analyzing spatial data; see for example the study of demographic data in Kauba and Weighill (2024). They also track recent trends in using TDA for challenging machine learning tasks such as texture classification (Chung et al., 2020). 

Students who undertake these projects need not arrive with a background in topology, but they should have some experience coding and some familiarity with graphs, vector spaces and basic statistics.