REU Site in Computational Statistics
May 22-July 28, 2023
Complex Data Analysis Using Statistical and Machine Learning Tools
UNC Greensboro, one of the campuses of the University of North Carolina System, will offer a 10-week REU program from May 22 to July 28, 2023, for 10 nationally recruited undergraduate students in the mathematical sciences. The program will be funded by NSF grant DMS-2244160, which we anticipate receiving.
The focus of the program will be on Complex Data Analysis using Statistical and Machine Learning Tools. The ten students will be divided into 4-5 research teams of 2-3 students, each headed by a team of faculty mentors. Students will be able to choose from a wide range of projects covering topics such as high-dimensional data analysis, machine learning, robust data analysis, genetics, topological data analysis, and data confidentiality.
Emphasis during the training will be on both theory and applications. In addition to focused research on these topics, the program will offer participants broad professional development training through various workshops and invited lectures. The program will also include plenty of social activities, such as weekly working lunches and occasional dinners (paid for by the program) and casual conversations on a variety of topics.
For information, contact Program Director Dr. Sat Gupta by email at sngupta@uncg.edu, or the Mathematics and Statistics Department at mathstats@uncg.edu.
Research Mentors
Sat Gupta (PI) sngupta@uncg.edu
Dr. Sat Gupta will serve as PI for this project. Dr. Gupta is a Professor in the Department of Mathematics and Statistics at UNC Greensboro. He has earned PhD degrees in both Mathematics and Statistics. He is a Fellow of the American Statistical Association and has won many awards, including UNC Greensboro’s Senior Research Excellence Award (2017). His main area of research is survey sampling, with particular interest in surveys involving sensitive topics. His 150+ journal articles include articles co-authored with students at all levels, including undergraduate students. Dr. Gupta was also the Site PI for the previous REU grants during the summers of 2018 and 2020-22.
Jianping Sun (Co-PI) j_sun4@uncg.edu
Dr. Jianping Sun is a statistician who has been working at UNCG since August 2018. She had postdoctoral and industry experience before joining UNCG. Her interests span both statistical methodology and applied research in analyzing high-dimensional complex genomic data.
http://mathstats.uncg.edu/people/directory/sun
Sadia Khalil (Senior Personnel) s_khali2@uncg.edu
Dr. Sadia Khalil joined the Mathematics and Statistics Department at UNCG in 2022. She holds a PhD in Statistics from the National College of Business Administration and Economics (NCBA&E) Lahore, Pakistan (2017).
https://sites.google.com/view/sadia-khalil
Scott Richter (Senior Personnel) sjricht2@uncg.edu
Dr. Scott Richter has a PhD in Statistics (Oklahoma State University). He is Director of the Statistical Consulting Center at UNCG. His research involves nonparametric methods, especially methods using resampling. Dr. Richter has received extramural support multiple times as Senior Personnel to train young researchers, including REU and UMB programs sponsored by NSF.
http://www.uncg.edu/~sjricht2/
Thomas Weighill (Senior Personnel) t_weighill@uncg.edu
Dr. Thomas Weighill joined the Mathematics and Statistics Department at UNCG in 2021. Before coming to UNCG, he completed a postdoc at the MGGG Redistricting Lab at Tufts University under the supervision of Moon Duchin. Dr. Weighill’s research uses topological and geometric methods in data science, with a particular focus on geographic and election data.
Research Projects
2023 projects will be continuations of projects completed 2020–2022.
2022
An Application of Markov Chain Composite Likelihood in Recombination Model
Hierarchical evolution trees are critical in human genome research for investigating human evolution and identifying disease-associated genetic markers. New high-throughput genome sequencing technologies create an urgent need for statistical methods that can construct hierarchical evolution trees from long genome sequences quickly while accounting for various biological complexities. To this end, a recombination model has recently been developed, and a Markov chain composite likelihood (MCCL) method has been proposed to make model estimation computationally feasible. To further reduce computational complexity, this project will design a novel, computationally efficient algorithm, a left-to-right sequential estimator, and evaluate its performance through simulation studies to assess its potential for use with long-sequence genome data.
Binary Randomized Response Technique Models Under Measurement Errors
In real-world surveys, measurement error, the difference between the actual value of the variable being measured and its recorded value, is inevitable. Many authors in the field of Randomized Response Technique (RRT) have studied the impact of measurement error on quantitative RRT models, but no studies have examined the impact of measurement error on binary RRT models. In this study, we propose a binary RRT model under measurement error based on the previous work of Warner (1965). A simulation study is presented to validate the theoretical findings. Simulations show that the measurement error factor cannot be ignored when using binary RRT models, and that the proposed estimator for the binary RRT model under measurement error performs very well.
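For readers new to RRT, the following is a minimal sketch of the classical Warner (1965) design that the project builds on, without any measurement error term; it is not the estimator proposed in the project. Each respondent answers the direct statement with probability p and its complement otherwise, so the probability of a "yes" is λ = pπ + (1 − p)(1 − π), and the prevalence π is recovered as (λ̂ − (1 − p)) / (2p − 1). The function names and the default values for `true_pi`, `p`, `n`, and `seed` are assumptions chosen for illustration.

```python
import random


def warner_response(has_trait: bool, p: float, rng: random.Random) -> bool:
    """Reported yes/no answer for one respondent under Warner's design."""
    # With probability p the respondent answers "I have the trait";
    # otherwise the respondent answers "I do not have the trait".
    if rng.random() < p:
        return has_trait
    return not has_trait


def warner_estimate(responses, p: float) -> float:
    """Estimate the trait prevalence from the scrambled yes/no answers."""
    if abs(2 * p - 1) < 1e-12:
        raise ValueError("p must differ from 0.5, or pi is not identifiable")
    lam_hat = sum(responses) / len(responses)  # observed proportion of "yes"
    return (lam_hat - (1 - p)) / (2 * p - 1)


def simulate(true_pi: float = 0.3, p: float = 0.7, n: int = 20000, seed: int = 1) -> float:
    """Simulate n truthful respondents and return the estimated prevalence."""
    rng = random.Random(seed)
    responses = [warner_response(rng.random() < true_pi, p, rng) for _ in range(n)]
    return warner_estimate(responses, p)


if __name__ == "__main__":
    print(f"estimated prevalence: {simulate():.3f}")
```

No individual answer reveals a respondent's true status, yet the aggregate estimate is unbiased for π when respondents answer truthfully and responses are recorded without error.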
Detecting Spatial Dependence with Persistent Homology
We propose and demonstrate a topological test for spatial dependence based on the framework of persistent homology. We compare our method to Moran’s I, a classical measure of spatial auto-correlation, on synthetic datasets as well as on election and COVID data. We find about 65-75% agreement between the main variant of our method and Moran’s I on real datasets. While the Moran’s I test is more sensitive overall on these datasets, there are instructive instances (synthetic and real) where our method detects a spatial pattern that the Moran’s I test does not.
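As background for the comparison above, here is a small sketch of how Moran's I is computed from a vector of regional values and a spatial weight matrix, using the standard formula I = (n / W) Σ w_ij (x_i − x̄)(x_j − x̄) / Σ (x_i − x̄)². The four-region "path" adjacency matrix and the example values are assumptions for illustration only; the sketch does not implement the persistent homology test.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]  # deviations from the mean
    w_total = sum(sum(row) for row in weights)  # sum of all weights
    denominator = sum(d * d for d in dev)
    if w_total == 0 or denominator == 0:
        raise ValueError("Moran's I is undefined for zero weights or constant values")
    numerator = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / w_total) * (numerator / denominator)


# Hypothetical example: four regions in a row, each adjacent to its neighbors.
PATH_WEIGHTS = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

if __name__ == "__main__":
    print(morans_i([1, 1, 0, 0], PATH_WEIGHTS))  # similar values are adjacent
    print(morans_i([1, 0, 1, 0], PATH_WEIGHTS))  # values alternate
```

Positive values indicate that similar values tend to be neighbors, and negative values indicate that neighbors tend to differ.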
Adaptive Tests for Mixed Paired and Two-Sample Designs
This study proposes adaptive tests for mixed paired and two-sample designs using t-based and Wilcoxon-based tests. Previous simulation studies have found that t-based tests perform well for normal data, while Wilcoxon-based tests perform well for non-normal data. The proposed adaptive tests use two different tail index combination schemes to distinguish between normal and non-normal mixed pairs data to select the situationally more powerful test. A simulation study is conducted to estimate the power and Type I error rate of the proposed adaptive tests, compared to using their constituent tests uniformly. The proposed adaptive tests tended to have power comparable to the test that performed better under the particular distribution and provide a distribution-free approach when there are no assumptions or knowledge of the underlying distribution.
2021
A Computational Efficient Method for Constructing Hierarchical Trees
Hierarchical trees are essential in genomic research because they provide major tools for scientists to study evolutionary history and detect disease-associated genomes. Rapidly developing sequencing technology, such as next-generation sequencing, has enabled researchers to obtain whole-genome DNA sequences at a relatively low price. Hence there is an urgent need to develop statistical methods that can construct hierarchical trees from long DNA sequences while taking various biological complexities into account. In this project, we will develop a novel method for constructing hierarchical trees that accounts simultaneously for significant evolutionary factors such as mutation and recombination. A computationally efficient algorithm will also be designed to implement the proposed methodology and make it practical for long sequences.
Subdata Selection Methods
Data is everywhere and there is a huge amount of it. There are opportunities, more than ever, to answer relevant questions empirically by using the humongous amount of available data. For example, Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data. As another example, a transportation company like Uber has tremendous amounts of consumer preference data that it uses to predict supply, demand, location of drivers, and the fares set for every trip. Analyzing data of this size, if feasible at all, requires gigantic computational resources and the development of novel methods. But even for smaller big data sets, depending on the available computational platform and how often an analysis or exploration needs to be performed, the computational burden can be considerable. For that reason, methods have been developed to conduct an analysis based on only some of the data, referred to as subdata. The research question that we are interested in answering is which subdata is good subdata when (a) prediction is the goal, and (b) a good representation of the population is the goal. In this project, we expect to come up with new methods of subdata selection while using big data techniques to analyze the data from different subdata selection methods. Even though the statistical theory can be (and should be) built around the procedures used, it is extremely hard, and we will try to answer most of the questions computationally. It is an ideal project for someone who has some knowledge of statistics and programming, has an interest in learning big data analysis techniques, and is passionate about coding (in R or Python – if you want to have access to some pre-existing code).
Untruthful Responding in Randomized Response Models
Randomized response technique (RRT) models are important survey tools when dealing with potentially sensitive questions with legal or social implications. These models allow respondents to provide scrambled (noise-added) responses which are later unscrambled at an aggregate level but not at an individual level. Some of the respondents may not trust the privacy protection provided by RRT models and may still provide untruthful responses. While there is no way to identify and fix the untruthful responses, one can measure the level of untruthfulness and account for it in the final estimates. The 2020 REU group considered this problem in the binary response setting. We plan to carry that study forward and consider a similar problem for quantitative response situations. Untruthful responding is part of the larger problem of Measurement Error in statistical estimation.
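The idea of unscrambling at the aggregate level but not the individual level can be illustrated with a short sketch. It assumes an additive scrambling scheme, in which each respondent reports the true value plus a random draw from a noise distribution whose mean is known to the researcher, and it assumes every respondent answers truthfully. The population parameters, the noise distribution, and all function names are hypothetical.

```python
import random
from statistics import mean


def scramble(y: float, rng: random.Random,
             noise_mean: float = 20.0, noise_sd: float = 5.0) -> float:
    """Reported value: the true value plus privately generated noise."""
    return y + rng.gauss(noise_mean, noise_sd)


def estimate_mean(scrambled, noise_mean: float = 20.0) -> float:
    """Unscramble in aggregate by subtracting the known noise mean."""
    return mean(scrambled) - noise_mean


def simulate(n: int = 20000, seed: int = 7):
    """Return (mean of the true values, RRT estimate of that mean)."""
    rng = random.Random(seed)
    true_values = [rng.gauss(50.0, 10.0) for _ in range(n)]  # hypothetical population
    reported = [scramble(y, rng) for y in true_values]
    return mean(true_values), estimate_mean(reported)


if __name__ == "__main__":
    true_mean, rrt_estimate = simulate()
    print(f"true mean {true_mean:.2f}, RRT estimate {rrt_estimate:.2f}")
```

A single reported value cannot be mapped back to its true value because the individual noise draw is unknown, while the sample mean of the reports still recovers the population mean. Untruthful responding would break the truthfulness assumption made here, which is the problem the project addresses.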
Topological Data Analysis and Firn
Topological data analysis (TDA) is a rising field at the intersection of Mathematics, Statistics, and Machine Learning. Techniques from this field have proven successful in analyzing a variety of scientific problems and datasets. The main driving force in TDA is the development of persistent homology, which studies the intrinsic shape of data. The main goal of the project is to direct TDA tools at understanding microstructure and fluid flow in porous media. The main application of this project is firn, a type of ice core data. The project is interdisciplinary. Students will join a team of mathematicians, statisticians, and climatologists to work on the gas age-ice age problem in climate science.
Ranked Data Analysis of RCV elections
Ranked choice voting (RCV) is seeing a renewed surge of interest in the United States. For example, New York City will use ranked choice voting for local offices starting this year. While political scientists have long been in the business of modeling elections, applications of statistical models to RCV election data are still rare. This presents an exciting opportunity for new research at the intersection of statistics and political science. In this project, we will be investigating how to model RCV elections in which different voter groups are competing to elect their preferred representatives. This is especially relevant to minority representation in local government, an area where many believe RCV can have a strong positive impact in the near future.
Robust Classification of High-dimensional Data with Fuzzy Group Information
In cancer research and genetic studies, it is important to identify, out of tens of thousands of genetic features, the potential genomic biomarkers that influence a certain phenotype. In this project, we will develop a robust penalized logistic regression model for simultaneous feature selection and classification with fuzzy group information in high-dimensional data settings. The proposed approach will be applied to gene expression data on pediatric acute myeloid leukemia (AML) prognosis. Ideally, students participating in the project should have a background in linear algebra and linear regression, with programming experience in R, Matlab or Python. It will be helpful for students considering this project to browse the references Tibshirani (1996), Yuan and Lin (2006), Zhu and Hastie (2007), Gao (2016), and Gao and Feng (2018).
Combined Tests for Experiments with Matched-Pairs and Independent Samples Data
In many applications a matched-pairs design is used to control for variability between subjects and allow for more precise treatment comparisons. However, in many instances, missing data occur due to the inability to obtain one of the measurements. In these situations, a mixture of complete and incomplete pairs of data will be available.
Several approaches have been proposed to incorporate information from incomplete pairs. However, there is no procedure that is superior under all conditions, and the effect of several factors on the performance of these methods is not well understood. In this project we will study one or more of the following questions/topics.
- What is the effect of unequal variances on the performance of the tests? The effect of unequal variance on the performance and properties of the proposed tests will be studied, and ways to mitigate negative effects of unequal variance investigated.
- Develop a robust test to assess the assumption of equal variance. Ways to develop an effective test of equal variance will be investigated, and the properties and performance of the test studied.
- Can a combined test statistic improve the power of the nonparametric tests? Ways to develop a test based on a combination of the test statistics will be considered and the properties and performance compared to those of the individual statistics.
2020
A Cluster-based Approach to Feature Detection in Persistence Diagrams
Topological data analysis, and in particular persistence diagrams, are gaining popularity as tools for extracting topological information from noisy point cloud and digital data. Persistence diagrams track topological features in the form of k-dimensional holes in the data. Here we construct an automated approach for identifying the features most likely due to noise so that they may be removed from estimates of the Betti numbers that give direct counts of holes of each dimension. This approach extends the established practice of using a lifespan cutoff on the features in order to take advantage of the observation that noisy features typically appear in clusters in the persistence diagram. We show that in many cases with high levels of noise, our method is an improvement over both a lifespan cutoff and the PD Thresholding technique. This work is motivated by 3-dimensional micro-CT imaging of ice core samples and is applicable for separating noise from signal in persistence diagrams from noisy data.
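The lifespan cutoff mentioned above, the baseline that the project extends, can be sketched in a few lines. A persistence diagram is represented here as a list of (dimension, birth, death) triples, a feature's lifespan is death minus birth, and features with a lifespan at or below the cutoff are treated as noise. The data layout, the example diagram, and the cutoff value are assumptions for illustration; this is not the cluster-based method of the project.

```python
from collections import Counter


def betti_estimates(diagram, cutoff: float) -> dict:
    """Count features per dimension whose lifespan exceeds the cutoff."""
    counts = Counter()
    for dim, birth, death in diagram:
        if death - birth > cutoff:  # long-lived features are kept as signal
            counts[dim] += 1
    return dict(counts)


# Hypothetical diagram: one long-lived component, one long-lived loop,
# and three short-lived features that behave like noise.
EXAMPLE_DIAGRAM = [
    (0, 0.0, 5.0),
    (0, 0.0, 0.2),
    (1, 1.0, 4.0),
    (1, 1.1, 1.3),
    (1, 2.0, 2.1),
]

if __name__ == "__main__":
    print(betti_estimates(EXAMPLE_DIAGRAM, cutoff=1.0))
```

The result depends heavily on the chosen cutoff, which is one reason automated alternatives are of interest.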
Topological Estimation of Image Data via Subsampling
We develop a novel statistical approach to estimate topological information from large, noisy images. Our main motivation is to measure pore microstructure in 3-dimensional X-ray micro-computed tomography (micro-CT) images of ice cores. The pore space in these samples is where gas can move and get trapped within the ice column and is of interest to climate scientists. While the field of topological data analysis offers tools (e.g. lifespan cutoff and PD Thresholding) for estimating topological information in noisy images, direct application of these techniques becomes infeasible as image size and noise levels grow. Our approach uses image subsampling to estimate the number of holes of a prescribed size range in a computationally feasible manner. In applications where holes naturally have a known size range on a smaller scale than the full image, this approach offers a means of estimating Betti numbers, or global counts of holes of various dimensions, via subsampling of the image.
Confidence Intervals and Improved Tests for Mixed Paired-Unpaired Data
We compare confidence intervals for permutation and asymptotic combined tests, and investigate the effect of ties on combined tests.
R Simulations of a Unified Mixed-Effects Model
This research focuses on R simulations based on the article “A Unified Mixed-Effects Model for Rare-Variant Association in Sequencing Studies.” The Mixed Effect Score Test (MiST) tests the association between a set of SNPs/genes and continuous or binary outcomes by including variant characteristic information and using (weighted) score statistics. Like other gene- or region-based tests, MiST evaluates the effects of multiple genetic variants in a gene or region together, increasing power when multiple variants in the group are associated with a given disease or trait. This analysis compares commonly used tests for rare-variant associations, including the Burden Test and the Sequence Kernel Association Test. We examined whether the MiST is more sensitive to type I error inflation. Our approach provides in-depth insight into the general testing framework of the MiST package. We used simulations under a wide range of scenarios to determine whether error distributions affect type I error. In particular, we consider three different distributions for our simulations: the normal, t, and gamma distributions. For each distribution, we compared type I error rates at varying significance levels. The study found that if the error is not normally distributed, this method results in an inflated type I error.
A Mixture Binary RRT Model with a Unified Measure of Privacy and Efficiency
In this study, we introduce a mixture binary Randomized Response Technique (RRT) model by combining the elements of the Greenberg Unrelated Question model and the Warner Indirect Question model. RRT models are very useful both at the data acquisition stage and at the data release stage because they provide respondent privacy and data security. We account for untruthful responding in the proposed model. A unified measure of model efficiency and respondent privacy is also presented. Finally, we present a simulation study to validate the theoretical findings.
Penalized Weighted Fuzzy Group Variable Selection with Applications in Multi-Omic Integrative Data Analysis
In this project, a high-dimensional penalized weighted regression model (PWR) is considered for simultaneous outlier detection, robust regression and variable selection. In real applications, the data can be irregular due to contamination from outliers or leverage points and the existence of heteroskedasticity. This phenomenon becomes more common in high-dimensional settings when a large number of predictors are collected. Because of the co-existence of high dimensionality and data contamination, simultaneous outlier detection and variable selection become important issues. This method has been successfully applied in genetics, for example in the analysis of copy number variation.
Eligibility Requirements
Minimum eligibility requirements include:
- Must be a US Citizen or Permanent Resident (Green card)
- Must have completed at least 30 College credits by the end of Spring 2023 with a minimum GPA of 3.0, preferably both math/stats and overall GPA.
- Should have completed the following courses:
- A probability and statistics course
- A Calculus course
- Some programming experience, such as SAS, R, Python. We will reinforce programming skills through our own workshops
Application
Statistics REU 2023 Application Form
Review of applications will begin on February 28, 2023, and will continue until all 10 slots are filled.
Email application and supporting documents to sngupta@uncg.edu
Funding
Selected participants will get:
- A stipend of \$600 per week for 10 weeks, for a total stipend of \$6,000.
- Travel support of \$1,100 per student. This would include funds for travel to regional or national meetings to present their results during the academic year following their REU participation. These monies will be available on a competitive basis, and those students with the most promising results, and without other funding, will be supported.
- Students will receive \$600 meal allowance + \$2,240 housing costs (\$32/night x 70 nights) + \$300 for working meals = \$3,140. Participants may use meal money to eat at on-campus dining halls, off-campus venues, or to purchase groceries. Housing is provided at competitive rates in university-owned fully furnished apartments on campus dormitories in single rooms with a shared lobby and fully furnished kitchen, where participants will have easy access to classrooms, libraries, laboratories, dining halls, and other campus facilities. Two working (common) meals are planned each week (one lunch, one supper) at an estimated cost of \$30 per student per week.