May 23-July 29, 2022

Complex Data Analysis Using Statistical and Machine Learning Tools

To be held virtually

UNC Greensboro, one of the campuses of the University of North Carolina System, will offer a 10-week REU program from May 23 to July 29, 2022 for 9 nationally recruited undergraduate students in the mathematical sciences. The program is funded by NSF grant DMS-1950549.

The focus of the program will be on Complex Data Analysis using Statistical and Machine Learning Tools. The students will be divided into 4-5 research teams of up to 2 students, each headed by a team of faculty Mentors. Students will be able to choose from a wide range of Projects covering topics such as high-dimensional data analysis, subdata selection, machine learning, robust data analysis, and data confidentiality.

Emphasis during the training will be on both theory and applications. In addition to focused research on these topics, the program will offer participants broad professional development training through various workshops and invited lectures. The program will also include plenty of social activities, such as weekly virtual working lunches and occasional dinners (paid for by the program), and casual conversations on a variety of topics.

For information, contact Program Director Dr. Sat Gupta by email at sngupta@uncg.edu, or the Mathematics and Statistics Department at math_sci@uncg.edu.

Research Mentors

Sat Gupta (PI)- sngupta@uncg.edu

Dr. Sat Gupta will serve as PI for this project. Dr. Gupta is a Professor and Head of the Department of Mathematics and Statistics at UNC Greensboro. He has earned PhD degrees in both Mathematics and Statistics. He is a Fellow of the American Statistical Association and has won many awards, including UNC Greensboro’s Senior Research Excellence Award (2017). His main area of research is Survey Sampling, with particular interest in surveys involving sensitive topics. His 135+ journal articles include papers written with students at all levels, including undergraduate students. Dr. Gupta was the Site PI for another ASA REU program in the Summer of 2018.

http://www.uncg.edu/~sngupta/

Xiaoli Gao (Co-PI)- x_gao2@uncg.edu

Dr. Xiaoli Gao will serve as Co-PI for this project. She received both M.S. and PhD degrees in Statistics from the University of Iowa. Dr. Gao’s research explores both the theory of high-dimensional data analysis (HDDA) and its applications in biological and medical studies. In particular, she is interested in robust and complex HDDA, signal approximation, and HD shrinkage analysis. During her academic career, Dr. Gao has received several research grants, including, as principal investigator, a 5-year Simons Foundation Grant and a UNCG Strategic Seed Grant (Community-Engaged Research and Creative Activity). She was also a Site Co-PI for the UNCG ASA REU grant in 2018.

http://www.uncg.edu/~x_gao2/

John Stufken (Senior Personnel)- j_stufke@uncg.edu

Dr. John Stufken, a statistician, is Bank of America Excellence Professor and Director for the new MS degree program in Informatics and Analytics. He joined UNCG in 2019. His interests are in design of experiments, subdata selection, and data science. He is an elected Fellow of the American Statistical Association and of the Institute of Mathematical Statistics. He was Rothschild Distinguished Visiting Fellow at the Isaac Newton Institute of Mathematical Sciences in Cambridge, UK, and was the inaugural endowed Charles Wexler Professor of Statistics at Arizona State University. He provided leadership in statistics, at both the undergraduate and graduate levels, at Arizona State University (2014-19) and the University of Georgia (2003-14) as Coordinator of Statistics and Head of the Department of Statistics, respectively. Stufken’s research has been partially supported by the NSF throughout his career.

https://sites.google.com/view/john-stufken

Scott Richter (Senior Personnel)- sjricht2@uncg.edu

Dr. Scott Richter has a PhD in Statistics (Oklahoma State University). He is Director of the Statistical Consulting Center at UNCG. His research involves nonparametric methods, especially methods using resampling. Dr. Richter has received extramural support multiple times as Senior Personnel to train young researchers, including REU and UMB programs sponsored by NSF.

http://www.uncg.edu/~sjricht2/

Jianping Sun (Senior Personnel)- j_sun4@uncg.edu

Dr. Jianping Sun is a statistician who has been working at UNCG since August 2018. She gained postdoctoral and industry experience before joining UNCG. She has interests in both statistical methodology and applied research in analyzing high-dimensional complex genomic data.

http://mathstats.uncg.edu/people/directory/sun

Rakhi Singh (Senior Personnel)- r_singh5@uncg.edu

Dr. Rakhi Singh, a statistician, has been a postdoc at UNC Greensboro working with Dr. John Stufken since January 2020. She is interested in the design of experiments, subdata selection, and data science. Her research strikes a balance between statistical and computational efficiency by offering novel methods that are theoretically founded and computationally feasible. Prior to joining UNCG, she received her Ph.D. in Statistics in 2018 and did a year-long postdoc in Germany in 2019. During her predictive analytics industry experience at American Express, she created several state-of-the-practice models using machine learning techniques and helped the team implement these models, winning a Chairman’s Excellence Award for Innovation in 2013.

https://sites.google.com/view/singhrakhi

Thomas Weighill (Senior Personnel)- t_weighill@uncg.edu

Dr. Thomas Weighill joined the Mathematics and Statistics Department at UNCG in 2021. Before coming to UNCG, he completed a postdoc at the MGGG Redistricting Lab at Tufts University under the supervision of Moon Duchin. Dr. Weighill’s research uses topological and geometric methods in data science, with a particular focus on geographic and election data.

https://sites.google.com/view/thomasweighill/

Research Projects

2022 projects will be a continuation of the 2020/2021 projects listed below:

2021

A Computationally Efficient Method for Constructing Hierarchical Trees

Hierarchical trees are essential in genomic research because they provide major tools for scientists to study evolutionary history and detect disease-associated genomes. Rapidly developing sequencing technology, such as next generation sequencing, has enabled researchers to obtain whole-genome DNA sequences at a relatively low price. Hence there is an urgent need to develop statistical methods that can construct hierarchical trees from long DNA sequences while taking various biological complexities into account. In this project, we will develop a novel method for constructing hierarchical trees that simultaneously accounts for significant evolutionary factors, such as mutation and recombination. A computationally efficient algorithm will also be designed to implement the proposed methodology and make it practical when the sequence length is large.

Subdata Selection Methods

Data is everywhere and there is a huge amount of it. There are opportunities, more than ever, to answer relevant questions empirically by using the humongous amount of available data. For example, Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data. As another example, a transportation company like Uber has tremendous amounts of consumer preference data that it uses to predict supply, demand, location of drivers, and the fares set for every trip. Analyzing data of this size, if feasible at all, requires gigantic computational resources and the development of novel methods. But even for smaller-sized big data, depending on the available computational platform and how often an analysis or exploration needs to be performed, the computational burden can be considerable. For that reason, methods have been developed to conduct an analysis based on only some of the data, referred to as subdata. The research question that we are interested in answering is which subdata is good subdata when (a) prediction is the goal, and (b) a good representation of the population is the goal. In this project, we expect to come up with new methods of subdata selection while using big data techniques for analyzing the data from different subdata selection methods. Although statistical theory can (and should) be built around the procedures used, doing so is extremely hard, so we will try to answer most of the questions computationally. It is an ideal project for someone who has some knowledge of statistics and programming, has an interest in learning big data analysis techniques, and is passionate about coding (in R or Python, if you want to have access to some pre-existing code).
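To make the idea of subdata concrete, here is a minimal Python sketch, not the project's actual method. It simulates a simple linear model and compares the slope estimated from the full data with slopes estimated from two subdata sets of the same size: a uniform random sample, and a hypothetical rule that keeps the rows with the most extreme covariate values. The model, the sample sizes, and the function names are all assumptions made for illustration.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x for simple linear regression."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def uniform_subdata(n, k, rng):
    """Indices of k rows drawn uniformly at random without replacement."""
    return rng.sample(range(n), k)


def extreme_x_subdata(xs, k):
    """Indices of the k/2 smallest and k/2 largest x values (illustrative rule)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    half = k // 2
    return order[:half] + order[-half:]


# Simulated "full data" (assumed model: y = 2 + 3x + standard normal noise).
rng = random.Random(0)
n, k = 20000, 200
true_slope = 3.0
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [2.0 + true_slope * x + rng.gauss(0, 1) for x in xs]

full_slope = ols_slope(xs, ys)

idx_uniform = uniform_subdata(n, k, rng)
idx_extreme = extreme_x_subdata(xs, k)
slope_uniform = ols_slope([xs[i] for i in idx_uniform], [ys[i] for i in idx_uniform])
slope_extreme = ols_slope([xs[i] for i in idx_extreme], [ys[i] for i in idx_extreme])

print("full data:      ", round(full_slope, 3))
print("uniform subdata:", round(slope_uniform, 3))
print("extreme subdata:", round(slope_extreme, 3))
```

Under this simulated linear model, the extreme-value rule tends to give a less variable slope estimate than a uniform sample of the same size, but it relies on the model being correct, which is one reason the choice of subdata depends on the goal of the analysis.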

Untruthful Responding in Randomized Response Models

Randomized response technique (RRT) models are important survey tools when dealing with potentially sensitive questions with legal or social implications. These models allow respondents to provide scrambled (noise-added) responses which are later unscrambled at an aggregate level but not at an individual level. Some of the respondents may not trust the privacy protection provided by RRT models and may still provide untruthful responses. While there is no way to identify and fix the untruthful responses, one can measure the level of untruthfulness and account for it in the final estimates. The 2020 REU group considered this problem in the binary response setting. We plan to carry that study forward and consider a similar problem for quantitative response situations. Untruthful responding is part of the larger problem of Measurement Error in statistical estimation.
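As an illustration of how scrambled responses can be unscrambled only in aggregate, the following Python sketch simulates one classical binary design, Warner's model. Each respondent answers the sensitive question with probability p and its complement otherwise, so P(yes) = p·π + (1 − p)·(1 − π), which can be solved for the prevalence π. The prevalence, the value of p, and the sample size are assumed values, and the sketch assumes every respondent answers truthfully, so it does not model the untruthful responding studied in this project.

```python
import random


def warner_response(has_trait, p, rng):
    """One scrambled yes/no answer under Warner's design.

    With probability p the respondent answers "Do you have the trait?";
    otherwise they answer the complementary question "Do you lack the trait?".
    The interviewer never learns which question was answered.
    """
    if rng.random() < p:
        return has_trait
    return not has_trait


def warner_estimate(responses, p):
    """Unscramble at the aggregate level.

    Solves P(yes) = p*pi + (1 - p)*(1 - pi) for the prevalence pi.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 makes the prevalence unidentifiable")
    yes_rate = sum(responses) / len(responses)
    return (yes_rate - (1 - p)) / (2 * p - 1)


# Assumed values for illustration only.
rng = random.Random(1)
true_pi, p, n = 0.30, 0.7, 50000

traits = [rng.random() < true_pi for _ in range(n)]
responses = [warner_response(t, p, rng) for t in traits]
pi_hat = warner_estimate(responses, p)

print("estimated prevalence:", round(pi_hat, 3))
```

An individual "yes" reveals little about that respondent, because it is consistent with either question having been asked, while the aggregate yes rate still identifies π whenever p differs from 0.5.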

Topological Data Analysis and Firn

Topological data analysis (TDA) is a rising field at the intersection of Mathematics, Statistics, and Machine Learning. Techniques from this field have proven successful in analyzing a variety of scientific problems and datasets. The main driving force in TDA is the development of persistent homology, which studies the intrinsic shape of data. The main goal of the project is to direct TDA tools at understanding microstructure and fluid flow in porous media. The main application of this project is firn, a type of ice core data. The project is interdisciplinary. Students will join a team consisting of a mathematician, a statistician, and a climatologist to work on the gas age-ice age problem in climate science.

Ranked Data Analysis of RCV elections

Ranked choice voting (RCV) is seeing a renewed surge of interest in the United States. For example, New York City will use ranked choice voting for local offices starting this year. While political scientists have long been in the business of modeling elections, applications of statistical models to RCV election data are still rare. This presents an exciting opportunity for new research at the intersection of statistics and political science. In this project, we will be investigating how to model RCV elections in which different voter groups are competing to elect their preferred representatives. This is especially relevant to minority representation in local government, an area where many believe RCV can have a strong positive impact in the near future.

Robust Classification of high-dimensional data with fuzzy group information

In cancer research and genetic studies, it is important to identify potential genomic biomarkers among tens of thousands of genetic features that influence a certain phenotype. In this project, we will develop a robust penalized logistic regression model for simultaneous feature selection and classification with fuzzy group information in high-dimensional data settings. The proposed approach will be applied to gene expression data on pediatric acute myeloid leukemia (AML) prognosis. Ideally, students participating in the project should have a background in linear algebra and linear regression, with programming experience in R, Matlab, or Python. It will be helpful for students considering this project to browse the following references: Tibshirani (1996), Yuan and Lin (2006), Zhu and Hastie (2007), Gao (2016), Gao and Feng (2018).

Combined Tests for Experiments with Matched-Pairs and Independent Samples Data

In many applications a matched-pairs design is used to control for variability between subjects and allow for more precise treatment comparisons. However, in many instances, missing data occur due to the inability to obtain one of the measurements. In these situations, a mixture of complete and incomplete pairs of data will be available.

Several approaches have been proposed to incorporate information from incomplete pairs. However, there is no procedure that is superior under all conditions, and the effect of several factors on the performance of these methods is not well understood. In this project we will study one or more of the following questions/topics.

What is the effect of unequal variances on the performance of the tests? The effect of unequal variance on the performance and properties of the proposed tests will be studied, and ways to mitigate negative effects of unequal variance investigated.
Develop a robust test to assess the assumption of equal variance. Ways to develop an effective test of equal variance will be investigated, and the properties and performance of the test studied.
Can a combined test statistic improve the power of the nonparametric tests? Ways to develop a test based on a combination of the test statistics will be considered and the properties and performance compared to those of the individual statistics.

2020

A cluster-based approach to feature detection in persistence diagrams

Topological data analysis, and in particular persistence diagrams, are gaining popularity as tools for extracting topological information from noisy point cloud and digital data. Persistence diagrams track topological features in the form of $k$-dimensional holes in the data. Here we construct an automated approach for identifying the features most likely due to noise so that they may be removed from estimates of the Betti numbers that give direct counts of holes of each dimension. This approach extends the established practice of using a lifespan cutoff on the features in order to take advantage of the observation that noisy features typically appear in clusters in the persistence diagram. We show that in many cases with high levels of noise, our method is an improvement over both a lifespan cutoff and the PD Thresholding technique. This work is motivated by 3-dimensional micro-CT imaging of ice core samples and is applicable for separating noise from signal in persistence diagrams from noisy data.
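The baseline lifespan cutoff mentioned above can be sketched in a few lines of Python. This shows only the simple cutoff, not the cluster-based method developed in the project. The persistence diagram, the (dimension, birth, death) representation, and the cutoff value are hypothetical and chosen for illustration.

```python
from collections import Counter


def betti_estimates(diagram, cutoff):
    """Count features per dimension whose lifespan (death - birth) exceeds cutoff.

    Features with short lifespans are treated as noise and dropped, so the
    remaining counts serve as estimates of the Betti numbers.
    """
    counts = Counter()
    for dim, birth, death in diagram:
        if death - birth > cutoff:
            counts[dim] += 1
    return dict(counts)


# Hypothetical persistence diagram: (dimension, birth, death) triples.
diagram = [
    (0, 0.0, 5.0),   # long-lived connected component
    (0, 0.0, 0.2),   # short-lived, likely noise
    (0, 0.1, 0.25),  # short-lived, likely noise
    (1, 0.5, 3.0),   # long-lived loop
    (1, 0.6, 0.7),   # short-lived, likely noise
    (1, 1.0, 1.1),   # short-lived, likely noise
    (2, 2.0, 2.05),  # short-lived, likely noise
]

estimates = betti_estimates(diagram, cutoff=1.0)
print(estimates)
```

Dimensions in which no feature survives the cutoff are simply absent from the result, which corresponds to an estimated Betti number of zero.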

Topological estimation of image data via subsampling

We develop a novel statistical approach to estimate topological information from large, noisy images. Our main motivation is to measure pore microstructure in 3-dimensional X-ray micro-computed tomography (micro-CT) images of ice cores. The pore space in these samples is where gas can move and get trapped within the ice column and is of interest to climate scientists. While the field of topological data analysis offers tools (e.g. lifespan cutoff and PD Thresholding) for estimating topological information in noisy images, direct application of these techniques becomes infeasible as image size and noise levels grow. Our approach uses image subsampling to estimate the number of holes of a prescribed size range in a computationally feasible manner. In applications where holes naturally have a known size range on a smaller scale than the full image, this approach offers a means of estimating Betti numbers, or global counts of holes of various dimensions, via subsampling of the image.

Confidence Intervals and Improved Tests for Mixed Paired-Unpaired Data

We compare confidence intervals for permutation and asymptotic combined tests, and investigate the effect of ties on combined tests.

R Simulations of a Unified Mixed-Effects Model

This research focuses on R simulations based on the article “A Unified Mixed-Effects Model for Rare-Variant Association in Sequencing Studies.” The Mixed Effect Score Test (MiST) tests the association between a set of SNPs/genes and continuous or binary outcomes by including variant characteristic information and using (weighted) score statistics. Like other gene- or region-based tests, MiST evaluates the effects of multiple genetic variants in a gene or region, which increases power when multiple variants in the group are associated with a given disease or trait. This analysis compares commonly used tests for rare variant associations, including the Burden Test and the Sequence Kernel Association Test. We examined whether MiST is more sensitive to type I error inflation. Our approach provides in-depth insight into the general testing framework of the MiST package. We used simulations under a wide range of scenarios to determine whether the error distribution affects type I error. In particular, we considered three different distributions for our simulations: the normal, t, and gamma distributions. For each distribution, we compared type I error rates at varying significance levels. The study found that if the error is not normally distributed, then this method results in an inflated type I error rate.
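The general shape of such a type I error simulation can be sketched in Python as follows. This is a simplified stand-in, not the study's R code: it replaces the MiST score test with an ordinary regression slope test using a normal approximation, and the sample size, number of replications, and distribution parameters are assumptions. It shows how an empirical rejection rate under the null is computed for each error distribution and does not reproduce the study's findings.

```python
import random
from statistics import NormalDist


def draw_error(kind, rng):
    """One centered error term from the named distribution."""
    if kind == "normal":
        return rng.gauss(0.0, 1.0)
    if kind == "t":
        # Student t with 3 degrees of freedom: Z / sqrt(chi2_3 / 3),
        # where chi2_3 is Gamma(shape=1.5, scale=2).
        chi2 = rng.gammavariate(1.5, 2.0)
        return rng.gauss(0.0, 1.0) / (chi2 / 3.0) ** 0.5
    if kind == "gamma":
        # Gamma(shape=2, scale=1) has mean 2, so subtract it to center.
        return rng.gammavariate(2.0, 1.0) - 2.0
    raise ValueError(f"unknown error distribution: {kind}")


def slope_statistic(xs, ys):
    """Estimated slope divided by its standard error in simple linear regression."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope / se


def type1_rate(kind, alpha, n=50, reps=2000, seed=0):
    """Fraction of null datasets (true slope 0) in which the test rejects."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal critical value
    rejections = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [draw_error(kind, rng) for _ in range(n)]  # y does not depend on x
        if abs(slope_statistic(xs, ys)) > crit:
            rejections += 1
    return rejections / reps


rates = {kind: type1_rate(kind, alpha=0.05) for kind in ("normal", "t", "gamma")}
for kind, rate in rates.items():
    print(kind, rate)
```

A rate well above the nominal level for a given distribution would indicate type I error inflation for the test being simulated. Repeating the loop over several values of alpha gives the comparison across significance levels described above.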

A Mixture Binary RRT Model with A Unified Measure of Privacy and Efficiency

In this study, we introduce a mixture binary Randomized Response Technique (RRT) model by combining the elements of the Greenberg Unrelated Question model and the Warner Indirect Question model. RRT models are very useful both at the data acquisition stage and at the data release stage because they provide respondent privacy and data security. We account for untruthful responding in the proposed model. A unified measure of model efficiency and respondent privacy is also presented. Finally, we present a simulation study to validate the theoretical findings.

Penalized Weighted Fuzzy Group Variable Selection with Applications in Multi-Omic Integrative Data Analysis

In this project, a high-dimensional penalized weighted regression model (PWR) is considered for simultaneous outlier detection, robust regression and variable selection. In real applications, the data can be irregular due to the contamination from outliers or leverage points and the existence of heteroskedasticity. This phenomenon becomes more common in high-dimensional settings when a large number of predictors are collected. Because of the co-existence of high-dimensionality and data contamination, simultaneous outlier detection and variable selection become important issues. This method has been successfully applied in genetics such as in copy number variation.

Schedule

REU 2022 Schedule

Eligibility Requirements

Minimum eligibility requirements include:

Must be a US Citizen or Permanent Resident (Green card)
Must have completed at least 30 college credits by the end of Spring 2022 with a minimum GPA of 3.0, preferably in both math/stats courses and overall.
Should have the following preparation:
- A probability and statistics course
- A Calculus course
- Some programming experience, such as SAS, R, Python. We will reinforce programming skills through our own workshops

Application

Statistics REU 2022 Application Form

Deadline: February 28, 2022

Email application and supporting documents to sngupta@uncg.edu

Funding

Selected participants will get:

A stipend of \$600 per week for 10 weeks, for a total stipend of \$6000.
Travel support of \$1000 per student. This would include funds for travel to regional or national meetings to present their results during the academic year following their REU participation. These monies will be available on a competitive basis, and those students with the most promising results, and without other funding, will be supported.

REU Site in Computational Statistics