MAY 24 – JULY 30, 2021

Complex Data Analysis Using Statistical and Machine Learning Tools

Given the continued uncertainty surrounding resumption of face-to-face campus activities, the program will be run virtually, just like our highly successful program in the Summer of 2020, which included 9 students and 5 mentors.

UNC Greensboro, one of the campuses of the University of North Carolina System, will offer a 10-week REU program from May 24 – July 30, 2021 for 8 nationally recruited undergraduate students from the mathematical sciences. The program is funded by NSF grant DMS-1950549.

The focus of the program will be on Complex Data Analysis using Statistical and Machine Learning Tools. The eight students will be divided into 4-5 research teams of up to 2 students, each headed by a team of faculty Mentors. Students will be able to choose from a wide range of Projects covering topics such as high dimensional data analysis, subdata selection, machine learning, robust data analysis, and data confidentiality.

Emphasis during the training will be on both theory and applications. In addition to focused research on these topics, the program will offer participants broad professional development training through various workshops and invited lectures. The program will also include plenty of social activities, such as weekly virtual working dinners (paid for by the program) and casual conversations on a variety of topics.

For information, contact Program Director Dr. Sat Gupta by email at sngupta@uncg.edu, or the Mathematics and Statistics Department at math_sci@uncg.edu or kryoung3@uncg.edu. You may also call Katelyn at 336-334-5836.

Research Mentors

Dr. Sat Gupta (PI)- sngupta@uncg.edu

Dr. Sat Gupta will serve as PI for this project. Dr. Gupta is a Professor and Head of the Department of Mathematics and Statistics at UNC Greensboro. He has earned PhD degrees in both Mathematics and Statistics. He is a Fellow of the American Statistical Association and has won many awards, including UNC Greensboro’s Senior Research Excellence Award (2017). His main area of research is Survey Sampling, with particular interest in surveys involving sensitive topics. His 135+ journal articles include papers with students at all levels, including undergraduate students. Dr. Gupta was the Site PI for another ASA REU program in the Summer of 2018.

http://www.uncg.edu/~sngupta/

Dr. Xiaoli Gao (Co-PI)- x_gao2@uncg.edu

Dr. Xiaoli Gao will serve as Co-PI for this project. She received both M.S. and PhD degrees in Statistics from the University of Iowa. Dr. Gao’s research explores both the theory of high-dimensional data analysis (HDDA) and its applications in biological and medical studies. In particular, she is interested in robust and complex HDDA, signal approximation, and HD shrinkage analysis. During her academic career, Dr. Gao has received several research grants, including, as principal investigator, a 5-year Simons Foundation Grant and a UNCG Strategic Seed Grant (Community-Engaged Research and Creative Activity). She was also a Site Co-PI for the UNCG ASA REU grant in 2018.

http://www.uncg.edu/~x_gao2/

John Stufken (Senior Personnel)- j_stufke@uncg.edu

Dr. John Stufken, a statistician, is Bank of America Excellence Professor and Director for the new MS degree program in Informatics and Analytics. He joined UNCG in 2019. His interests are in design of experiments, subdata selection, and data science. He is an elected Fellow of the American Statistical Association and of the Institute of Mathematical Statistics. He was Rothschild Distinguished Visiting Fellow at the Isaac Newton Institute of Mathematical Sciences in Cambridge, UK, and was the inaugural endowed Charles Wexler Professor of Statistics at Arizona State University. He provided leadership in statistics, at both the undergraduate and graduate levels, at Arizona State University (2014-19) and the University of Georgia (2003-14) as Coordinator of Statistics and Head of the Department of Statistics, respectively. Stufken’s research has been partially supported by the NSF throughout his career.

https://sites.google.com/uncg.edu/johnstufken/home

Scott Richter (Senior Personnel)- sjricht2@uncg.edu

Dr. Scott Richter has a PhD in Statistics (Oklahoma State University). He is Director of the Statistical Consulting Center at UNCG. His research involves nonparametric methods, especially methods using resampling. Dr. Richter has received extramural support multiple times as Senior Personnel to train young researchers, including REU and UMB programs sponsored by NSF.

http://www.uncg.edu/~sjricht2/

Jianping Sun (Senior Personnel)- j_sun4@uncg.edu

Dr. Jianping Sun is a statistician who has been working at UNCG since August 2018. She gained postdoctoral and industry experience before joining UNCG. Her interests span both statistical methodology and applied research in analyzing high-dimensional complex genomic data.

http://mathstats.uncg.edu/people/directory/sun

Rakhi Singh (Senior Personnel)- r_singh5@uncg.edu

Dr. Rakhi Singh received her Ph.D. in Statistics from IITB-Monash Research Academy in 2018, and then held a postdoc position at the Technical University of Dortmund, Germany. She is currently a postdoc at UNC Greensboro working with Dr. John Stufken. Her research interests include finding optimal designs for several choice experiment setups, supersaturated designs, coverings, etc. She is also interested in pursuing research in big data analytics, including sampling and missing data problems. She also has industry experience working in and managing a predictive analytics team for American Express. During that time, she created several state-of-the-practice models using machine learning techniques and helped the team implement these models.

https://sites.google.com/view/singhrakhi

Thomas Weighill (Senior Personnel)- t_weighill@uncg.edu

Dr. Thomas Weighill joined the Mathematics and Statistics Department at UNCG in 2021. Before coming to UNCG, he completed a postdoc at the MGGG Redistricting Lab at Tufts University under the supervision of Moon Duchin. Dr. Weighill’s research uses topological and geometric methods in data science, with a particular focus on geographic and election data.

https://sites.google.com/view/thomasweighill/

Research Projects

These are brief ideas. Actual projects may vary a little bit.

Jianping Sun
A Computationally Efficient Method for Constructing Hierarchical Trees

Hierarchical trees are essential in genomic research because they provide major tools for scientists to study evolutionary history and detect disease-associated genomes. Rapidly developing sequencing technology, such as next generation sequencing, has enabled researchers to obtain whole genome DNA sequences at a relatively low price. Hence there is an urgent need to develop statistical methods that can construct hierarchical trees from long DNA sequences while taking various biological complexities into account. In this project, we will develop a novel method for constructing hierarchical trees that accounts simultaneously for significant evolutionary factors, such as mutation and recombination. A computationally efficient algorithm will also be designed to implement the proposed methodology and make it practical for long sequences.

John Stufken & Rakhi Singh
Subdata Selection Methods

Data is everywhere and there is a huge amount of it. There are opportunities, more than ever, to answer relevant questions empirically by using the humongous amount of available data. For example, Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data. As another example, a transportation company like Uber has tremendous amounts of consumer preference data that it uses to predict supply, demand, location of drivers, and the fares set for every trip. Analyzing data of this size, if feasible at all, requires gigantic computational resources and the development of novel methods. But even for smaller sized big data, depending on the available computational platform and how often an analysis or exploration needs to be performed, the computational burden can be considerable. For that reason, methods have been developed to conduct an analysis based on only some of the data, referred to as subdata. The research question that we are interested in answering is which subdata is good subdata when (a) prediction is the goal, and (b) a good representation of the population is the goal. In this project, we expect to come up with new methods of subdata selection while using big data techniques to analyze the data obtained from different subdata selection methods. Even though statistical theory can be (and should be) built around the procedures used, doing so is extremely hard, and we will try to answer most of the questions computationally. It is an ideal project for someone who has some knowledge of statistics and programming, has an interest in learning big data analysis techniques, and is passionate about coding (in R or Python, if you want to have access to some pre-existing code).
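To give a flavor of the kind of computational comparison described above, here is a minimal Python sketch. It is only an illustration under stated assumptions: the simulated data, the single-covariate linear model, and the "extreme values" selection rule are hypothetical choices made for this example, not the methods the project will necessarily use. It compares a uniformly random subdata set with one made of the rows with the smallest and largest covariate values.

```python
import random

def ols_slope(xs, ys):
    """Return the least-squares slope and the spread Sxx of the covariate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxx

rng = random.Random(0)
n, k = 100_000, 200          # full data size and subdata size (assumed values)
true_slope = 2.0             # assumed data-generating slope

xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [1.0 + true_slope * x + rng.gauss(0.0, 1.0) for x in xs]

# Subdata 1: k rows chosen uniformly at random.
random_idx = rng.sample(range(n), k)

# Subdata 2: the k/2 rows with the smallest x and the k/2 rows with the largest x.
order = sorted(range(n), key=lambda i: xs[i])
extreme_idx = order[: k // 2] + order[-(k // 2):]

def fit(indices):
    return ols_slope([xs[i] for i in indices], [ys[i] for i in indices])

slope_random, sxx_random = fit(random_idx)
slope_extreme, sxx_extreme = fit(extreme_idx)

print(f"random subdata : slope={slope_random:.3f}, Sxx={sxx_random:.1f}")
print(f"extreme subdata: slope={slope_extreme:.3f}, Sxx={sxx_extreme:.1f}")
```

Under this model the variance of the slope estimate is inversely proportional to Sxx, so the subdata with the larger Sxx gives the more precise slope. Whether such a rule is also good for prediction or for representing the population is exactly the kind of question the project asks.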

Sat Gupta
Untruthful Responding in Randomized Response Models

Randomized response technique (RRT) models are important survey tools when dealing with potentially sensitive questions with legal or social implications. These models allow respondents to provide scrambled (noise-added) responses which are later unscrambled at an aggregate level but not at an individual level. Some of the respondents may not trust the privacy protection provided by RRT models and may still provide untruthful responses. While there is no way to identify and fix the untruthful responses, one can measure the level of untruthfulness and account for it in the final estimates. The 2020 REU group considered this problem in the binary response setting. We plan to carry that study forward and consider a similar problem for quantitative response situations. Untruthful responding is part of the larger problem of Measurement Error in statistical estimation.
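As a rough illustration of how scrambled responses can be unscrambled in aggregate, the following Python sketch simulates a simple additive scrambling model for a quantitative response. The model, the noise distribution, and all parameter values are assumptions chosen for this example and are not necessarily those studied in the project; untruthful responding is not modeled here.

```python
import random
import statistics

def scramble(true_values, noise_mean, noise_sd, rng):
    """Each respondent reports the true value plus privately drawn noise."""
    return [y + rng.gauss(noise_mean, noise_sd) for y in true_values]

def estimate_mean(scrambled, noise_mean):
    """Unscramble at the aggregate level by subtracting the known noise mean."""
    return statistics.fmean(scrambled) - noise_mean

rng = random.Random(2021)

# Hypothetical sensitive quantity for 20,000 respondents.
true_values = [rng.gauss(50.0, 10.0) for _ in range(20_000)]

# The noise distribution (mean 5, sd 20) is known to the researcher,
# but individual noise draws are not.
scrambled = scramble(true_values, noise_mean=5.0, noise_sd=20.0, rng=rng)

estimate = estimate_mean(scrambled, noise_mean=5.0)
true_mean = statistics.fmean(true_values)
print(f"true mean = {true_mean:.2f}, RRT estimate = {estimate:.2f}")
```

Individual reports reveal little about any one respondent because the noise is large, yet the sample mean of the reports, corrected by the known noise mean, still estimates the population mean.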

Yu-Min Chung
Topological Data Analysis and Firn

Topological data analysis (TDA) is a rising field at the intersection of Mathematics, Statistics, and Machine Learning. Techniques from this field have proven successful in analyzing a variety of scientific problems and datasets. The main driving force in TDA is the development of persistent homology, which studies the intrinsic shape of data. The main goal of the project is to direct TDA tools at understanding microstructure and fluid flow in porous media. The main application of this project is Firn, a type of ice core data. The project is interdisciplinary. Students will join a team consisting of a mathematician, a statistician, and a climatologist to work on the gas age-ice age problem in climate science.

Thomas Weighill
Ranked Data Analysis of RCV elections

Ranked choice voting (RCV) is seeing a renewed surge of interest in the United States. For example, New York City will use ranked choice voting for local offices starting this year. While political scientists have long been in the business of modeling elections, applications of statistical models to RCV election data are still rare. This presents an exciting opportunity for new research at the intersection of statistics and political science. In this project, we will be investigating how to model RCV elections in which different voter groups are competing to elect their preferred representatives. This is especially relevant to minority representation in local government, an area where many believe RCV can have a strong positive impact in the near future.

Xiaoli Gao
Robust Classification of high-dimensional data with fuzzy group information

In cancer research and genetic studies, it is important to identify, out of tens of thousands of genetic features, the potential genomic biomarkers that influence a given phenotype. In this project, we will develop a robust penalized logistic regression model for simultaneous feature selection and classification with fuzzy group information in high-dimensional data settings. The proposed approach will be applied to gene expression data on pediatric acute myeloid leukemia (AML) prognosis. Ideally, students participating in the project should have a background in linear algebra and linear regression, with programming experience in R, Matlab or Python. It will be helpful for students considering this project to browse the following references: Tibshirani (1996), Yuan and Lin (2006), Zhu and Hastie (2007), Gao (2016), Gao and Feng (2018).

Scott Richter
Combined Tests for Experiments with Matched-Pairs and Independent Samples Data

In many applications a matched-pairs design is used to control for variability between subjects and allow for more precise treatment comparisons. However, in many instances, missing data occur due to the inability to obtain one of the measurements. In these situations, a mixture of complete and incomplete pairs of data will be available.

Several approaches have been proposed to incorporate information from incomplete pairs. However, there is no procedure that is superior under all conditions, and the effect of several factors on the performance of these methods is not well understood. In this project we will study one or more of the following questions/topics.

- What is the effect of unequal variances on the performance of the tests? The effect of unequal variance on the performance and properties of the proposed tests will be studied, and ways to mitigate negative effects of unequal variance investigated.
- Develop a robust test to assess the assumption of equal variance. Ways to develop an effective test of equal variance will be investigated, and the properties and performance of the test studied.
- Can a combined test statistic improve the power of the nonparametric tests? Ways to develop a test based on a combination of the test statistics will be considered and the properties and performance compared to those of the individual statistics.
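To make the idea of combining information from complete and incomplete pairs concrete, here is a minimal Python sketch. It is an illustration only: the simulated data, the use of a paired t-statistic and a Welch two-sample t-statistic, and the square-root-of-sample-size weights are all assumptions made for this example, not the procedures the project will study.

```python
import math
import random
import statistics

def paired_t(before, after):
    """Paired t-statistic computed from the complete pairs."""
    diffs = [a - b for a, b in zip(after, before)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def welch_t(group_a, group_b):
    """Welch two-sample t-statistic (group_b minus group_a), allowing unequal variances."""
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    return (statistics.fmean(group_b) - statistics.fmean(group_a)) / se

def weighted_z(stats, weights):
    """Weighted combination of statistics, each treated as roughly standard normal under the null."""
    return sum(w * s for w, s in zip(weights, stats)) / math.sqrt(sum(w * w for w in weights))

rng = random.Random(7)
shift = 2.0  # assumed treatment effect

# 30 complete pairs sharing a subject-level effect.
subject = [rng.gauss(0.0, 1.0) for _ in range(30)]
before = [s + rng.gauss(0.0, 1.0) for s in subject]
after = [s + shift + rng.gauss(0.0, 1.0) for s in subject]

# 20 subjects with only a "before" value and 20 with only an "after" value.
only_before = [rng.gauss(0.0, 1.4) for _ in range(20)]
only_after = [shift + rng.gauss(0.0, 1.4) for _ in range(20)]

t_paired = paired_t(before, after)
t_unpaired = welch_t(only_before, only_after)
combined = weighted_z([t_paired, t_unpaired], [math.sqrt(30), math.sqrt(40)])
print(f"paired={t_paired:.2f}, unpaired={t_unpaired:.2f}, combined={combined:.2f}")
```

How to choose the weights, and how such a combination behaves under unequal variances or with nonparametric component tests, are open questions of the kind listed above.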

Schedule

Statistics REU Schedule 2021

Eligibility Requirements

Minimum eligibility requirements include:

Must be a US Citizen or Permanent Resident (Green card)
Must have completed at least 30 college credits by the end of Spring 2021 with a minimum GPA of 3.0, preferably in both math/stats courses and overall.
Should have completed the following courses:
- A probability and statistics course
- A Calculus course
- Some programming experience, such as SAS, R, or Python. We will reinforce programming skills through our own workshops.

Application

Statistics REU 2021 Application Form

For full consideration, please email this completed application as soon as possible, and no later than February 28, 2021, to Dr. Sat Gupta at sngupta@uncg.edu.

Funding

Selected participants will get:

A stipend of \$600 per week for 10 weeks, for a total stipend of \$6,000.
Travel support of \$1,000 per student. This would include funds for travel to regional or national meetings to present their results during the academic year following their REU participation. These monies will be available on a competitive basis, and those students with the most promising results, and without other funding, will be supported.

REU Site in Computational Statistics