Events

Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data

Haiying Wang

University of Connecticut
Colloquia

When

Date: Wednesday, February 9, 2022
Time: 4:00 pm - 5:00 pm
Location: Virtual through Zoom
HaiYing Wang is an Associate Professor in the Department of Statistics at the University of Connecticut. He was an Assistant Professor in the Department of Mathematics and Statistics at the University of New Hampshire from 2013 to 2017. He obtained his Ph.D. from the Department of Statistics at the University of Missouri in 2013, and his M.S. from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences in 2006. His research interests include informative subdata selection for big data, model selection, model averaging, measurement error models, optimal design, and semi-parametric regression.

We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data. We first prove that, with imbalanced data, the available information about unknown parameters is only tied to the relatively small number of positive instances, which justifies the usage of negative sampling. However, if the negative instances are subsampled to the same level of the positive cases, there is information loss. To maintain more information, we derive the asymptotic distribution of a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance. To further improve the estimation efficiency over the IPW method, we propose a likelihood-based estimator by correcting log odds for the sampled data and prove that the improved estimator has the smallest asymptotic variance among a large class of estimators. It is also more robust to pilot misspecification. We validate our approach on simulated data as well as a real click-through rate dataset with more than 0.3 trillion instances, collected over a period of a month. Both theoretical and empirical results demonstrate the effectiveness of our method.