Multiple Data Splitting for FDR Controls

Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for regression models is a fundamental problem with a long history. Researchers have recently proposed the idea of creating “knockoff” variables to control the FDR. In contrast to creating knockoffs, a classical statistical idea is to introduce perturbations and examine their impacts, as in the bootstrap. Here we examine how the simple old idea of data splitting (DS) can be leveraged to control the FDR. As one might anticipate, a DS procedure simply estimates two independent coefficients for each feature from two half-sized datasets created by random splitting, and constructs a contrast statistic. FDR control can be achieved by exploiting the property that, for any null feature, the sampling distribution of this statistic is symmetric about 0. Furthermore, by repeated sample splitting, we propose Multiple Data Splitting (MDS) to stabilize the selection result and boost the power. Interestingly, MDS not only overcomes the power loss caused by data splitting while keeping the FDR under control, but also yields a lower variance for the estimated FDR than all other methods under consideration. We prove that both DS and MDS can control the FDR at the designated level asymptotically for canonical statistical models such as linear and Gaussian graphical models. DS and MDS are straightforward conceptually, easy to implement algorithmically, and efficient computationally. They do not require any knowledge of the joint distribution of the covariates and are rather robust to certain violations of the theoretically required tail condition (e.g., t-distributions). Simulation results as well as a real data application show that both DS and MDS control the FDR well, and MDS is often the most powerful method among all under consideration, especially when the signals are weak and correlations or partial correlations among the features are high.
Our preliminary tests on nonlinear models such as generalized linear models and neural networks also show promise. The presentation is based on joint work with Chenguang Dai, Buyu Lin, and Xin Xing.
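To make the DS procedure concrete, the following is a minimal numpy-only sketch under illustrative assumptions: ordinary least squares on each half (so it assumes n > p), one common form of the contrast ("mirror") statistic, sign(b1·b2)(|b1|+|b2|), and the standard data-driven threshold that bounds the estimated false discovery proportion at level q. The function names, the simulated data, and the specific choice of contrast statistic are illustrative, not the speaker's exact implementation.

```python
import numpy as np

def mirror_statistics(X1, y1, X2, y2):
    # Independent coefficient estimates from the two halves (OLS; assumes n > p)
    b1 = np.linalg.lstsq(X1, y1, rcond=None)[0]
    b2 = np.linalg.lstsq(X2, y2, rcond=None)[0]
    # One common contrast statistic: symmetric about 0 for null features,
    # tends to be large and positive for true signals
    return np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))

def select_fdr(M, q=0.1):
    # Data-driven cutoff: smallest t whose estimated FDP,
    # #{M_j <= -t} / max(#{M_j >= t}, 1), is at most q
    for t in np.sort(np.abs(M)):
        fdp = np.sum(M <= -t) / max(np.sum(M >= t), 1)
        if fdp <= q:
            return np.where(M >= t)[0]
    return np.array([], dtype=int)

# Toy linear model: first k features are the true signals
rng = np.random.default_rng(0)
n, p, k = 400, 20, 5
beta = np.zeros(p)
beta[:k] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# A single random split into two halves
idx = rng.permutation(n)
half1, half2 = idx[: n // 2], idx[n // 2 :]
M = mirror_statistics(X[half1], y[half1], X[half2], y[half2])
selected = select_fdr(M, q=0.1)
```

MDS would repeat the split-and-select step many times and aggregate the selections; the single-split version above is the building block.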

Prof. Jun Liu

Professor, Harvard University
November 20, 2020 at 11:45 AM, Zoom Webinar

Jun Liu is Professor of Statistics at Harvard University, with a joint appointment in the Harvard School of Public Health. Dr. Liu received his BS degree in mathematics in 1985 from Peking University and his Ph.D. in statistics in 1991 from the University of Chicago. He held Assistant, Associate, and full Professor positions at Stanford University from 1994 to 2003. Dr. Liu received the NSF CAREER Award in 1995 and the Mitchell Award in 2000. He was selected as a Medallion Lecturer in 2002, a Bernoulli Lecturer in 2004, and a Kuwait Lecturer at Cambridge University in 2008; he was elected Fellow of the Institute of Mathematical Statistics and Fellow of the American Statistical Association in 2004 and 2005, respectively. In 2002, he won the prestigious COPSS Presidents' Award (given annually to one individual under age 40). In 2010, he was awarded the Morningside Gold Medal in Applied Mathematics (given once every 3 years to an individual of Chinese descent under age 45). He was honored with the Outstanding Achievement Award in 2012 and the Pao-Lu Hsu Award (given once every 3 years) in 2016 by the International Chinese Statistical Association. In 2017, he was recognized with the Jerome Sacks Award for outstanding Cross-Disciplinary Research.

Dr. Liu and his collaborators introduced the statistical missing data formulation and Gibbs sampling strategies for biological sequence motif analysis in the early 1990s. The resulting algorithms for protein sequence alignment, gene regulation analysis, and genetic studies have been adopted by many researchers as standard computational biology tools. Dr. Liu has made fundamental contributions to statistical computing and Bayesian modeling. He pioneered sequential Monte Carlo (SMC) methods and invented novel Markov chain Monte Carlo (MCMC) techniques. His theoretical and methodological studies on SMC and MCMC algorithms have had a broad impact in many areas. Dr. Liu has also pioneered novel Bayesian modeling techniques for discovering nonlinear and interactive effects in high-dimensional data. Dr. Liu has served on numerous government grant review panels and editorial boards of leading statistical journals, including the co-editorship of JASA from 2011 to 2014. Dr. Liu has published more than 250 research articles in leading scientific journals, mentored more than 25 postdoctoral fellows, and graduated 30 Ph.D. students.

Interdisciplinary Distinguished Seminar Series

The Department of Electrical and Computer Engineering hosts a regularly scheduled seminar series with preeminent and leading researchers in the US and the world, to help promote North Carolina as a center of innovation and knowledge and to help safeguard its place as a leader in research.