Machine Learning for Causal Inference: Is a Nonlinear First Stage Really Forbidden in 2SLS?
Prof. Jing Peng
Associate Professor of Operations and Information Management
School of Business
University of Connecticut
The application of machine learning (ML) in causal inference has garnered significant attention from researchers. A particular focus lies in the integration of ML into two-stage least squares (2SLS), a cornerstone methodology for causal inference. While ML can significantly reduce the prediction error in the first stage, a major hurdle arises due to the concept of forbidden regression. Specifically, a nonlinear first stage is commonly deemed forbidden because the potential lack of orthogonality between the prediction and prediction error may lead to inconsistent estimates. To investigate the applicability of ML in 2SLS, this paper decomposes the bias of 2SLS into an observable bias and an unobservable bias, without specifying the functional form of the first stage or assuming the validity of the proposed instrument. Analytical results and extensive simulations show that while a linear prediction can ensure a zero observable bias, it may result in a substantial unobservable bias, especially when the instrument is weak or not strictly exogenous. Conversely, by utilizing constrained or orthogonalized ML predictions, it is possible, and even guaranteed under certain conditions, to reduce the unobservable bias without introducing an observable bias. This research establishes crucial theoretical foundations for the integration of ML into 2SLS.
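The orthogonality concern behind the forbidden regression can be illustrated with a minimal simulation. This is only a sketch under assumed conditions, not the decomposition or estimator proposed in the paper: the data-generating process, the fixed nonlinear function standing in for an ML prediction, and the helper `sample_cov` are all hypothetical. It checks that an OLS first stage yields fitted values uncorrelated with their own residuals in sample, that an arbitrary nonlinear prediction generally does not, and that regressing the endogenous variable on the nonlinear prediction (a generic textbook way to orthogonalize it) restores the property.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical simulated data: instrument z and endogenous regressor x,
# where x depends on z nonlinearly.
z = rng.normal(size=n)
x = np.exp(0.5 * z) + rng.normal(size=n)


def sample_cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return float(np.mean((a - a.mean()) * (b - b.mean())))


# Linear first stage: OLS of x on [1, z].
Z = np.column_stack([np.ones(n), z])
beta, *_ = np.linalg.lstsq(Z, x, rcond=None)
x_hat_lin = Z @ beta
cov_lin = sample_cov(x_hat_lin, x - x_hat_lin)

# Stand-in for an ML prediction: a fixed, deliberately imperfect
# nonlinear function of z (assumption, chosen for illustration only).
x_hat_nl = np.exp(0.8 * z)
cov_nl = sample_cov(x_hat_nl, x - x_hat_nl)

# Generic orthogonalization: regress x on [1, x_hat_nl] and use the
# fitted values, so the new residual is orthogonal by construction.
G = np.column_stack([np.ones(n), x_hat_nl])
gamma, *_ = np.linalg.lstsq(G, x, rcond=None)
x_hat_orth = G @ gamma
cov_orth = sample_cov(x_hat_orth, x - x_hat_orth)

print(f"linear first stage:        cov = {cov_lin:.2e}")
print(f"raw nonlinear prediction:  cov = {cov_nl:.2e}")
print(f"orthogonalized prediction: cov = {cov_orth:.2e}")
```

In this setup the first and third covariances are zero up to floating-point error, while the second is clearly nonzero. Whether the orthogonalized prediction also reduces what the abstract calls the unobservable bias is the question the talk addresses, and this sketch does not settle it.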
Jing Peng is an Associate Professor of Operations and Information Management at the School of Business, University of Connecticut. He received his Ph.D. from the Wharton School, University of Pennsylvania. His research interests focus on e-commerce, social media, the gig economy, digital health, and human-AI interaction. His work has appeared in Information Systems Research, Journal of Marketing Research, Management Science, MIS Quarterly, and other outlets. He is very active in developing novel econometric methods and has contributed three R packages on methodologies to CRAN. His research has won multiple best paper awards. He is a recipient of the INFORMS Information Systems Society Gordon B. Davis Young Scholar Award.