Paper Title
The More Data, the Better? Demystifying Deletion-Based Methods in Linear Regression with Missing Data
Authors
Abstract
We compare two deletion-based methods for dealing with the problem of missing observations in linear regression analysis. One is the complete-case analysis (CC, or listwise deletion) that discards all incomplete observations and only uses common samples for ordinary least-squares estimation. The other is the available-case analysis (AC, or pairwise deletion) that utilizes all available data to estimate the covariance matrices and applies these matrices to construct the normal equation. We show that the estimates from both methods are asymptotically unbiased and further compare their asymptotic variances in some typical situations. Surprisingly, using more data (i.e., AC) does not necessarily lead to better asymptotic efficiency in many scenarios. Missing patterns, covariance structure and true regression coefficient values all play a role in determining which is better. We further conduct simulation studies to corroborate the findings and demystify what has been missed or misinterpreted in the literature. Some detailed proofs and simulation results are available in the online supplemental materials.
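To make the two estimators concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes missing values are coded as NaN, that the model includes an intercept, and that the AC variant centers each pairwise covariance entry with means computed over the rows observed for that pair (other AC conventions exist). The function names `complete_case_ols`, `pairwise_cov`, and `available_case_ols` are hypothetical, as is the simulated data.

```python
import numpy as np

def complete_case_ols(X, y):
    """CC (listwise deletion): drop every row with any NaN, then run OLS."""
    keep = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    design = np.column_stack([np.ones(keep.sum()), X[keep]])
    beta, *_ = np.linalg.lstsq(design, y[keep], rcond=None)
    return beta  # [intercept, slope_1, ..., slope_p]

def pairwise_cov(Z):
    """Covariance matrix whose (j, k) entry uses all rows where both
    column j and column k are observed (assumed pairwise convention)."""
    p = Z.shape[1]
    S = np.empty((p, p))
    for j in range(p):
        for k in range(j, p):
            ok = ~np.isnan(Z[:, j]) & ~np.isnan(Z[:, k])
            zj, zk = Z[ok, j], Z[ok, k]
            S[j, k] = S[k, j] = np.mean((zj - zj.mean()) * (zk - zk.mean()))
    return S

def available_case_ols(X, y):
    """AC (pairwise deletion): solve the normal equation S_xx b = S_xy
    built from pairwise covariance estimates."""
    p = X.shape[1]
    Z = np.column_stack([X, y])
    S = pairwise_cov(Z)
    slopes = np.linalg.solve(S[:p, :p], S[:p, p])
    means = np.nanmean(Z, axis=0)  # each mean uses all observed values
    intercept = means[p] - means[:p] @ slopes
    return np.concatenate([[intercept], slopes])

# Illustrative data: two correlated predictors, with values of the second
# predictor deleted completely at random in about 30% of rows.
rng = np.random.default_rng(0)
n = 2000
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)
X[rng.random(n) < 0.3, 1] = np.nan

print("CC estimate:", complete_case_ols(X, y))
print("AC estimate:", available_case_ols(X, y))
```

With fully observed data the two functions return the same coefficients, so any difference between them comes entirely from how the incomplete rows are used. Repeating the simulation over many seeds and comparing the empirical variances of the two estimates is one way to explore the efficiency comparison described in the abstract.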