Multiple Imputation with Massive Data: An Application to the Panel Study of Income Dynamics

Paper Title

Multiple Imputation with Massive Data: An Application to the Panel Study of Income Dynamics

Authors

Yajuan Si, Steve Heeringa, David Johnson, Roderick Little, Wenshuo Liu, Fabian Pfeffer, Trivellore Raghunathan

Abstract

Multiple imputation (MI) is a popular and well-established method for handling missing data in multivariate data sets, but its practicality for massive and complex data sets has been questioned. One such data set is the Panel Study of Income Dynamics (PSID), a longstanding and extensive survey of household income and wealth in the U.S. Missing data in this survey are currently handled with traditional hot deck methods because they are simple to implement; however, the univariate hot deck produces large random fluctuations in wealth. MI is effective but faces operational challenges. We use a sequential regression/chained-equation approach, implemented in the software IVEware, to multiply impute cross-sectional wealth data in the 2013 PSID, and we compare analyses of the resulting imputed data with those from the current hot deck approach. Practical difficulties, such as non-normally distributed variables, skip patterns, categorical variables with many levels, and multicollinearity, are described together with our approaches to overcoming them. We evaluate imputation quality and validity with internal diagnostics and external benchmarking data. MI improves on the existing hot deck approach by helping preserve correlation structures, such as the associations between PSID wealth components and the relationships between household net worth and socio-demographic factors, and by facilitating general-purpose completed-data analyses. MI incorporates highly predictive covariates into the imputation models and increases efficiency. We recommend the practical implementation of MI and expect greater gains when the fraction of missing information is large.
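
The contrast the abstract draws between a univariate hot deck and regression-based MI can be illustrated with a small sketch. This is not the authors' IVEware workflow and uses no PSID data: the variables `income` and `wealth`, the 40% missingness rate, the linear model, and all function names are assumptions made for illustration. It also simplifies proper MI by reusing the fitted regression coefficients for each draw instead of drawing them from their posterior.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def hot_deck(y, rng):
    """Univariate hot deck: replace each missing value with a random observed donor."""
    donors = [v for v in y if v is not None]
    return [rng.choice(donors) if v is None else v for v in y]


def regression_draw(x, y, rng):
    """One stochastic regression imputation of y given x (simplified MI draw)."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mx = statistics.fmean([a for a, _ in pairs])
    my = statistics.fmean([b for _, b in pairs])
    slope = (sum((a - mx) * (b - my) for a, b in pairs)
             / sum((a - mx) ** 2 for a, _ in pairs))
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in pairs)
    resid_sd = math.sqrt(rss / (len(pairs) - 2))
    # Add residual noise so imputed values keep realistic spread.
    return [intercept + slope * a + rng.gauss(0.0, resid_sd) if b is None else b
            for a, b in zip(x, y)]


# Synthetic data (assumption): wealth depends linearly on income plus noise.
rng = random.Random(2013)
n = 2000
income = [rng.gauss(0.0, 1.0) for _ in range(n)]
wealth_true = [2.0 * a + rng.gauss(0.0, 1.0) for a in income]
# Delete about 40% of wealth values completely at random.
wealth = [None if rng.random() < 0.4 else w for w in wealth_true]

corr_true = pearson(income, wealth_true)
corr_hot_deck = pearson(income, hot_deck(wealth, rng))

# M completed data sets; the point estimate is the average across them.
M = 5
corr_mi = statistics.fmean(
    [pearson(income, regression_draw(income, wealth, rng)) for _ in range(M)]
)

print(f"true: {corr_true:.3f}  hot deck: {corr_hot_deck:.3f}  MI: {corr_mi:.3f}")
```

In this toy setting the hot deck ignores income when choosing donors, so the income-wealth correlation in the completed data is attenuated, while the regression draws use income as a predictor and keep the correlation close to its complete-data value. A chained-equation procedure extends the same idea by cycling such conditional models over every variable with missing values.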
