Paper title
Selecting the number of components in PCA via random signflips
Paper authors
Paper abstract
Principal component analysis (PCA) is a foundational tool in modern data analysis, and a crucial step in PCA is selecting the number of components to keep. However, classical selection methods (e.g., scree plots, parallel analysis, etc.) lack statistical guarantees in the increasingly common setting of large-dimensional data with heterogeneous noise, i.e., where each entry may have a different noise variance. Moreover, it turns out that these methods, which are highly effective for homogeneous noise, can fail dramatically for data with heterogeneous noise. This paper proposes a new method called signflip parallel analysis (FlipPA) for the setting of approximately symmetric noise: it compares the data singular values to those of "empirical null" matrices generated by flipping the sign of each entry randomly with probability one-half. We develop a rigorous theory for FlipPA, showing that it has nonasymptotic type I error control and that it consistently selects the correct rank for signals rising above the noise floor in the large-dimensional limit (even when the noise is heterogeneous). We also rigorously explain why classical permutation-based parallel analysis degrades under heterogeneous noise. Finally, numerical simulations and an application to data from astronomy illustrate that FlipPA compares favorably to state-of-the-art methods.
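The comparison described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the function name `signflip_rank_estimate`, the number of null trials, the quantile level, and the sequential stopping rule (keep component k while its singular value exceeds the chosen quantile of the k-th singular values of the signflipped matrices) are all hypothetical choices made here for illustration. The synthetic example uses a rank-2 signal with row-dependent noise variance as a stand-in for heterogeneous noise.

```python
import numpy as np


def signflip_rank_estimate(X, n_trials=20, quantile=1.0, seed=0):
    """Hypothetical helper: estimate the number of components by comparing
    the singular values of X with those of randomly signflipped copies."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    data_sv = np.linalg.svd(X, compute_uv=False)

    # Each row holds the singular values of one "empirical null" matrix.
    null_sv = np.empty((n_trials, data_sv.size))
    for t in range(n_trials):
        # Flip the sign of every entry independently with probability 1/2.
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_sv[t] = np.linalg.svd(signs * X, compute_uv=False)

    # Per-component threshold: the chosen quantile across the null trials
    # (quantile=1.0 means the maximum; this level is an assumption).
    thresholds = np.quantile(null_sv, quantile, axis=0)

    # Assumed sequential rule: stop at the first component that does not
    # exceed its threshold.
    k = 0
    while k < data_sv.size and data_sv[k] > thresholds[k]:
        k += 1
    return k


# Synthetic example: rank-2 signal plus noise whose standard deviation
# varies from row to row (heterogeneous noise).
rng = np.random.default_rng(42)
n, p = 200, 100
U, _ = np.linalg.qr(rng.standard_normal((n, 2)))
V, _ = np.linalg.qr(rng.standard_normal((p, 2)))
signal = U @ np.diag([400.0, 300.0]) @ V.T
row_sd = np.linspace(0.5, 2.0, n)[:, None]
noise = row_sd * rng.standard_normal((n, p))

estimated_rank = signflip_rank_estimate(signal + noise)
print("estimated number of components:", estimated_rank)
```

Signflipping leaves each entry's magnitude, and hence its noise variance, in place while destroying the low-rank structure, whereas a permutation moves entries between positions with different variances. The signal strengths above are deliberately far above the noise level so that the sketch is easy to check; behavior near the noise floor depends on the quantile and trial count, which are not specified in the abstract.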