PCA-Boosted Autoencoders for Nonlinear Dimensionality Reduction in Low Data Regimes

Paper Title

PCA-Boosted Autoencoders for Nonlinear Dimensionality Reduction in Low Data Regimes

Authors

Muhammad Al-Digeil, Yuri Grinberg, Daniele Melati, Mohsen Kamandar Dezfouli, Jens H. Schmid, Pavel Cheben, Siegfried Janz, Dan-Xia Xu

Abstract

Autoencoders (AEs) provide a useful method for nonlinear dimensionality reduction but are ill-suited to low data regimes. Conversely, Principal Component Analysis (PCA) is data-efficient but is limited to linear dimensionality reduction, which poses a problem when data exhibits inherent nonlinearity. This presents a challenge in various scientific and engineering domains, such as nanophotonic component design, where data exhibits nonlinear features while being expensive to obtain due to costly real measurements or resource-consuming solutions of partial differential equations. To address this difficulty, we propose a technique that harnesses the best of both worlds: an autoencoder that leverages PCA to perform well on scarce nonlinear data. Specifically, we outline a numerically robust PCA-based initialization of the AE, which, together with the parameterized ReLU activation function, allows the training process to start from an exact PCA solution and improve upon it. A synthetic example is presented first to study the effects of data nonlinearity and size on the performance of the proposed method. We then evaluate our method on several nanophotonic component design problems where obtaining useful data is expensive. To demonstrate universality, we also apply it to tasks in other scientific domains: a benchmark breast cancer dataset and a gene expression dataset. We show that our proposed approach is substantially better than both PCA and randomly initialized AEs in the majority of the low-data regime cases we consider, or is at least comparable to the better of the other two methods.
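
The idea of starting training from an exact PCA solution can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the architecture or the details of the numerically robust initialization, so the single-hidden-layer layout, the helper names (`pca_init`, `init_autoencoder`, `forward`), and the choice of an initial slope of 1 for the parameterized ReLU are all assumptions made for illustration. With slope 1 the activation is the identity, so the untrained network reproduces the PCA reconstruction.

```python
import numpy as np

def prelu(x, alpha):
    # Parameterized ReLU: x where x >= 0, alpha * x elsewhere.
    return np.where(x >= 0, x, alpha * x)

def pca_init(X, k):
    # Hypothetical helper: data mean and top-k principal directions, shape (k, d).
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def init_autoencoder(X, k):
    # Assumed layout: one linear encoder, PReLU on the code, one linear decoder.
    mean, V = pca_init(X, k)
    return {
        "W_enc": V.T,          # (d, k): project onto the principal directions
        "b_enc": -mean @ V.T,  # (k,): folds the centering step into the bias
        "W_dec": V,            # (k, d): map the code back to input space
        "b_dec": mean,         # (d,): add the mean back
        "alpha": 1.0,          # assumption: slope 1 makes PReLU the identity
    }

def forward(params, X):
    code = prelu(X @ params["W_enc"] + params["b_enc"], params["alpha"])
    return code @ params["W_dec"] + params["b_dec"]

# Small synthetic check: 20 samples in 5 dimensions, 2-dimensional code.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5)) @ rng.normal(size=(5, 5))

params = init_autoencoder(X, k=2)
ae_recon = forward(params, X)

# Reference PCA reconstruction computed directly.
mean, V = pca_init(X, 2)
pca_recon = (X - mean) @ V.T @ V + mean

# Largest elementwise difference between the two reconstructions.
max_gap = np.abs(ae_recon - pca_recon).max()
```

In this sketch `max_gap` is at the level of floating-point round-off, meaning the network starts at the PCA solution. Training would then update the weights and `alpha` by gradient descent, so the nonlinearity is introduced only to the extent that the data supports it.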
