Supervised Visualization for Data Exploration

Paper Title

Supervised Visualization for Data Exploration

Authors

Jake S. Rhodes, Adele Cutler, Guy Wolf, Kevin R. Moon

Abstract

Dimensionality reduction is often used as an initial step in data exploration, either as preprocessing for classification or regression or for visualization. Most dimensionality reduction techniques to date are unsupervised; they do not take class labels into account (e.g., PCA, MDS, t-SNE, Isomap). Such methods require large amounts of data and are often sensitive to noise that may obfuscate important patterns in the data. Various attempts at supervised dimensionality reduction methods that take into account auxiliary annotations (e.g., class labels) have been successfully implemented with goals of increased classification accuracy or improved data visualization. Many of these supervised techniques incorporate labels in the loss function in the form of similarity or dissimilarity matrices, thereby creating over-emphasized separation between class clusters, which does not realistically represent the local and global relationships in the data. In addition, these approaches are often sensitive to parameter tuning, which may be difficult to configure without an explicit quantitative notion of visual superiority. In this paper, we describe a novel supervised visualization technique based on random forest proximities and diffusion-based dimensionality reduction. We show, both qualitatively and quantitatively, the advantages of our approach in retaining local and global structures in data, while emphasizing important variables in the low-dimensional embedding. Importantly, our approach is robust to noise and parameter tuning, thus making it simple to use while producing reliable visualizations for data exploration.
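The abstract names two ingredients, random forest proximities and diffusion-based dimensionality reduction, without giving details. The following is only a minimal sketch of how such a pipeline could be assembled, not the authors' implementation. It assumes the classic shared-leaf proximity (the fraction of trees in which two samples land in the same leaf), a fixed number of diffusion steps `t`, and classical MDS on distances between diffused rows; the function names `rf_proximities`, `diffusion_embedding`, and `classical_mds`, the Iris data set, and all parameter values are illustrative choices that do not come from the paper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import pairwise_distances


def rf_proximities(X, y, n_trees=100, seed=0):
    """Shared-leaf proximity: fraction of trees where two samples share a leaf.

    Assumption: this is the classic definition; the paper may use another one.
    """
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    leaves = forest.apply(X)  # shape (n_samples, n_trees), leaf index per tree
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)


def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS from a matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))


def diffusion_embedding(prox, t=5, n_components=2):
    """Row-normalize proximities, diffuse for t steps, embed the diffused rows."""
    P = prox / prox.sum(axis=1, keepdims=True)  # Markov transition matrix
    Pt = np.linalg.matrix_power(P, t)           # t-step transition probabilities
    D = pairwise_distances(Pt)                  # Euclidean distance between rows
    return classical_mds(D, n_components)


X, y = load_iris(return_X_y=True)
prox = rf_proximities(X, y)
emb = diffusion_embedding(prox)  # one 2-D point per sample, ready to plot
```

Because the forest is trained on the labels, samples that the trees route to the same leaves are treated as close, so the embedding reflects the variables the forest found useful for prediction rather than raw feature distances. The diffusion step is included here as one plausible way to smooth that supervised similarity before embedding.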
