Self-Representation Based Unsupervised Exemplar Selection in a Union of Subspaces

Paper Title

Self-Representation Based Unsupervised Exemplar Selection in a Union of Subspaces

Authors

Chong You, Chi Li, Daniel P. Robinson, Rene Vidal

Abstract

Finding a small set of representatives from an unlabeled dataset is a core problem in a broad range of applications such as dataset summarization and information extraction. Classical exemplar selection methods such as $k$-medoids work under the assumption that the data points are close to a few cluster centroids, and cannot handle the case where data lie close to a union of subspaces. This paper proposes a new exemplar selection model that searches for a subset that best reconstructs all data points as measured by the $\ell_1$ norm of the representation coefficients. Geometrically, this subset best covers all the data points as measured by the Minkowski functional of the subset. To solve our model efficiently, we introduce a farthest first search algorithm that iteratively selects the worst represented point as an exemplar. When the dataset is drawn from a union of independent subspaces, our method is able to select sufficiently many representatives from each subspace. We further develop an exemplar based subspace clustering method that is robust to imbalanced data and efficient for large scale data. Moreover, we show that a classifier trained on the selected exemplars (when they are labeled) can correctly classify the rest of the data points.
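The farthest first search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the exact objective, so the sketch assumes a lasso-style cost, min_c ||c||_1 + (lam / 2) * ||x - sum_i c_i e_i||^2, with a hypothetical trade-off parameter `lam`. The function names, the coordinate-descent solver, the fixed starting index, and the toy data are all assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (proximal operator of t * |.|)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def representation_cost(x, exemplars, lam=10.0, sweeps=100):
    """Approximate min_c ||c||_1 + (lam / 2) * ||x - sum_i c_i e_i||^2
    by cyclic coordinate descent (assumed cost, not taken from the abstract)."""
    c = [0.0] * len(exemplars)
    residual = list(x)  # x - sum_i c_i e_i, kept up to date
    for _ in range(sweeps):
        for i, e in enumerate(exemplars):
            norm_sq = dot(e, e)
            if norm_sq == 0.0:
                continue
            # Correlation of e with the residual that excludes e's own term.
            b = dot(e, residual) + c[i] * norm_sq
            new_ci = soft_threshold(b, 1.0 / lam) / norm_sq
            delta = new_ci - c[i]
            if delta != 0.0:
                residual = [r - delta * ej for r, ej in zip(residual, e)]
                c[i] = new_ci
    return sum(abs(ci) for ci in c) + 0.5 * lam * dot(residual, residual)


def farthest_first_search(points, k, lam=10.0):
    """Greedily pick k exemplar indices, starting from index 0 (an assumption)."""
    selected = [0]
    while len(selected) < min(k, len(points)):
        exemplars = [points[i] for i in selected]
        candidates = [j for j in range(len(points)) if j not in selected]
        # The worst represented point is the one with the largest cost.
        worst = max(candidates,
                    key=lambda j: representation_cost(points[j], exemplars, lam))
        selected.append(worst)
    return selected


# Toy data: unit-norm points on two orthogonal lines (two 1-D subspaces).
points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
print(farthest_first_search(points, k=2))  # prints [0, 2]
```

In this toy case the second exemplar comes from the line that the first exemplar cannot represent, which mirrors the abstract's claim that representatives are drawn from each subspace. Already selected indices are skipped explicitly as a safeguard in the sketch.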
