Paper title
Embedding contrastive unsupervised features to cluster in- and out-of-distribution noise in corrupted image datasets
Paper authors
Paper abstract
Using search engines for web image retrieval is a tempting alternative to manual curation when creating an image dataset, but the main drawback remains the proportion of incorrect (noisy) samples retrieved. Previous work has shown these noisy samples to be a mixture of in-distribution (ID) samples, which are assigned to the incorrect category but present similar visual semantics to other classes in the dataset, and out-of-distribution (OOD) images, which share no semantic correlation with any category in the dataset. The latter are, in practice, the dominant type of noisy images retrieved. To tackle this noise duality, we propose a two-stage algorithm starting with a detection step where we use unsupervised contrastive feature learning to represent images in a feature space. We find that the alignment and uniformity principles of contrastive learning allow OOD samples to be linearly separated from ID samples on the unit hypersphere. We then spectrally embed the unsupervised representations using a fixed neighborhood size and apply an outlier-sensitive clustering at the class level to detect the clean and OOD clusters as well as ID noisy outliers. We finally train a noise-robust neural network that corrects ID noise to the correct category and utilizes OOD samples in a guided contrastive objective, clustering them to improve low-level features. Our algorithm improves on state-of-the-art results on synthetic noise image datasets as well as real-world web-crawled data. Our work is fully reproducible: github.com/PaulAlbert31/SNCF.
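
The detection step in the abstract (unit-hypersphere features, a fixed neighborhood size, a spectral embedding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `l2_normalize`, `knn_affinity`, and `spectral_embed`, the choice of cosine similarity, the symmetrized binary graph, and the normalized Laplacian are all assumptions made for illustration. The outlier-sensitive clustering that follows is not shown, since the abstract does not name the method used.

```python
import numpy as np


def l2_normalize(x):
    """Project each feature vector onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def knn_affinity(features, k):
    """Binary, symmetrized k-nearest-neighbor graph (assumed: cosine similarity)."""
    z = l2_normalize(features)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)            # a sample is never its own neighbor
    idx = np.argsort(-sim, axis=1)[:, :k]     # k most similar samples per row
    n = len(z)
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(adj, adj.T)             # keep an edge if either side chose it


def spectral_embed(features, k, dim):
    """Embed samples with eigenvectors of the normalized graph Laplacian."""
    adj = knn_affinity(features, k)
    deg = adj.sum(axis=1)                     # every node has at least k edges
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)             # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]                 # drop the first (trivial) eigenvector


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 128))       # stand-in for contrastive features
    emb = spectral_embed(feats, k=10, dim=8)
    print(emb.shape)
```

In this sketch the embedding would be computed separately for the samples of each class, and a clustering method that can label points as outliers would then be run on `emb` to separate the clean cluster, the OOD cluster, and the ID noisy outliers. The values `k=10` and `dim=8` are placeholders, not settings taken from the paper.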