Paper Title
Few-Shot Image Classification Benchmarks are Too Far From Reality: Build Back Better with Semantic Task Sampling
Paper Authors
Paper Abstract
Every day, a new method is published to tackle Few-Shot Image Classification, showing better and better performance on academic benchmarks. Nevertheless, we observe that these current benchmarks do not accurately represent the real industrial use cases that we encountered. In this work, through both qualitative and quantitative studies, we expose that the widely used benchmark tieredImageNet is strongly biased towards tasks composed of very semantically dissimilar classes, e.g. bathtub, cabbage, pizza, schipperke, and cardoon. This makes tieredImageNet (and similar benchmarks) irrelevant for evaluating the ability of a model to solve real-life use cases, which usually involve more fine-grained classification. We mitigate this bias using semantic information about the classes of tieredImageNet and generate an improved, balanced benchmark. Going further, we also introduce a new benchmark for Few-Shot Image Classification using the Danish Fungi 2020 dataset. This benchmark proposes a wide variety of evaluation tasks at various levels of fine-grainedness. Moreover, this benchmark includes many-way tasks (e.g. composed of 100 classes), which is a challenging setting yet very common in industrial applications. Our experiments bring out the correlation between the difficulty of a task and the semantic similarity between its classes, as well as a heavy performance drop of state-of-the-art methods on many-way few-shot classification, raising questions about the scaling abilities of these methods. We hope that our work will encourage the community to further question the quality of standard evaluation processes and their relevance to real-life applications.
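The semantic task sampling idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' actual procedure: we assume a precomputed pairwise similarity score between classes (e.g. derived from WordNet distances), pick a seed class, and draw the remaining classes with probability proportional to their similarity to the seed, so sampled tasks are semantically coherent rather than uniformly random.

```python
import random


def sample_semantic_task(classes, similarity, n_way, rng=None):
    """Sample an n_way-class few-shot task biased towards semantically
    close classes.

    classes: list of class names.
    similarity: dict mapping (class_a, class_b) -> similarity in (0, 1],
        assumed precomputed (e.g. from WordNet path similarity).
    n_way: number of classes in the task.
    """
    rng = rng or random.Random()
    # Pick a seed class uniformly at random.
    seed = rng.choice(classes)
    task = [seed]
    candidates = [c for c in classes if c != seed]
    # Draw the remaining classes, weighted by similarity to the seed,
    # without replacement.
    while len(task) < n_way:
        weights = [similarity[(seed, c)] for c in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        task.append(pick)
        candidates.remove(pick)
    return task
```

With a similarity matrix concentrated within semantic clusters (dog breeds, fungi species, ...), most sampled tasks stay inside one cluster, which is the fine-grained regime the paper argues is under-represented in uniform task sampling.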