Paper Title
A Proposal to Study "Is High Quality Data All We Need?"
Paper Authors
Paper Abstract
Even though deep neural models have achieved superhuman performance on many popular benchmarks, they fail to generalize to out-of-distribution (OOD) or adversarial datasets. Conventional approaches to increasing robustness include developing ever-larger models and augmenting training with large-scale datasets. Orthogonal to these trends, however, we hypothesize that a smaller, high-quality dataset is what we need. Our hypothesis rests on the fact that deep neural networks are data-driven models: data is what leads, or misleads, them. In this work, we propose an empirical study that examines how to select a subset of benchmark data, and/or create high-quality data, from which a model can learn effectively. We seek to answer whether big datasets are truly needed to learn a task, and whether a smaller subset of high-quality data can replace them. We plan to investigate both data pruning and data creation paradigms for generating high-quality datasets.
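To make the data pruning paradigm concrete, the sketch below ranks training examples by a per-example quality score and keeps only the top fraction. The scoring function and keep fraction here are illustrative assumptions (the abstract does not specify how quality would be measured), not the study's actual method.

```python
# Minimal sketch of score-based data pruning, assuming each example already
# has a scalar quality score (e.g., from a trained reference model or a
# human annotation pass -- hypothetical choices, not the proposal's method).

def prune_dataset(examples, scores, keep_fraction=0.1):
    """Keep the `keep_fraction` of examples with the highest scores,
    preserving their original order."""
    assert len(examples) == len(scores)
    # Rank example indices from highest to lowest quality score.
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    n_keep = max(1, int(len(examples) * keep_fraction))
    kept = sorted(ranked[:n_keep])
    return [examples[i] for i in kept]

# Toy usage: five examples with hypothetical quality scores.
data = ["ex_a", "ex_b", "ex_c", "ex_d", "ex_e"]
quality = [0.2, 0.9, 0.5, 0.95, 0.1]
subset = prune_dataset(data, quality, keep_fraction=0.4)
print(subset)  # the two highest-scored examples: ['ex_b', 'ex_d']
```

The same interface could later be backed by richer scores (loss values, gradient norms, agreement across models); only the `scores` argument would change.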