How Much More Data Do I Need? Estimating Requirements for Downstream Tasks

Paper Title

How Much More Data Do I Need? Estimating Requirements for Downstream Tasks

Authors

Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law

Abstract

Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance? This question is of critical importance in applications such as autonomous driving or medical imaging where collecting data is expensive and time-consuming. Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget. Prior work on neural scaling laws suggests that the power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the required data set size to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements. Finally, we show that incorporating a tuned correction factor and collecting over multiple rounds significantly improves the performance of the data estimators. Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
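
The basic idea in the abstract, fitting a power law to a performance curve and inverting it to get a data requirement, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two-parameter form error ≈ a * n^(-b), the log-log least-squares fit, the synthetic measurements, and the `margin` argument, which is a hypothetical stand-in for a tuned correction factor that tightens the target so the estimate errs toward collecting more data.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit errors ≈ a * sizes**(-b) by least squares in log-log space.

    Assumed two-parameter form; the paper studies a broader family of functions.
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)


def required_size(a, b, target_error, margin=0.0):
    """Invert the fitted curve to estimate the data set size for a target error.

    `margin` is a hypothetical correction: it lowers the target error so the
    estimate leans toward more data rather than less.
    """
    effective_target = target_error - margin
    if effective_target <= 0:
        raise ValueError("target_error minus margin must be positive")
    return (a / effective_target) ** (1.0 / b)


# Synthetic validation errors measured on growing subsets (illustrative only).
sizes = np.array([1000.0, 2000.0, 4000.0, 8000.0])
errors = 2.0 * sizes ** -0.3

a, b = fit_power_law(sizes, errors)
n_plain = required_size(a, b, target_error=0.05)
n_safe = required_size(a, b, target_error=0.05, margin=0.005)
print(f"fitted a={a:.3f}, b={b:.3f}")
print(f"estimated size without margin: {n_plain:.0f}")
print(f"estimated size with margin:    {n_safe:.0f}")
```

On real measurements the fit will be noisy, so an estimate of this kind is better re-computed after each collection round than trusted once.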
