Paper Title
On the Informativeness of Supervision Signals
Paper Authors
Paper Abstract
Supervised learning typically focuses on learning transferable representations from training examples annotated by humans. While rich annotations (like soft labels) carry more information than sparse annotations (like hard labels), they are also more expensive to collect. For example, while hard labels only provide information about the closest class an object belongs to (e.g., "this is a dog"), soft labels provide information about the object's relationship with multiple classes (e.g., "this is most likely a dog, but it could also be a wolf or a coyote"). We use information theory to compare how a number of commonly-used supervision signals contribute to representation-learning performance, as well as how their capacity is affected by factors such as the number of labels, classes, dimensions, and noise. Our framework provides theoretical justification for using hard labels in the big-data regime, but richer supervision signals for few-shot learning and out-of-distribution generalization. We validate these results empirically in a series of experiments with over 1 million crowdsourced image annotations and conduct a cost-benefit analysis to establish a tradeoff curve that enables users to optimize the cost of supervising representation learning on their own datasets.
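The abstract's contrast between hard and soft labels can be made concrete with Shannon entropy. The sketch below (not from the paper; the example probabilities are illustrative assumptions) shows that a one-hot hard label is a degenerate distribution carrying no annotator uncertainty, while a soft label over the same classes retains graded information about the object's relationship to multiple classes.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Classes: {dog, wolf, coyote}
# Hard label: "this is a dog" -- a one-hot vector
hard_label = [1.0, 0.0, 0.0]

# Soft label: "most likely a dog, but it could be a wolf or a coyote"
# (probabilities here are made up for illustration)
soft_label = [0.7, 0.2, 0.1]

print(entropy(hard_label))  # 0.0 bits: no annotator uncertainty is preserved
print(entropy(soft_label))  # ~1.16 bits: class-similarity structure survives
```

The gap between the two entropies is one way to see why richer supervision signals can matter more when labels are scarce, as the paper argues for few-shot and out-of-distribution settings.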