Title

Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures

Authors

Emmanuel Abbe, Samy Bengio, Elisabetta Cornacchia, Jon Kleinberg, Aryo Lotfi, Maithra Raghu, Chiyuan Zhang

Abstract

This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
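
The central quantity in the abstract is the Boolean influence: for a {-1, +1}-valued function f, the influence of coordinate i is the probability over a uniform input that flipping bit i changes the label. The sketch below, a minimal illustration and not the exact benchmark instance of [ZRKB21], computes the influences of a toy PVR-style target by exhaustive enumeration: two pointer bits select one of four 2-bit value windows, and the label is the parity of the selected window.

```python
import itertools

# Hypothetical PVR-style target (illustrative; not the exact benchmark
# instance of [ZRKB21]): bits x[0], x[1] form a pointer selecting one of
# four 2-bit value windows; the label is the parity of that window.
def pvr(x):
    ptr = 2 * x[0] + x[1]                  # pointer in {0, 1, 2, 3}
    window = x[2 + 2 * ptr : 4 + 2 * ptr]  # the selected value window
    return 1 - 2 * (sum(window) % 2)       # label in {-1, +1}

n = 10  # 2 pointer bits + 8 value bits
inputs = list(itertools.product([0, 1], repeat=n))

def boolean_influence(f, i):
    """Inf_i(f) = Pr_x[f(x) != f(x with bit i flipped)], x uniform."""
    flips = 0
    for x in inputs:
        y = list(x)
        y[i] ^= 1
        flips += f(x) != f(tuple(y))
    return flips / len(inputs)

for i in range(n):
    print(f"Inf_{i} = {boolean_influence(pvr, i):.3f}")
```

For this toy target, each pointer bit has influence 0.5 (flipping it switches to an independent window, whose parity differs half the time), while each value bit has influence 0.25 (it only matters when the pointer selects its window). In the canonical holdout setting described in the abstract, one bit (say x[0]) would be frozen in the training distribution; the paper's characterization predicts that the generalization error of GD-trained linear models on the full distribution, under quadratic loss, is given by the Boolean influence of the frozen bit.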
