Paper Title

Learning a Single Neuron with Adversarial Label Noise via Gradient Descent

Authors

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract

We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $\sigma:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=\epsilon$, where $F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(\sigma(\mathbf{w}\cdot \mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w})=C\,\epsilon$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant-factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\tilde{O}(d\,\mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\tilde{\Omega}(d/\epsilon)$. In both of these settings, our algorithms are simple, performing gradient descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
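To make the setup concrete, the following is a minimal sketch of plain gradient descent on a regularized empirical $L_2^2$-loss for a single ReLU neuron under label corruption. It is an illustration under simplifying assumptions, not the authors' exact procedure or guarantees: the function name `fit_single_neuron`, the regularization weight `lam`, the step size `lr`, and the iteration count `iters` are all hypothetical choices made for the example.

```python
# Illustrative sketch (not the paper's exact algorithm): gradient descent on
# F_hat(w) = mean((relu(X w) - y)^2) + lam * ||w||^2 for a single ReLU neuron.
# Hyperparameters (lam, lr, iters) are arbitrary assumptions for this demo.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0.0).astype(float)

def fit_single_neuron(X, y, lam=1e-3, lr=0.1, iters=2000):
    """Run fixed-step gradient descent on the regularized empirical L2^2-loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        z = X @ w
        residual = relu(z) - y                      # shape (n,)
        # Gradient of the squared-loss term plus the L2 regularizer.
        grad = 2.0 * X.T @ (residual * relu_grad(z)) / n + 2.0 * lam * w
        w -= lr * grad
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 10, 5000
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)
    X = rng.normal(size=(n, d))                     # isotropic Gaussian marginals
    y = relu(X @ w_star)
    # Corrupt a small fraction of labels to mimic adversarial label noise.
    corrupt = rng.random(n) < 0.05
    y[corrupt] = rng.normal(size=corrupt.sum())
    w_hat = fit_single_neuron(X, y)
    print("empirical L2^2-loss of w_hat:", np.mean((relu(X @ w_hat) - y) ** 2))
```

The point of the sketch is the simplicity the abstract emphasizes: there is no outlier-removal or convex surrogate, only gradient descent on the (regularized) non-convex squared loss; the paper's structural results are what justify that such stationary points are approximately optimal.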
