Paper Title

Learning a Single Neuron with Adversarial Label Noise via Gradient Descent

Authors

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract

We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $\sigma:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=\epsilon$, where $F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(\sigma(\mathbf{w}\cdot \mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w})=C\,\epsilon$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant-factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\tilde{O}(d\,\mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\tilde{\Omega}(d/\epsilon)$. In both of these settings, our algorithms are simple, performing gradient descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
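To make the setup concrete, the following is a minimal sketch of plain gradient descent on a regularized empirical $L_2^2$-loss for a single ReLU neuron under label corruption. It is an illustration under simplifying assumptions, not the authors' exact procedure or guarantees: the function name `fit_single_neuron`, the regularization weight `lam`, the step size `lr`, and the iteration count `iters` are all hypothetical choices made for the example.

```python
# Illustrative sketch (not the paper's exact algorithm): gradient descent on
# F_hat(w) = mean((relu(X w) - y)^2) + lam * ||w||^2 for a single ReLU neuron.
# Hyperparameters (lam, lr, iters) are arbitrary assumptions for this demo.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0.0).astype(float)

def fit_single_neuron(X, y, lam=1e-3, lr=0.1, iters=2000):
    """Run fixed-step gradient descent on the regularized empirical L2^2-loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        z = X @ w
        residual = relu(z) - y                      # shape (n,)
        # Gradient of the squared-loss term plus the L2 regularizer.
        grad = 2.0 * X.T @ (residual * relu_grad(z)) / n + 2.0 * lam * w
        w -= lr * grad
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 10, 5000
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)
    X = rng.normal(size=(n, d))                     # isotropic Gaussian marginals
    y = relu(X @ w_star)
    # Corrupt a small fraction of labels to mimic adversarial label noise.
    corrupt = rng.random(n) < 0.05
    y[corrupt] = rng.normal(size=corrupt.sum())
    w_hat = fit_single_neuron(X, y)
    print("empirical L2^2-loss of w_hat:", np.mean((relu(X @ w_hat) - y) ** 2))
```

The point of the sketch is the simplicity the abstract emphasizes: there is no outlier-removal or convex surrogate, only gradient descent on the (regularized) non-convex squared loss; the paper's structural results are what justify that such stationary points are approximately optimal.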
