论文标题
包装是最佳的PAC学习者
Bagging is an Optimal PAC Learner
论文作者
论文摘要
在可实现的环境中确定PAC学习的最佳样本复杂性是数十年来学习理论中的一个核心开放问题。最后,Hanneke(2016)的开创性作品给出了一种算法,其样本复杂性是最佳的。他的算法是基于对培训数据的仔细且结构化的子采样,然后在对每个子样本的假设中返回多数票。尽管这是一个非常令人兴奋的理论结果,但它在实践中并没有太大影响,部分原因是效率低下,因为它构建了训练数据的多项式子样本数量,每个训练数据都有线性大小。 在这项工作中,我们证明了令人惊讶的结果是,由于Breiman(1996),实用和经典的启发式装袋(又称Bootstrap聚集)实际上也是最佳的PAC学习者。在大多数本科机器学习课程中都教授了汉纳克算法的预期算法。此外,我们表明它仅需要对数数量的子样本才能达到最佳性。
Determining the optimal sample complexity of PAC learning in the realizable setting was a central open problem in learning theory for decades. Finally, the seminal work by Hanneke (2016) gave an algorithm with a provably optimal sample complexity. His algorithm is based on a careful and structured sub-sampling of the training data and then returning a majority vote among hypotheses trained on each of the sub-samples. While being a very exciting theoretical result, it has not had much impact in practice, in part due to inefficiency, since it constructs a polynomial number of sub-samples of the training data, each of linear size. In this work, we prove the surprising result that the practical and classic heuristic bagging (a.k.a. bootstrap aggregation), due to Breiman (1996), is in fact also an optimal PAC learner. Bagging pre-dates Hanneke's algorithm by twenty years and is taught in most undergraduate machine learning courses. Moreover, we show that it only requires a logarithmic number of sub-samples to reach optimality.