Paper Title

Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Authors

Hiroki Naganuma, Kartik Ahuja, Shiro Takagi, Tetsuya Motokawa, Rio Yokota, Kohta Ishikawa, Ikuro Sato, Ioannis Mitliagkas

Abstract

Modern deep learning systems do not generalize well when the test data distribution differs slightly from the training data distribution. While much promising work has been done to address this fragility, no systematic study of optimizers and their out-of-distribution generalization performance has been undertaken. In this study, we examine the performance of popular first-order optimizers under different classes of distribution shift, under both empirical risk minimization and invariant risk minimization. We address this question for image and text classification using DomainBed, WILDS, and the Backgrounds Challenge as testbeds for studying different types of shift -- namely correlation shift and diversity shift. We search over a wide range of hyperparameters and examine classification accuracy (in-distribution and out-of-distribution) for over 20,000 models. We arrive at the following findings, which we expect to be helpful to practitioners: i) adaptive optimizers (e.g., Adam) perform worse than non-adaptive optimizers (e.g., SGD, momentum SGD) in terms of out-of-distribution performance; in particular, even when there is no significant difference in in-distribution performance, we observe a measurable difference in out-of-distribution performance. ii) In-distribution and out-of-distribution performance exhibit three types of relationship depending on the dataset -- linear returns, increasing returns, and diminishing returns. For example, when training on natural language data with Adam, fine-tuning in-distribution performance does not significantly improve out-of-distribution generalization performance.
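The abstract contrasts adaptive optimizers (Adam) with non-adaptive ones (SGD). The sketch below is not the paper's code; it is a minimal, self-contained illustration of the two update rules being compared, applied to a toy one-dimensional quadratic. The function, learning rates, and step counts are arbitrary choices for illustration.

```python
import math

def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def run_sgd(w, lr=0.1, steps=200):
    # Plain (non-adaptive) gradient descent.
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_adam(w, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    # Adam: per-parameter step sizes from running moment estimates.
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

print(run_sgd(0.0), run_adam(0.0))
```

Both runs approach the minimum at w = 3; the paper's contribution is measuring how this choice of update rule affects out-of-distribution accuracy at scale, which a toy example like this cannot show.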
