Paper Title

FP-NAS: Fast Probabilistic Neural Architecture Search

Paper Authors

Zhicheng Yan, Xiaoliang Dai, Peizhao Zhang, Yuandong Tian, Bichen Wu, Matt Feiszli

Paper Abstract

Differential Neural Architecture Search (NAS) requires all layer choices to be held in memory simultaneously; this limits the size of both search space and final architecture. In contrast, Probabilistic NAS, such as PARSEC, learns a distribution over high-performing architectures, and uses only as much memory as needed to train a single model. Nevertheless, it needs to sample many architectures, making it computationally expensive for searching in an extensive space. To solve these problems, we propose a sampling method adaptive to the distribution entropy, drawing more samples to encourage explorations at the beginning, and reducing samples as learning proceeds. Furthermore, to search fast in the multi-variate space, we propose a coarse-to-fine strategy by using a factorized distribution at the beginning which can reduce the number of architecture parameters by over an order of magnitude. We call this method Fast Probabilistic NAS (FP-NAS). Compared with PARSEC, it can sample 64% fewer architectures and search 2.1x faster. Compared with FBNetV2, FP-NAS is 1.9x - 3.5x faster, and the searched models outperform FBNetV2 models on ImageNet. FP-NAS allows us to expand the giant FBNetV2 space to be wider (i.e. larger channel choices) and deeper (i.e. more blocks), while adding Split-Attention block and enabling the search over the number of splits. When searching a model of size 0.4G FLOPS, FP-NAS is 132x faster than EfficientNet, and the searched FP-NAS-L0 model outperforms EfficientNet-B0 by 0.7% accuracy. Without using any architecture surrogate or scaling tricks, we directly search large models up to 1.0G FLOPS. Our FP-NAS-L2 model with simple distillation outperforms BigNAS-XL with advanced in-place distillation by 0.7% accuracy using similar FLOPS.
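
Two ideas in the abstract lend themselves to a short illustration: drawing a number of candidate architectures that scales with the entropy of the architecture distribution (many samples early for exploration, fewer as the distribution sharpens), and replacing a joint per-layer distribution with a factorized one to cut the number of architecture parameters. The sketch below is a minimal illustration under those assumptions; the function names, the scaling constant `lam`, the minimum sample count, and the example choice counts are illustrative, not values taken from the paper.

```python
import math
import numpy as np

def adaptive_sample_count(probs, lam=2.0, min_samples=4):
    """Number of architectures to sample, scaled by the entropy of the
    architecture distribution: high entropy early in search -> more samples
    (exploration), low entropy later -> fewer samples. `lam` and
    `min_samples` are illustrative constants, not values from the paper."""
    probs = np.asarray(probs, dtype=np.float64)
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0)))
    return max(min_samples, math.ceil(lam * entropy))

def parameter_counts(choices_per_variable):
    """Compare architecture-parameter counts for a joint distribution over
    all variable combinations in a layer vs. a factorized (per-variable)
    distribution, illustrating the coarse-to-fine reduction."""
    joint = math.prod(choices_per_variable)   # one probability per combination
    factorized = sum(choices_per_variable)    # one categorical per variable
    return joint, factorized

if __name__ == "__main__":
    # Near-uniform distribution over 16 candidate ops -> more samples drawn.
    early = np.full(16, 1.0 / 16)
    # Peaked distribution late in search -> fewer samples drawn.
    late = np.array([0.9] + [0.1 / 15] * 15)
    print(adaptive_sample_count(early), adaptive_sample_count(late))  # e.g. 6 vs 4

    # A hypothetical layer searched over 3 variables with 9, 6, and 4 choices:
    # joint = 216 parameters vs. factorized = 19, an over-11x reduction,
    # consistent with the "over an order of magnitude" claim in the abstract.
    print(parameter_counts([9, 6, 4]))
```

The worked numbers show why the factorized form is cheap enough to use as the coarse stage: the joint distribution grows multiplicatively with the number of per-layer search variables, while the factorized one grows only additively.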
