Paper Title
Elixir: Train a Large Language Model on a Small GPU Cluster
Paper Authors
Paper Abstract
In recent years, large language models have achieved great success due to their unprecedented size. However, training these models poses a challenge for most researchers, as it requires a substantial number of GPUs. To reduce GPU memory usage, memory partitioning and memory offloading have been proposed. These approaches eliminate memory redundancies and move memory usage to CPU and NVMe memory, respectively, enabling training on small GPU clusters. However, directly deploying these solutions often leads to suboptimal efficiency; only experienced experts can unleash the full potential of the hardware by carefully tuning the distributed configuration. Thus, we present Elixir, a novel solution that automates efficient large-model training based on pre-runtime model profiling. Elixir aims to identify the optimal combination of partitioning and offloading techniques to maximize training throughput. In our experiments, Elixir significantly outperforms the current state-of-the-art baseline: our optimal configuration achieves up to a 3.4$\times$ speedup on GPT-2 models compared with SOTA solutions. We hope that our work will benefit individuals who lack computing resources and expertise, granting them access to large models. The beta version of Elixir is now available at https://github.com/hpcaitech/ColossalAI/tree/feature/elixir.
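To make the abstract's central idea concrete, the sketch below illustrates, in schematic form, what a pre-runtime search over partitioning and offloading configurations could look like: enumerate candidate configurations, estimate GPU memory and throughput from profiled model statistics, and keep the fastest feasible one. The cost model, dataclass fields, and numbers are illustrative assumptions for exposition only; they are not Elixir's actual API or algorithm.

```python
# Hypothetical sketch of a pre-runtime configuration search (not Elixir's API).
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    partition_params: bool   # shard parameters/gradients across GPUs
    offload_optimizer: bool  # keep optimizer states in CPU memory
    offload_params: bool     # stage parameters in CPU/NVMe memory


def estimated_gpu_memory_gb(cfg: Config, model_gb: float, n_gpus: int) -> float:
    """Rough memory model: parameters + gradients + optimizer states (Adam ~ 2x params)."""
    shard = n_gpus if cfg.partition_params else 1
    params = model_gb / shard
    if cfg.offload_params:
        params *= 0.25  # assume only a working set of parameters stays on the GPU
    grads = params
    optim = 0.0 if cfg.offload_optimizer else 2 * model_gb / shard
    return params + grads + optim


def estimated_throughput(cfg: Config) -> float:
    """Toy cost model: offloading saves memory but pays PCIe traffic; sharding adds collectives."""
    t = 1.0
    if cfg.offload_optimizer:
        t *= 0.8
    if cfg.offload_params:
        t *= 0.5
    if cfg.partition_params:
        t *= 0.9
    return t


def search(model_gb: float, n_gpus: int, gpu_mem_gb: float) -> Config:
    """Pick the feasible configuration with the highest estimated throughput."""
    candidates = (Config(*flags) for flags in product([False, True], repeat=3))
    feasible = [c for c in candidates
                if estimated_gpu_memory_gb(c, model_gb, n_gpus) <= gpu_mem_gb]
    return max(feasible, key=estimated_throughput)


if __name__ == "__main__":
    # Example: a 10 GB model on 4 GPUs with 24 GB of memory each.
    print(search(model_gb=10.0, n_gpus=4, gpu_mem_gb=24.0))
```

In this toy setting the search returns parameter partitioning without offloading, since it already fits in memory and avoids the PCIe penalty; a real profiler-driven system would replace the hard-coded estimates with measured per-layer statistics.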