Paper title
High-Performance Statistical Computing in the Computing Environments of the 2020s
Paper authors
Paper abstract
Technological advances in the past decade, hardware and software alike, have made access to high-performance computing (HPC) easier than ever. We review these advances from a statistical computing perspective. Cloud computing makes access to supercomputers affordable. Deep learning software libraries make programming statistical algorithms easy and enable users to write code once and run it anywhere -- from a laptop to a workstation with multiple graphics processing units (GPUs) or a supercomputer in a cloud. Highlighting how these developments benefit statisticians, we review recent optimization algorithms that are useful for high-dimensional models and can harness the power of HPC. Code snippets are provided to demonstrate the ease of programming. We also provide an easy-to-use distributed matrix data structure suitable for HPC. Employing this data structure, we illustrate various statistical applications including large-scale positron emission tomography and $\ell_1$-regularized Cox regression. Our examples easily scale up to an 8-GPU workstation and a 720-CPU-core cluster in a cloud. As a case in point, we analyze the onset of type-2 diabetes from the UK Biobank with 200,000 subjects and about 500,000 single nucleotide polymorphisms using the HPC $\ell_1$-regularized Cox regression. Fitting this half-million-variate model takes less than 45 minutes and reconfirms known associations. To our knowledge, this is the first demonstration of the feasibility of penalized regression of survival outcomes at this scale.