Paper title
Resiliency in Numerical Algorithm Design for Extreme Scale Simulations
Paper authors
Paper abstract
This work is based on the seminar titled ``Resiliency in Numerical Algorithm Design for Extreme Scale Simulations'', held March 1-6, 2020 at Schloss Dagstuhl and attended by all the authors. Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress unless robust alternatives are investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements? One avenue would be to refine and improve system- or application-level checkpointing and rollback strategies for the case in which an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge.