Paper Title

Towards Management of Energy Consumption in HPC Systems with Fault Tolerance

Authors

Marina Morán, Javier Balladini, Dolores Rexachs, Enzo Rucci

Abstract

High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption also continues to rise, and finding ways to limit or reduce it is a key focus of current research. For high-performance MPI applications, there are fault tolerance methods based on rollback recovery, such as uncoordinated checkpointing. When a failure occurs, these methods roll back only some processes, while the rest continue to run. In this article, we focus on the processes that continue execution and propose a series of strategies to manage energy consumption when a failure occurs and uncoordinated checkpoints are used. We present an energy model to evaluate the strategies and, through simulation, analyze the behavior of an application under different configurations and failure times. The results show that it is feasible to improve energy efficiency in HPC systems in the presence of a failure.
