Proteome-scale Deployment of Protein Structure Prediction Workflows on the Summit Supercomputer

Paper Title

Proteome-scale Deployment of Protein Structure Prediction Workflows on the Summit Supercomputer

Authors

Mu Gao, Mark Coletti, Russell B. Davidson, Ryan Prout, Subil Abraham, Benjamin Hernandez, Ada Sedova

Abstract

Deep learning has contributed to major advances in the prediction of protein structure from sequence, a fundamental problem in structural bioinformatics. With predictions now approaching the accuracy of crystallographic resolution in some cases, and with accelerators like GPUs and TPUs making inference using large models rapid, fast genome-level structure prediction becomes an obvious aim. Leadership-class computing resources can be used to perform genome-scale protein structure prediction using state-of-the-art deep learning models, providing a wealth of new data for systems biology applications. Here we describe our efforts to efficiently deploy the AlphaFold2 program, for full-proteome structure prediction, at scale on the Oak Ridge Leadership Computing Facility's resources, including the Summit supercomputer. We performed inference to produce the predicted structures for 35,634 protein sequences, corresponding to three prokaryotic proteomes and one plant proteome, using under 4,000 total Summit node hours, equivalent to using the majority of the supercomputer for one hour. We also designed an optimized structure refinement that reduced the time for the relaxation stage of the AlphaFold pipeline by over 10X for longer sequences. We demonstrate the types of analyses that can be performed on proteome-scale collections of sequences, including a search for novel quaternary structures and implications for functional annotation.
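
As a rough sanity check on the throughput figures quoted in the abstract, the arithmetic can be sketched as follows. The sequence count and the node-hour bound come from the abstract; the Summit node count (4,608) is an assumption taken from the publicly documented system size, not from this text, and the helper names are hypothetical. Because the abstract gives only an upper bound ("under 4,000"), both results are upper bounds as well.

```python
# Back-of-envelope check of the throughput figures quoted in the abstract.
# This is an illustrative calculation, not the authors' accounting.

NUM_SEQUENCES = 35_634     # from the abstract
NODE_HOURS_UPPER = 4_000   # from the abstract: "under 4,000 total Summit node hours"
SUMMIT_NODES = 4_608       # ASSUMPTION: documented Summit node count, not stated in the abstract


def node_minutes_per_sequence(node_hours: float, num_sequences: int) -> float:
    """Average node-minutes spent per predicted sequence."""
    return node_hours * 60 / num_sequences


def machine_fraction_for_one_hour(node_hours: float, total_nodes: int) -> float:
    """Fraction of the machine needed to deliver node_hours within one wall-clock hour."""
    return node_hours / total_nodes


per_seq = node_minutes_per_sequence(NODE_HOURS_UPPER, NUM_SEQUENCES)
fraction = machine_fraction_for_one_hour(NODE_HOURS_UPPER, SUMMIT_NODES)

print(f"At most {per_seq:.1f} node-minutes per sequence on average")
print(f"At most {fraction:.0%} of the machine for one hour")
```

Under these assumptions the average cost comes to a few node-minutes per sequence, and the total is consistent with the abstract's description of "the majority of the supercomputer for one hour".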
