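Improved accuracy and transferability of molecular-orbital-based machine learning: Organics, transition-metal complexes, non-covalent interactions, and transition states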

Paper title

Improved accuracy and transferability of molecular-orbital-based machine learning: Organics, transition-metal complexes, non-covalent interactions, and transition states

Authors

Tamara Husch, Jiace Sun, Lixue Cheng, Sebastian J. R. Lee, Thomas F. Miller III

Abstract

Molecular-orbital-based machine learning (MOB-ML) provides a general framework for the prediction of accurate correlation energies at the cost of obtaining molecular orbitals. We demonstrate the importance of preserving physical constraints, including invariance conditions and size consistency, when generating the input for the machine learning model. Numerical improvements are demonstrated for different data sets covering total and relative energies for thermally accessible organic and transition-metal-containing molecules, non-covalent interactions, and transition-state energies. MOB-ML requires training data from only 1% of the QM7b-T data set (i.e., only 70 organic molecules with seven or fewer heavy atoms) to predict the total energy of the remaining 99% of this data set with sub-kcal/mol accuracy. This MOB-ML model is significantly more accurate than other methods when transferred to a data set of molecules with thirteen heavy atoms, exhibiting no loss of accuracy on a size-intensive (i.e., per-electron) basis. It is shown that MOB-ML also works well for extrapolating to transition-state structures, predicting the barrier region for malonaldehyde intramolecular proton transfer to within 0.35 kcal/mol when trained only on reactant/product-like structures. Finally, the use of the Gaussian process variance enables an active learning strategy for extending a MOB-ML model to new regions of chemical space with minimal effort. We demonstrate this active learning strategy by extending a QM7b-T model to describe non-covalent interactions in the protein backbone-backbone interaction data set to an accuracy of 0.28 kcal/mol.
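
The variance-driven active learning mentioned in the abstract can be illustrated with a toy sketch. This is not the MOB-ML implementation: the squared-exponential kernel, the fixed hyperparameters, the one-dimensional pool of inputs, and the names `rbf_kernel`, `gp_posterior`, and `active_learning` are all assumptions chosen for illustration. The only idea taken from the abstract is the selection rule: at each step, label the candidate where the Gaussian process predictive variance is largest.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and b (assumed kernel)."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP with unit prior variance."""
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train, length_scale)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = 1.0 - np.sum(v**2, axis=0)  # k(x, x) = 1 for this kernel
    return mean, np.maximum(var, 0.0)

def active_learning(x_pool, label_fn, n_initial=2, n_steps=5, seed=0):
    """Repeatedly add the pool point with the largest predictive variance."""
    rng = np.random.default_rng(seed)
    chosen = [int(i) for i in rng.choice(len(x_pool), size=n_initial, replace=False)]
    max_var_history = []
    for _ in range(n_steps):
        x_train = x_pool[chosen]
        y_train = label_fn(x_train)  # stands in for an expensive reference calculation
        _, var = gp_posterior(x_train, y_train, x_pool)
        var[chosen] = -np.inf  # never re-select an already labeled point
        nxt = int(np.argmax(var))
        max_var_history.append(float(var[nxt]))
        chosen.append(nxt)
    return chosen, max_var_history

# Toy usage: 50 candidate inputs on a line, with sin(x) as the "reference" label.
x_pool = np.linspace(0.0, 5.0, 50)[:, None]
chosen, max_var_history = active_learning(x_pool, lambda x: np.sin(x[:, 0]))
```

With fixed hyperparameters, the largest remaining variance cannot grow as points are added, so `max_var_history` gives a simple stopping signal: training can end once it falls below a chosen tolerance.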
