Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS

Paper title

Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS

Authors

Tom, Gary; Hickman, Riley J.; Zinzuwadia, Aniket; Mohajeri, Afshan; Sanchez-Lengeling, Benjamin; Aspuru-Guzik, Alan

Abstract

Deep learning models that leverage large datasets are often the state of the art for modelling molecular properties. When the datasets are smaller (< 2000 molecules), it is not clear that deep learning approaches are the right modelling tool. In this work we perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets. Using different molecular representations and models, we analyse the quality of their predictions and uncertainties in a variety of tasks (binary, regression) and datasets. We also introduce two simulated experiments that evaluate their performance: (1) Bayesian optimization guided molecular design, (2) inference on out-of-distribution data via ablated cluster splits. We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments. We have packaged our analysis into the DIONYSUS repository, which is open sourced to aid in reproducibility and extension to new datasets.
