Paper Title

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction

Paper Authors

Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar

Paper Abstract

GNNs and chemical fingerprints are the predominant approaches to representing molecules for property prediction. However, in NLP, transformers have become the de-facto standard for representation learning thanks to their strong downstream task transfer. In parallel, the software ecosystem around transformers is maturing rapidly, with libraries like HuggingFace and BertViz enabling streamlined training and introspection. In this work, we make one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via our ChemBERTa model. ChemBERTa scales well with pretraining dataset size, offering competitive downstream performance on MoleculeNet and useful attention-based visualization modalities. Our results suggest that transformers offer a promising avenue of future work for molecular representation learning and property prediction. To facilitate these efforts, we release a curated dataset of 77M SMILES from PubChem suitable for large-scale self-supervised pretraining.
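
As a concrete illustration of the workflow the abstract describes, the minimal sketch below loads a ChemBERTa-style masked-language-model checkpoint through the HuggingFace `transformers` library and predicts a masked token in a SMILES string, mirroring the self-supervised pretraining objective. The model id `seyonec/ChemBERTa-zinc-base-v1` is an assumed publicly hosted checkpoint, not something specified in the abstract; substitute whichever pretrained weights you are using.

```python
# Minimal sketch (not from the paper): ChemBERTa-style masked-token prediction
# on a SMILES string via the HuggingFace transformers library.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_name = "seyonec/ChemBERTa-zinc-base-v1"  # assumed HuggingFace Hub model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Fill-mask inference mirrors the masked-language-modeling pretraining objective.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
masked_smiles = "c1ccccc1" + tokenizer.mask_token  # benzene with one masked token appended
for candidate in fill(masked_smiles)[:3]:
    print(candidate["token_str"], candidate["score"])
```

For downstream property prediction on a MoleculeNet task, one would typically replace the masked-LM head with a classification or regression head (e.g. via `AutoModelForSequenceClassification`) and fine-tune on the labeled dataset; attention patterns can then be inspected with BertViz as the abstract notes.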
