Paper title
Molecular representation learning with language models and domain-relevant auxiliary tasks
Authors
Abstract
We apply a Transformer architecture, specifically BERT, to learn flexible and high-quality molecular representations for drug discovery problems. We study the impact of using different combinations of self-supervised tasks for pre-training, and present our results for established Virtual Screening and QSAR benchmarks. We show that: i) the selection of appropriate self-supervised task(s) for pre-training has a significant impact on performance in subsequent downstream tasks such as Virtual Screening; ii) using auxiliary tasks with more domain relevance for chemistry, such as learning to predict calculated molecular properties, increases the fidelity of our learnt representations; iii) finally, molecular representations learnt by our model MolBERT improve upon the current state of the art on the benchmark datasets.
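To make the described pre-training setup concrete, below is a minimal sketch in PyTorch with RDKit, not the authors' released implementation: a small Transformer encoder over SMILES tokens is trained jointly on a masked-token objective and an auxiliary head that regresses calculated molecular properties. The vocabulary, encoder size, pooling strategy, and the specific property set (molecular weight, logP, TPSA) are illustrative assumptions.

```python
# Minimal sketch of MolBERT-style multi-task pre-training (illustrative;
# not the paper's released code). A Transformer encoder over SMILES tokens
# feeds two heads: masked-token prediction and regression of calculated
# molecular properties as a domain-relevant auxiliary task.
import torch
import torch.nn as nn
from rdkit import Chem
from rdkit.Chem import Descriptors

# Hypothetical character-level vocabulary; real work would use a proper
# SMILES tokenizer.
VOCAB = ["<pad>", "<mask>"] + list("BCFHINOPSbcnops()[]=#-+123456789")
CHAR2ID = {c: i for i, c in enumerate(VOCAB)}

def physchem_targets(smiles: str) -> torch.Tensor:
    """Calculated properties used as auxiliary labels (an assumed subset)."""
    mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
    return torch.tensor([
        Descriptors.MolWt(mol),    # molecular weight
        Descriptors.MolLogP(mol),  # lipophilicity (Crippen logP)
        Descriptors.TPSA(mol),     # topological polar surface area
    ])

class MolBertSketch(nn.Module):
    def __init__(self, vocab_size=len(VOCAB), d_model=128, n_props=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(d_model, vocab_size)  # masked-token logits
        self.prop_head = nn.Linear(d_model, n_props)    # auxiliary regression

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        pooled = h.mean(dim=1)  # mean pooling; BERT itself uses a [CLS] token
        return self.mlm_head(h), self.prop_head(pooled)

# Joint objective: cross-entropy on masked positions plus MSE on properties.
model = MolBertSketch()
ids = torch.tensor([[CHAR2ID[c] for c in "CCO"]])  # ethanol
mlm_logits, prop_pred = model(ids)
prop_loss = nn.functional.mse_loss(prop_pred[0], physchem_targets("CCO"))
```

For downstream QSAR or Virtual Screening tasks, the pre-training heads would be discarded and the pooled representation used as the molecular embedding for fine-tuning.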