Paper title
Re-evaluating sample efficiency in de novo molecule generation
Paper authors
Paper abstract
De novo molecule generation can suffer from data inefficiency, requiring either large amounts of training data or many sampled data points to optimize an objective. The latter is a particular disadvantage when combining deep generative models with the computationally expensive molecule scoring functions (a.k.a. oracles) commonly used in computer-aided drug design. Recent works have therefore focused on methods to improve sample efficiency in de novo drug design, or on benchmarking it. In this work, we discuss and adapt a recent sample efficiency benchmark so that it also reflects realistic goals for the quality of the chemistry generated, which must always be considered in small-molecule drug design; we then re-evaluate all benchmarked generative models. We find that accounting for molecular weight and LogP relative to the training data, and for the diversity of the chemistry proposed, re-orders the ranking of generative models. In addition, we benchmark a recently proposed method to improve sample efficiency (Augmented Hill-Climb) and find that it ranks top when both sample efficiency and the chemistry of the generated molecules are considered. Continual improvements in sample efficiency and chemical desirability enable more routine integration of computationally expensive scoring functions on a more realistic timescale.
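To make the idea of a chemistry-aware sample efficiency score concrete, the following is a minimal Python sketch, not the benchmark's actual implementation. It assumes that molecular weight and LogP have already been computed for every sampled molecule, that "with respect to the training data" is modelled as a window of the training mean plus or minus `n_std` standard deviations, and that sample efficiency is summarised as the average of the running top-k mean score over all oracle calls. The function names, the window width, and the metric form are all illustrative assumptions; the abstract specifies none of them, and the diversity criterion is omitted here.

```python
import heapq
from statistics import mean, pstdev


def property_bounds(train_values, n_std=4.0):
    """Return (low, high) as mean +/- n_std standard deviations.

    Assumption: the acceptable property range is a symmetric window
    around the training-set mean; the abstract does not give the rule.
    """
    mu = mean(train_values)
    sigma = pstdev(train_values)
    return mu - n_std * sigma, mu + n_std * sigma


def filtered_topk_auc(samples, train_mw, train_logp, k=10, n_std=4.0):
    """Average running top-k mean score over all oracle calls.

    samples: iterable of (score, mol_weight, logp) in sampling order.
    Molecules outside the training-derived MW/LogP windows still cost
    an oracle call but never enter the top-k set (hypothetical rule).
    """
    mw_lo, mw_hi = property_bounds(train_mw, n_std)
    lp_lo, lp_hi = property_bounds(train_logp, n_std)

    top = []    # min-heap holding the best k accepted scores so far
    curve = []  # running top-k mean after each oracle call
    for score, mw, logp in samples:
        if mw_lo <= mw <= mw_hi and lp_lo <= logp <= lp_hi:
            if len(top) < k:
                heapq.heappush(top, score)
            elif score > top[0]:
                heapq.heapreplace(top, score)
        # Dividing by k (not len(top)) penalises runs that have not
        # yet found k acceptable molecules.
        curve.append(sum(top) / k)

    return sum(curve) / len(curve) if curve else 0.0


if __name__ == "__main__":
    train_mw = [300.0, 320.0, 340.0, 360.0, 380.0]
    train_logp = [1.0, 2.0, 3.0, 4.0, 5.0]
    # The first sample scores highest but is far heavier than the training data.
    samples = [(0.9, 900.0, 3.0), (0.5, 350.0, 3.0), (0.7, 330.0, 2.5)]
    print(filtered_topk_auc(samples, train_mw, train_logp, k=1, n_std=2.0))
    print(filtered_topk_auc(samples, train_mw, train_logp, k=1, n_std=100.0))
```

In this toy run, the narrow window rejects the high-scoring but out-of-range first molecule, so the filtered score is lower than the effectively unfiltered one. This is the kind of effect that could re-order a ranking of generative models once property constraints are taken into account.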