Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties

Paper Title

Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties

Authors

Temizer, Asu Büşra; Uludoğan, Gökçe; Özçelik, Rıza; Koulani, Taha; Ozkirimli, Elif; Ulgen, Kutlu O.; Karalı, Nilgün; Özgür, Arzucan

Abstract

Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words (analogous to the words that make up sentences in human languages) and then apply advanced natural language processing techniques for tasks such as $\textit{de novo}$ drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. This study aims to investigate the chemical vocabularies generated by popular subword tokenization algorithms, namely Byte Pair Encoding (BPE), WordPiece, and Unigram, and identify key chemical words associated with protein-ligand binding. To this end, we build a language-inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf-idf weighting. Further, we conduct case studies on a number of protein families to analyze the impact of key chemical words on binding. Through our analysis, we find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our findings will help shed light on the chemistry captured by the chemical words, and by machine learning models for drug discovery at large.
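The document-and-weighting idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each protein target's "document" is already a list of chemical words pooled from its high-affinity ligands, and it uses a plain tf * log(N/df) score, since the abstract does not specify which TF-IDF variant is used. The function name `tfidf_key_words`, the target names, and the toy chemical words are all hypothetical and hand-made, not the output of BPE, WordPiece, or Unigram.

```python
import math
from collections import Counter


def tfidf_key_words(documents, top_k=3):
    """Rank chemical words per target by a plain TF-IDF score.

    documents: dict mapping a target name to the list of chemical words
    pooled from that target's high-affinity ligands (assumed pre-tokenized).
    """
    n_docs = len(documents)

    # Document frequency: the number of targets whose document contains the word.
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    key_words = {}
    for target, words in documents.items():
        counts = Counter(words)
        total = len(words)
        # Assumed weighting: relative term frequency times log(N / df).
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        key_words[target] = ranked[:top_k]
    return key_words


# Hypothetical toy documents; these words are illustrative only.
toy_documents = {
    "target_A": ["c1ccccc1", "C(=O)N", "S(=O)(=O)N", "S(=O)(=O)N"],
    "target_B": ["c1ccccc1", "C(=O)N", "C(=O)O", "C(=O)O"],
    "target_C": ["c1ccccc1", "C(F)(F)F", "C(=O)N"],
}

result = tfidf_key_words(toy_documents, top_k=1)
for target, ranked in result.items():
    print(target, ranked)
```

In this toy setup, words shared by every target (such as the benzene ring token) receive a score of zero, while words concentrated in one target's ligands rank highest, which mirrors the target specificity the abstract describes.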
