Paper title
What is a meaningful representation of protein sequences?
Authors
Abstract
How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This raises the question of what constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the second we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.