Paper title

Machine learning applications to DNA subsequence and restriction site analysis

Authors

Moyer, Ethan J., Das, Anup

Abstract

Based on the BioBricks standard, restriction synthesis is a novel catabolic, iterative DNA synthesis method that uses endonucleases to synthesize a query sequence from a reference sequence. In this work, the reference sequence is built from shorter subsequences by classifying them as applicable or inapplicable for the synthesis method using three machine learning methods: Support Vector Machines (SVMs), random forest, and Convolutional Neural Networks (CNNs). Before these methods are applied to the data, a series of feature selection, curation, and reduction steps is applied to create an accurate and representative feature space. Following these preprocessing steps, three pipelines are proposed to classify subsequences based on their nucleotide sequence and other relevant features corresponding to the restriction sites of over 200 endonucleases. The sensitivities of the SVM, random forest, and CNN pipelines are 94.9%, 92.7%, and 91.4%, respectively. Each method scores lower in specificity: 77.4%, 85.7%, and 82.4% for SVMs, random forest, and CNNs, respectively. In addition to analyzing these results, the misclassifications in SVMs and CNNs are investigated. Across these two models, features with a derived nucleotide specificity visually contribute more to classification than other features. This observation is an important factor when considering new nucleotide sensitivity features for future studies.
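The abstract reports each pipeline's performance as sensitivity and specificity. As a reminder of what those two figures measure, the minimal sketch below computes them from binary labels, assuming 1 marks a subsequence as applicable and 0 as inapplicable. The function name and the toy labels are hypothetical and chosen only for illustration; they are not the paper's code or data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumption for this sketch: 1 = applicable, 0 = inapplicable.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Sensitivity: fraction of truly applicable subsequences that were found.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of truly inapplicable subsequences that were rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels, invented for illustration only (not results from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Read this way, a pipeline whose sensitivity exceeds its specificity, as all three reported pipelines do, misses few applicable subsequences but accepts a larger share of inapplicable ones.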
