Paper Title
DNA Methylation Data to Predict Suicidal and Non-Suicidal Deaths: A Machine Learning Approach
Authors
Abstract
The objective of this study is to predict suicidal and non-suicidal deaths from DNA methylation data using a modern machine learning algorithm. We used support vector machines (SVMs) to classify existing secondary data, consisting of normalized methylated DNA probe intensities from tissue of two cortical brain regions, in order to distinguish suicide cases from control cases. Before classification, we applied principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to reduce the dimensionality of the data. The modern data visualization method t-SNE performed better than PCA at dimensionality reduction, and it can capture possible non-linear patterns in its low-dimensional representation of the data. We applied four-fold cross-validation in which the output of t-SNE was used as training data for the SVM. Despite the use of cross-validation, the nominally perfect prediction of suicidal deaths for the BA11 data suggests possible over-fitting of the model. The study may also have suffered from 'spectrum bias', since the individuals studied came only from two extreme scenarios. This research constitutes a baseline study for classifying suicidal and non-suicidal deaths from DNA methylation data. Future studies with larger sample sizes, possibly incorporating methylation data from living individuals, may reduce the bias and improve the accuracy of the results.
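The pipeline described in the abstract (dimensionality reduction followed by an SVM evaluated with four-fold cross-validation) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code. It assumes scikit-learn and NumPy, and it uses synthetic random data in place of the methylation data. The sample count, feature count, perplexity, RBF kernel, and two-dimensional embeddings are all assumptions, since the abstract does not state them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for normalized probe intensities (NOT real data):
# 40 hypothetical samples x 200 hypothetical probes, 20 per class.
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)   # 0 = control, 1 = suicide (labels assumed)
X = rng.normal(size=(40, 200))
X[y == 1, :10] += 1.0               # small artificial shift between the classes

# Four-fold stratified cross-validation, as named in the abstract.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# t-SNE route: embed all samples first, then classify the embedding.
# scikit-learn's TSNE has no transform() for unseen samples, so the
# embedding is computed on the full data set before the folds are split.
X_tsne = TSNE(n_components=2, perplexity=10, init="pca",
              random_state=0).fit_transform(X)
tsne_scores = cross_val_score(SVC(kernel="rbf"), X_tsne, y, cv=cv)

# PCA route: PCA can be refit inside each training fold via a pipeline.
pca_svm = make_pipeline(PCA(n_components=2), SVC(kernel="rbf"))
pca_scores = cross_val_score(pca_svm, X, y, cv=cv)

print("t-SNE + SVM mean accuracy:", tsne_scores.mean())
print("PCA + SVM mean accuracy:", pca_scores.mean())
```

In this sketch the t-SNE embedding is computed from every sample, including those later held out for testing. Under that assumption, the cross-validated accuracy of the t-SNE route may be optimistic, which is in line with the over-fitting concern the abstract raises.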