Mining Functionally Related Genes with Semi-Supervised Learning

Paper Title

Mining Functionally Related Genes with Semi-Supervised Learning

Authors

Shen, Kaiyu; Bunescu, Razvan; Wyatt, Sarah E.

Abstract

The study of biological processes can greatly benefit from tools that automatically predict gene functions or directly cluster genes based on shared functionality. Existing data mining methods predict protein functionality by exploiting data obtained from high-throughput experiments or meta-scale information from public databases. Most existing prediction tools are targeted at predicting protein functions that are described in the Gene Ontology (GO). However, in many cases biologists wish to discover functionally related genes for which GO terms are inadequate. In this paper, we introduce a rich set of features and use them in conjunction with semi-supervised learning approaches in order to expand an initial set of seed genes to a larger cluster of functionally related genes. Among all the semi-supervised methods that were evaluated, the framework of learning with positive and unlabeled examples (LPU) is shown to be especially appropriate for mining functionally related genes. When evaluated on experimentally validated benchmark data, the LPU approaches significantly outperform a standard supervised learning algorithm as well as an established state-of-the-art method. Given an initial set of seed genes, our best performing approach could be used to mine functionally related genes in a wide range of organisms.
