Title
Graph Neural Networks for Breast Cancer Data Integration
Authors
Abstract
International initiatives such as METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) have collected several multigenomic and clinical data sets to identify the underlying molecular processes that take place throughout the evolution of various cancers. Numerous Machine Learning and statistical models have been designed and trained to analyze these types of data independently; however, the integration of such differently shaped and sourced information streams has not been extensively studied. Better integrating these data sets and generating meaningful representations that can ultimately be leveraged for cancer detection tasks could lead to better-suited treatments for patients. Hence, we propose a novel learning pipeline comprising three steps: the integration of cancer data modalities as graphs, followed by the application of Graph Neural Networks in an unsupervised setting to generate lower-dimensional embeddings from the combined data, and finally the feeding of the new representations to a cancer sub-type classification model for evaluation. Because METABRIC does not store relationships between the patient modalities, the graph construction algorithms are described in depth, with a discussion of their influence on the quality of the generated embeddings. We also present the models used to generate the lower-dimensional latent-space representations: Graph Neural Networks, Variational Graph Autoencoders and Deep Graph Infomax. In parallel, the pipeline is tested on a synthetic dataset to demonstrate that the characteristics of the underlying data, such as homophily levels, greatly influence the performance of the pipeline, whose accuracy ranges from 51% to 98% on artificial data and from 13% to 80% on METABRIC. This project has the potential to improve cancer data understanding and encourages the transition of regular data sets to graph-shaped data.
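To make the role of graph construction and homophily more concrete, the following is a minimal sketch and not the paper's method. The abstract does not state which construction algorithm is used, so a k-nearest-neighbour rule is assumed here purely for illustration. The function names `knn_graph` and `edge_homophily`, the value `k=5`, and the synthetic Gaussian features are all hypothetical; no METABRIC data is involved. The sketch builds a patient graph from a feature matrix and measures edge homophily, taken here as the fraction of edges whose endpoints share a class label.

```python
import numpy as np


def knn_graph(features: np.ndarray, k: int) -> np.ndarray:
    """Return a symmetric 0/1 adjacency matrix linking each row to its k nearest rows.

    Assumption: Euclidean k-NN stands in for the (unspecified) construction algorithm.
    """
    n = features.shape[0]
    # Pairwise squared Euclidean distances between all rows (patients).
    diff = features[:, None, :] - features[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=int)
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbours.ravel()] = 1
    return np.maximum(adj, adj.T)  # symmetrise to get an undirected graph


def edge_homophily(adj: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of undirected edges whose two endpoints carry the same label."""
    src, dst = np.nonzero(np.triu(adj, k=1))  # count each undirected edge once
    if src.size == 0:
        return float("nan")
    return float((labels[src] == labels[dst]).mean())


# Hypothetical synthetic data: 100 "patients", 5 features, 2 classes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
separated = rng.normal(size=(100, 5)) + 10.0 * labels[:, None]  # classes far apart
mixed = rng.normal(size=(100, 5))                               # features ignore the label

h_separated = edge_homophily(knn_graph(separated, k=5), labels)
h_mixed = edge_homophily(knn_graph(mixed, k=5), labels)
print(f"homophily (separated classes): {h_separated:.2f}")
print(f"homophily (uninformative features): {h_mixed:.2f}")
```

Under these assumptions, the graph built from well-separated classes links almost exclusively same-class patients, while the graph built from uninformative features does not. This is one plausible reading of why homophily levels could affect the quality of embeddings that a Graph Neural Network produces from the graph.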