Paper Title

Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping

Authors

Yichi Zhang, Molei Liu, Matey Neykov, Tianxi Cai

Abstract

Electronic Health Records (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite their potential, EHR data are currently underutilized for discovery research because of a major limitation: the lack of precise phenotype information. To overcome this difficulty, recent efforts have been devoted to developing supervised algorithms that accurately predict phenotypes from relatively small training datasets with gold-standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, $p$, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method for the high-dimensional setting that borrows information from two sources: a small labeled dataset, in which both the label $Y$ and the feature set $X$ are observed, and a much larger unlabeled dataset, in which only $X$ is observed. A surrogate variable $S$ that is predictive of $Y$ is available for all patients. Under a working prior assumption that $S$ is related to $X$ only through $Y$, which is allowed to hold only approximately, we propose a prior adaptive semi-supervised (PASS) estimator that adaptively incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and demonstrate its superiority over existing estimators via simulation studies. The proposed method is applied to an EHR phenotyping study of rheumatoid arthritis at Partners HealthCare.
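
The idea of "shrinking the estimator towards a direction derived under the prior" can be illustrated with a small toy sketch. This is not the authors' implementation, and every modelling choice here is an assumption made for illustration: the features are simulated as Gaussian, the prior direction is taken to be the least-squares slope of $S$ on $X$ over all patients, and the labeled-data fit writes the coefficient vector as `rho * alpha + delta` with an L1 penalty on `delta`, solved by plain proximal gradient descent. The function names `prior_direction` and `shrink_to_direction` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prior_direction(X, S):
    """Least-squares slope of the surrogate S on X (intercept dropped).

    Assumption for this sketch: when S depends on X only through Y and X is
    roughly Gaussian, this slope points approximately along the Y-on-X
    coefficient vector.
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, S, rcond=None)
    return coef[1:]

def shrink_to_direction(X, y, alpha, lam, n_iter=2000):
    """Logistic fit with beta = rho * alpha + delta and an L1 penalty on delta.

    The intercept b0 and the scale rho are unpenalized; delta is
    soft-thresholded, so a large lam forces beta onto the prior direction.
    """
    n, p = X.shape
    u = X @ alpha
    # Step size 1/L, where L bounds the curvature of the mean logistic loss.
    Z = np.column_stack([np.ones(n), u, X])
    step = 1.0 / (0.25 * np.linalg.norm(Z, 2) ** 2 / n)
    b0, rho, delta = 0.0, 0.0, np.zeros(p)
    for _ in range(n_iter):
        resid = sigmoid(b0 + rho * u + X @ delta) - y  # gradient w.r.t. the linear predictor
        b0 -= step * resid.mean()
        rho -= step * (u @ resid) / n
        d = delta - step * (X.T @ resid) / n
        delta = np.sign(d) * np.maximum(np.abs(d) - step * lam, 0.0)
    return b0, rho, delta, rho * alpha + delta

# Simulated data (all values hypothetical): N patients, only the first n labeled.
rng = np.random.default_rng(0)
N, n, p = 20000, 100, 10
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -1.0, 0.5]
X = rng.normal(size=(N, p))
Y = rng.binomial(1, sigmoid(X @ beta_true))
S = Y + 0.5 * rng.normal(size=N)  # surrogate related to X only through Y

alpha_hat = prior_direction(X, S)          # uses X and S for all patients
b0, rho, delta, beta_hat = shrink_to_direction(X[:n], Y[:n], alpha_hat, lam=0.05)
```

In this sketch the penalty level `lam` controls how strongly the prior is trusted: a large value pins the estimate to a multiple of `alpha_hat`, while a small value lets the labeled data move individual coefficients away from the prior direction when the prior assumption holds only approximately.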
