Feature Selection on Lyme Disease Patient Survey Data

Paper Title

Feature Selection on Lyme Disease Patient Survey Data

Authors

Joshua Vendrow, Jamie Haddock, Deanna Needell, Lorraine Johnson

Abstract

Lyme disease is a rapidly growing illness that remains poorly understood within the medical community. Critical questions about when and why patients respond to treatment or stay ill, what kinds of treatments are effective, and even how to properly diagnose the disease remain largely unanswered. We investigate these questions by applying machine learning techniques to a large-scale Lyme disease patient registry, MyLymeData, developed by the nonprofit LymeDisease.org. We apply various machine learning methods to measure the effect of individual features in predicting participants' answers to the Global Rating of Change (GROC) survey questions, which assess the self-reported degree to which their condition improved, worsened, or remained unchanged following antibiotic treatment. We use basic linear regression, support vector machines, neural networks, entropy-based decision tree models, and $k$-nearest neighbors approaches. We first analyze the general performance of the models and then identify the most important features for predicting participant answers to GROC. After we identify the "key" features, we separate them from the dataset and demonstrate the effectiveness of these features at identifying GROC. In doing so, we highlight possible directions for future study both mathematically and clinically.
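
The abstract mentions entropy-based models as one way to measure how much an individual feature helps predict a GROC answer. As a rough illustration of that general idea only, the sketch below ranks categorical features by information gain, the entropy criterion commonly used in decision trees. It is not the authors' pipeline: the rows are made-up toy data (not MyLymeData), and the feature names `feature_a` and `feature_b`, the labels, and the helper functions are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on one categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels inside each feature-value group.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """Return (feature_name, gain) pairs sorted from most to least informative."""
    names = rows[0].keys()
    gains = {name: information_gain([r[name] for r in rows], labels) for name in names}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up survey rows (NOT MyLymeData); feature names are hypothetical.
rows = [
    {"feature_a": "yes", "feature_b": "low"},
    {"feature_a": "yes", "feature_b": "high"},
    {"feature_a": "no", "feature_b": "low"},
    {"feature_a": "no", "feature_b": "high"},
]
labels = ["improved", "improved", "worsened", "worsened"]

ranking = rank_features(rows, labels)
for name, gain in ranking:
    print(f"{name}: {gain:.3f} bits")
```

In this toy data `feature_a` separates the two labels perfectly while `feature_b` carries no information about them, so `feature_a` ranks first. On real survey data one would typically compare such a ranking against the rankings implied by the other model families the abstract lists.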
