Exploring Hybrid and Ensemble Models for Multiclass Prediction of Mental Health Status on Social Media

Paper Title

Exploring Hybrid and Ensemble Models for Multiclass Prediction of Mental Health Status on Social Media

Authors

Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

Abstract

In recent years, there has been a surge of interest in research on automatic mental health detection (MHD) from social media data leveraging advances in natural language processing and machine learning techniques. While significant progress has been achieved in this interdisciplinary research area, the vast majority of work has treated MHD as a binary classification task. The multiclass classification setup is, however, essential if we are to uncover the subtle differences among the statistical patterns of language use associated with particular mental health conditions. Here, we report on experiments aimed at predicting six conditions (anxiety, attention deficit hyperactivity disorder, bipolar disorder, post-traumatic stress disorder, depression, and psychological stress) from Reddit social media posts. We explore and compare the performance of hybrid and ensemble models leveraging transformer-based architectures (BERT and RoBERTa) and BiLSTM neural networks trained on within-text distributions of a diverse set of linguistic features. This set encompasses measures of syntactic complexity, lexical sophistication and diversity, readability, and register-specific ngram frequencies, as well as sentiment and emotion lexicons. In addition, we conduct feature ablation experiments to investigate which types of features are most indicative of particular mental health conditions.
