A Step Towards Preserving Speakers' Identity While Detecting Depression Via Speaker Disentanglement

Paper title

A Step Towards Preserving Speakers' Identity While Detecting Depression Via Speaker Disentanglement

Authors

Vijay Ravi, Jinhan Wang, Jonathan Flint, Abeer Alwan

Abstract

Preserving a patient's identity is a challenge for automatic, speech-based diagnosis of mental health disorders. In this paper, we address this issue by proposing adversarial disentanglement of depression characteristics and speaker identity. The model used for depression classification is trained in a speaker-identity-invariant manner by minimizing depression prediction loss and maximizing speaker prediction loss during training. The effectiveness of the proposed method is demonstrated on two datasets - DAIC-WOZ (English) and CONVERGE (Mandarin), with three feature sets (Mel-spectrograms, raw-audio signals, and the last-hidden-state of Wav2vec2.0), using a modified DepAudioNet model. With adversarial training, depression classification improves for every feature when compared to the baseline. Wav2vec2.0 features with adversarial learning resulted in the best performance (F1-score of 69.2% for DAIC-WOZ and 91.5% for CONVERGE). Analysis of the class-separability measure (J-ratio) of the hidden states of the DepAudioNet model shows that when adversarial learning is applied, the backend model loses some speaker-discriminability while it improves depression-discriminability. These results indicate that there are some components of speaker identity that may not be useful for depression detection and minimizing their effects provides a more accurate diagnosis of the underlying disorder and can safeguard a speaker's identity.
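
The training objective described in the abstract (minimize the depression prediction loss while maximizing the speaker prediction loss) can be sketched as one combined loss. This is a minimal illustration rather than the authors' implementation: the function names, the weighting factor `lam`, and the use of cross-entropy for both heads are assumptions that the abstract does not state.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under predicted probabilities."""
    return -math.log(probs[target])


def adversarial_loss(dep_probs, dep_label, spk_probs, spk_label, lam=0.1):
    """Combined objective: low when depression is predicted well and the
    speaker is predicted poorly.

    `lam` is a hypothetical trade-off weight (an assumption, not a value
    taken from the paper).
    """
    dep_loss = cross_entropy(dep_probs, dep_label)
    spk_loss = cross_entropy(spk_probs, spk_label)
    # Subtracting the speaker loss means minimizing the total
    # maximizes the speaker prediction loss.
    return dep_loss - lam * spk_loss


# Same depression prediction, two different speaker predictions.
confident_speaker = adversarial_loss([0.8, 0.2], 0, [0.9, 0.05, 0.05], 0)
confused_speaker = adversarial_loss([0.8, 0.2], 0, [1 / 3, 1 / 3, 1 / 3], 0)
print(confused_speaker < confident_speaker)
```

Under this sketch, a model whose representation reveals less about the speaker receives a lower total loss for the same depression prediction, which is the sense in which training becomes speaker-identity-invariant.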
