Paper Title
Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load
Authors
Abstract
As a neurophysiological response to threat or adverse conditions, stress can affect cognition, emotion and behaviour, with potentially detrimental effects on health in the case of sustained exposure. Since the affective content of speech is inherently modulated by an individual's physical and mental state, a substantial body of research has been devoted to the study of paralinguistic correlates of stress-inducing task load. Historically, voice stress analysis (VSA) has been conducted using conventional digital signal processing (DSP) techniques. Despite the development of modern methods based on deep neural networks (DNNs), accurately detecting stress in speech remains difficult due to the wide variety of stressors and considerable variability in individual stress perception. To that end, we introduce a set of five datasets for task load detection in speech. The voice recordings were collected as either cognitive or physical stress was induced in a cohort of volunteers, with a cumulative number of more than a hundred speakers. We used the datasets to design and evaluate a novel self-supervised audio representation that leverages the effectiveness of handcrafted (DSP-based) features and the complexity of data-driven DNN representations. Notably, the proposed approach outperformed both extensive handcrafted feature sets and novel DNN-based audio representation learning approaches.