Paper Title

Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Speech Models

Paper Authors

Ramon Sanabria, Hao Tang, Sharon Goldwater

Paper Abstract

Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing variable-length spoken word segments. In this work, we study several pre-trained models and pooling methods for constructing AWEs with self-supervised representations. Owing to the contextualized nature of self-supervised representations, we hypothesize that simple pooling methods, such as averaging, might already be useful for constructing AWEs. When evaluating on a standard word discrimination task, we find that HuBERT representations with mean-pooling rival the state of the art on English AWEs. More surprisingly, despite being trained only on English, HuBERT representations evaluated on Xitsonga, Mandarin, and French consistently outperform the multilingual model XLSR-53 (as well as Wav2Vec 2.0 trained on English).
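As a rough illustration of the mean-pooling approach the abstract describes, the sketch below extracts HuBERT frame-level representations for a spoken word segment and averages them into a fixed-dimensional AWE. The checkpoint name, the use of the final hidden layer, and the cosine-similarity comparison are illustrative assumptions, not the paper's exact experimental setup.

```python
# Minimal sketch (not the paper's exact setup): build an acoustic word embedding
# (AWE) by mean-pooling HuBERT frame representations over a spoken word segment.
# Assumes the Hugging Face checkpoint "facebook/hubert-base-ls960" and 16 kHz audio.
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

def awe(segment: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Mean-pool HuBERT frame features of a 1-D waveform segment into a fixed-dim AWE."""
    inputs = extractor(segment.numpy(), sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        frames = model(inputs.input_values).last_hidden_state  # (1, T, 768)
    return frames.mean(dim=1).squeeze(0)                       # (768,)

# Toy word-discrimination check: same-word segment pairs should score higher
# cosine similarity than different-word pairs (random placeholder audio here).
seg_a = torch.randn(16000)  # ~1 s "word" segment (placeholder)
seg_b = torch.randn(12000)
print(F.cosine_similarity(awe(seg_a), awe(seg_b), dim=0).item())
```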
