Paper Title
S-vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder
Paper Authors
Paper Abstract
One of the most popular speaker embeddings is the x-vector, which is obtained from an architecture that gradually builds a larger temporal context layer by layer. In this paper, we propose deriving speaker embeddings from a Transformer encoder trained for speaker classification. Self-attention, the building block of the Transformer encoder, attends to all the features across the entire utterance and may be better suited to capturing the speaker characteristics in an utterance. We refer to the speaker embeddings obtained from the proposed speaker classification model as s-vectors to emphasize that they come from an architecture that relies heavily on self-attention. Through experiments, we demonstrate that s-vectors outperform x-vectors. In addition to s-vectors, we also propose a new Transformer-encoder-based architecture for speaker verification as a replacement for conventional verification based on probabilistic linear discriminant analysis (PLDA). This architecture is inspired by the next-sentence-prediction task of Bidirectional Encoder Representations from Transformers (BERT): we feed it the s-vectors of two utterances to verify whether they belong to the same speaker. We name this architecture the Transformer encoder speaker authenticator (TESA). Our experiments show that s-vectors scored with TESA perform better than s-vectors scored with conventional PLDA-based speaker verification.
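The abstract gives only a high-level description of the two architectures, so the following is a minimal PyTorch sketch rather than the authors' implementation. The class name SVectorExtractor, the layer sizes (feat_dim, d_model, nhead, num_layers, num_speakers), and the choice of mean pooling over the encoder outputs are all illustrative assumptions; the paper may use different pooling and hyper-parameters.

```python
import torch
import torch.nn as nn

class SVectorExtractor(nn.Module):
    """Transformer-encoder speaker classifier (sketch). The pooled hidden
    state just before the classification layer is taken as the s-vector."""
    def __init__(self, feat_dim=80, d_model=256, nhead=4,
                 num_layers=4, num_speakers=1000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # map acoustic features to model dim
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_speakers)

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        h = self.encoder(self.proj(feats))     # self-attention over the whole utterance
        s_vector = h.mean(dim=1)               # utterance-level pooling -> s-vector (assumed)
        logits = self.classifier(s_vector)     # speaker logits, used only during training
        return s_vector, logits
```

TESA is described as BERT-inspired: the s-vectors of two utterances are presented to a Transformer encoder that decides whether they come from the same speaker. The sketch below assumes a learned [CLS]-style token whose output position feeds a binary decision head, mirroring BERT's next-sentence-prediction setup; the actual pairing mechanism in the paper may differ.

```python
class TESA(nn.Module):
    """Same-speaker classifier over a pair of s-vectors (sketch),
    analogous to BERT's next-sentence-prediction head."""
    def __init__(self, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))  # assumed [CLS]-like token
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)      # same speaker vs. different speaker

    def forward(self, s1, s2):                 # s1, s2: (batch, d_model)
        cls = self.cls.expand(s1.size(0), -1, -1)
        seq = torch.cat([cls, s1.unsqueeze(1), s2.unsqueeze(1)], dim=1)
        out = self.encoder(seq)                # joint attention over the s-vector pair
        return self.head(out[:, 0])            # decision read off the [CLS] position

# Example: extract s-vectors for a batch of utterances, then score pairs with TESA.
extractor, tesa = SVectorExtractor(), TESA()
feats = torch.randn(8, 200, 80)               # 8 utterances, 200 frames, 80-dim features
s_vecs, _ = extractor(feats)                  # classification logits unused at test time
pair_logits = tesa(s_vecs[:4], s_vecs[4:])    # (4, 2) same/different-speaker logits
```

Replacing PLDA with TESA amounts to swapping a generative scoring back end for a discriminatively trained one: the verifier itself learns how two embeddings should interact, instead of scoring them under a fixed probabilistic model.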