Paper Title
Learning to Count Words in Fluent Speech enables Online Speech Recognition
Paper Authors
Paper Abstract
Sequence-to-sequence models, in particular the Transformer, achieve state-of-the-art results in Automatic Speech Recognition. Practical usage is, however, limited to cases where full-utterance latency is acceptable. In this work we introduce Taris, a Transformer-based online speech recognition system aided by an auxiliary task of incremental word counting. We use the cumulative word sum to dynamically segment speech and enable its eager decoding into words. Experiments performed on the LRS2, LibriSpeech, and Aishell-1 datasets of English and Mandarin speech show that the online system performs comparably to the offline one with a dynamic algorithmic delay of 5 segments. Furthermore, we show that the estimated segment length distribution resembles the word length distribution obtained with forced alignment, although our system does not require an exact segment-to-word equivalence. Taris introduces a negligible overhead compared to a standard Transformer, while the local relationship modelling between inputs and outputs grants invariance to sequence length by design.
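To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of how a cumulative word-count signal could be turned into dynamic segment boundaries for eager decoding. It assumes an auxiliary head that emits a per-frame score indicating how much of a word has been observed; all names here (frame_word_scores, emit_threshold, segment_by_cumulative_word_count) are illustrative and not part of the paper's API.

```python
from typing import List


def segment_by_cumulative_word_count(
    frame_word_scores: List[float],
    emit_threshold: float = 1.0,
) -> List[List[int]]:
    """Group frame indices into segments whenever the running sum of
    per-frame word-count estimates crosses the next integer threshold.

    This is a hypothetical illustration of cumulative-sum segmentation;
    the actual Taris model integrates this with a Transformer encoder-decoder.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    cumulative = 0.0
    next_boundary = emit_threshold
    for t, score in enumerate(frame_word_scores):
        current.append(t)
        cumulative += score
        if cumulative >= next_boundary:
            # Roughly one more word has been observed: close the segment so
            # the decoder could start producing output for it eagerly.
            segments.append(current)
            current = []
            next_boundary += emit_threshold
    if current:  # trailing frames that did not reach the next word boundary
        segments.append(current)
    return segments


if __name__ == "__main__":
    # Toy per-frame scores from a hypothetical word-counting head.
    scores = [0.1, 0.3, 0.4, 0.3, 0.2, 0.5, 0.4, 0.1, 0.6, 0.5]
    for i, seg in enumerate(segment_by_cumulative_word_count(scores)):
        print(f"segment {i}: frames {seg}")
```

In an online setting, the decoder would be allowed to look only a fixed number of such segments ahead (the abstract reports a dynamic algorithmic delay of 5 segments), rather than waiting for the full utterance.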