Paper Title
Continuous Pseudo-Labeling from the Start
Paper Authors
Paper Abstract
Self-training (ST), or pseudo-labeling, has recently sparked significant interest in the automatic speech recognition (ASR) community because of its success in harnessing unlabeled data. Unlike prior semi-supervised learning approaches that relied on iteratively regenerating pseudo-labels (PLs) from a trained model and using them to train a new model, recent state-of-the-art methods perform 'continuous training', where PLs are generated using a very recent version of the model being trained. Nevertheless, these approaches still rely on bootstrapping the ST using an initial supervised learning phase in which the model is trained on labeled data alone. We believe this has the potential for over-fitting to the labeled dataset in low-resource settings, and that ST from the start of training should reduce over-fitting. In this paper, we show how we can do this by dynamically controlling the evolution of PLs during the training process in ASR. To the best of our knowledge, this is the first study that shows the feasibility of generating PLs from the very start of training. We are able to achieve this using two techniques that avoid instabilities that lead to degenerate models that do not generalize. Firstly, we control the evolution of PLs through a curriculum that uses the online changes in PLs to control the membership of the cache of PLs and improve generalization. Secondly, we find that by sampling transcriptions from the predictive distribution, rather than only using the best transcription, we can stabilize training further. With these techniques, our ST models match prior works without an external language model.
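The abstract's second technique contrasts sampling a transcription from the model's predictive distribution with taking only the most likely (argmax) transcription. As a minimal illustration of that distinction, here is a toy sketch with per-frame categorical distributions; the vocabulary, `frame_probs`, and function names are hypothetical and stand in for a real ASR model's frame-level outputs, not the paper's implementation:

```python
import random

# Hypothetical per-frame predictive distributions over a tiny vocabulary.
# Each row is p(token | frame); rows sum to 1.
VOCAB = ["<blank>", "a", "b"]
frame_probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.2, 0.2],
    [0.1, 0.1, 0.8],
]

def best_transcription(probs):
    """Greedy decoding: take the argmax token at every frame."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]

def sampled_transcription(probs, rng):
    """Draw one token per frame from the predictive distribution,
    so less likely transcriptions can also become pseudo-labels."""
    return [rng.choices(range(len(p)), weights=p, k=1)[0] for p in probs]

rng = random.Random(0)
greedy = best_transcription(frame_probs)       # always the same
sampled = sampled_transcription(frame_probs, rng)  # varies with the seed
```

Greedy decoding always yields the same pseudo-label for a given model state, whereas sampling injects diversity proportional to the model's uncertainty, which the abstract credits with further stabilizing training.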