Paper Title
Mutual Information Maximization for Effective Lip Reading
Paper Authors
Paper Abstract
Lip reading has received increasing research interest in recent years due to the rapid development of deep learning and its widespread potential applications. A key point in obtaining good performance on the lip reading task is how effectively the representation can capture the lip movement information while resisting the noise resulting from changes in pose, lighting conditions, speaker appearance and so on. Towards this target, we propose to introduce mutual information constraints at both the local feature level and the global sequence level to enhance the relation of the features with the speech content. On the one hand, we constrain the features generated at each time step to carry a strong relation with the speech content by imposing a local mutual information maximization constraint (LMIM), leading to improvements in the model's ability to discover fine-grained lip movements and fine-grained differences among words with similar pronunciation, such as ``spend'' and ``spending''. On the other hand, we introduce a mutual information maximization constraint at the global sequence level (GMIM), to make the model pay more attention to discriminative key frames related to the speech content and less to the various noises that appear in the speaking process. By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading. To verify this method, we evaluate it on two large-scale benchmarks. We perform a detailed analysis and comparison on several aspects, including a comparison of LMIM and GMIM with the baseline, visualization of the learned representation and so on. The results not only prove the effectiveness of the proposed method but also report new state-of-the-art performance on both benchmarks.
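
The abstract does not specify how the mutual information constraints are estimated. Below is a minimal PyTorch sketch, assuming an InfoNCE-style lower bound, of how a local (per-time-step, LMIM) and a global (per-sequence, GMIM) MI-maximization loss could be attached to a lip-reading model. The class name MIMaximizationSketch, the projection layers feat_proj and label_embed, and the use of word-label embeddings as a stand-in for the "speech content" signal are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MIMaximizationSketch(nn.Module):
    """Illustrative LMIM/GMIM losses using an InfoNCE-style MI lower bound.
    The estimator and the label-embedding stand-in for speech content are
    assumptions for this sketch, not the paper's actual formulation."""

    def __init__(self, feat_dim: int, num_classes: int, embed_dim: int = 128):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)           # visual features -> shared space
        self.label_embed = nn.Embedding(num_classes, embed_dim)   # proxy for speech content

    @staticmethod
    def info_nce(queries, keys, targets, temperature: float = 0.1):
        # queries: (N, D), keys: (M, D), targets: (N,) index of each query's positive key.
        # Minimizing this contrastive cross-entropy maximizes an InfoNCE lower bound on MI.
        q = F.normalize(queries, dim=-1)
        k = F.normalize(keys, dim=-1)
        logits = q @ k.t() / temperature                          # (N, M) similarity matrix
        return F.cross_entropy(logits, targets)

    def forward(self, frame_feats, seq_feats, labels):
        # frame_feats: (B, T, feat_dim)  per-time-step features from the visual front-end
        # seq_feats:   (B, feat_dim)     pooled global sequence representation
        # labels:      (B,)              word labels, standing in for the speech content
        B, T, _ = frame_feats.shape
        content = self.label_embed(labels)                        # (B, D)

        # LMIM: each time step's feature should stay predictive of its own
        # sequence's speech content; other sequences in the batch act as negatives.
        local_q = self.feat_proj(frame_feats).reshape(B * T, -1)
        local_t = torch.arange(B, device=labels.device).repeat_interleave(T)
        lmim_loss = self.info_nce(local_q, content, local_t)

        # GMIM: the global sequence representation should be maximally
        # informative about the speech content.
        global_q = self.feat_proj(seq_feats)
        global_t = torch.arange(B, device=labels.device)
        gmim_loss = self.info_nce(global_q, content, global_t)

        return lmim_loss, gmim_loss


# Example usage with random tensors (batch of 4 sequences, 29 frames, 512-d features);
# in practice the two losses would be weighted and added to the classification loss.
mi = MIMaximizationSketch(feat_dim=512, num_classes=500)
lmim, gmim = mi(torch.randn(4, 29, 512), torch.randn(4, 512), torch.randint(0, 500, (4,)))
total_aux_loss = lmim + gmim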