Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models

Paper Title

Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models

Authors

Nguyen, Thai Binh; Nguyen, Quang Minh; Nguyen, Thi Thu Hien; Do, Quoc Truong; Luong, Chi Mai

Abstract

Studies on the Named Entity Recognition (NER) task have shown outstanding results that reach human parity on input text with correct formatting, such as proper punctuation and capitalization. However, these conditions do not hold in applications where the input is speech, because the text is generated by an automatic speech recognition (ASR) system, which does not consider text formatting. In this paper, we (1) present the first Vietnamese speech dataset for the NER task, and (2) present the first public, pre-trained, large-scale monolingual language model for Vietnamese, which achieves a new state of the art on the Vietnamese NER task, improving on the latest study by 1.3% absolute F1 score. Finally, (3) we propose a new pipeline for NER from speech that overcomes the text formatting problem by introducing a text capitalization and punctuation recovery model (CaPu) into the pipeline. The model takes input text from an ASR system and performs both tasks at the same time, producing properly formatted text that helps to improve NER performance. Experimental results indicate that the CaPu model improves the F1 score by nearly 4%.
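The pipeline described in the abstract (ASR output, then CaPu restoration, then NER) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the lexicon `TOY_CASING`, the functions `restore_capu`, `toy_ner`, and `ner_from_speech`, and the English example sentence are hypothetical stand-ins, not the authors' models, data, or code. The paper's CaPu and NER components are trained neural models; the rule-based versions here only show why restoring formatting before NER can matter.

```python
from typing import Dict, List

# Hypothetical casing lexicon standing in for a trained CaPu model.
TOY_CASING: Dict[str, str] = {"hanoi": "Hanoi", "vietnam": "Vietnam", "mai": "Mai"}


def restore_capu(asr_text: str) -> str:
    """Toy CaPu step: restore casing from a lexicon and add a final period."""
    tokens = [TOY_CASING.get(tok, tok) for tok in asr_text.split()]
    if not tokens:
        return ""
    # Capitalize the first character of the sentence.
    tokens[0] = tokens[0][0].upper() + tokens[0][1:]
    return " ".join(tokens) + "."


def toy_ner(text: str) -> List[str]:
    """Toy NER step: treat capitalized, non-sentence-initial tokens as entities."""
    tokens = text.rstrip(".").split()
    return [tok for i, tok in enumerate(tokens) if i > 0 and tok[0].isupper()]


def ner_from_speech(asr_text: str) -> List[str]:
    """Pipeline: ASR text -> CaPu restoration -> NER."""
    return toy_ner(restore_capu(asr_text))


if __name__ == "__main__":
    asr_output = "yesterday mai moved to hanoi"  # lowercase, unpunctuated ASR-style text
    print(toy_ner(asr_output))          # NER applied directly to the raw ASR text
    print(ner_from_speech(asr_output))  # NER applied after formatting is restored
```

In this toy setup the casing-dependent tagger finds nothing in the raw ASR text and recovers the entity candidates once formatting is restored, which mirrors, in a very simplified way, the motivation for inserting a CaPu step ahead of NER.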
