Codeswitched Sentence Creation using Dependency Parsing

Paper title

Codeswitched Sentence Creation using Dependency Parsing

Authors

Jain, Dhruval; Prabhu, Arun D; Vatsal, Shubham; Ramena, Gopi; Purre, Naresh

Abstract

Codeswitching has become one of the most common phenomena among multilingual speakers worldwide, especially in countries like India, which has around 23 official languages and roughly 300 million bilingual speakers. The scarcity of Codeswitched data is a bottleneck for exploring this domain across Natural Language Processing (NLP) tasks. We therefore present a novel algorithm that harnesses the syntactic structure of English grammar to generate grammatically sensible Codeswitched versions of English-Hindi, English-Marathi and English-Kannada data. Besides largely preserving grammatical correctness, our methodology also guarantees abundant generation of data from a very small sample of given data. We showcase the algorithm on multiple datasets, assess the quality of the generated Codeswitched data using some qualitative metrics, and provide baseline results for a couple of NLP tasks.
