Paper Title

Extractive Question Answering on Queries in Hindi and Tamil

Paper Authors

Adhitya Thirumala, Elisa Ferracane

Paper Abstract

Indic languages like Hindi and Tamil are underrepresented in the natural language processing (NLP) field compared to languages like English. Because of this underrepresentation, performance on NLP tasks (such as search algorithms) in Indic languages is inferior to that of their English counterparts. This difference disproportionately affects people from lower socioeconomic backgrounds, who consume the most Internet content in local languages. The goal of this project is to build an NLP model that performs better than pre-existing models on the task of extractive question answering (QA) over a public dataset in Hindi and Tamil. Extractive QA is an NLP task in which the answer to a question is extracted from a corresponding body of text. To build the best solution, we used three different models. The first model is an unmodified cross-lingual version of RoBERTa, known as XLM-RoBERTa, pretrained on 100 languages. The second model is based on the pretrained RoBERTa model with an extra classification head for question answering, but uses a custom Indic tokenizer; we then optimized hyperparameters and fine-tuned it on the Indic dataset. The third model is based on XLM-RoBERTa, with extra fine-tuning and training on the Indic dataset. We hypothesized that the third model would perform best because of the variety of languages XLM-RoBERTa was pretrained on and the additional fine-tuning on the Indic dataset. This hypothesis was proven wrong: the paired RoBERTa models performed the best because their training data was the most specific to the task, whereas much of the XLM-RoBERTa models' pretraining data was in neither Hindi nor Tamil.
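As a rough illustration of the extractive QA setup the abstract describes (a pretrained encoder plus a span-classification head that predicts answer start and end positions), the sketch below loads XLM-RoBERTa with a question-answering head via the Hugging Face Transformers library. The library, the checkpoint name, and the Hindi example are our assumptions for illustration, not the authors' exact setup, and the head is randomly initialized until it is fine-tuned on a QA dataset such as the Indic dataset mentioned above.

```python
# Minimal sketch (not from the paper): extractive QA with XLM-RoBERTa.
# Checkpoint name, question, and passage are illustrative assumptions.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# AutoModelForQuestionAnswering puts a span head (start/end logits) on top of
# the pretrained encoder; this head must be fine-tuned before answers are useful.
model_name = "xlm-roberta-base"  # assumed checkpoint; the paper's may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "ताजमहल कहाँ स्थित है?"          # "Where is the Taj Mahal?" (Hindi, illustrative)
context = "ताजमहल आगरा शहर में स्थित है।"   # "The Taj Mahal is in the city of Agra." (illustrative)

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start/end token positions and decode that span back to text.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)
```

The same pattern covers all three models in the abstract: the first uses the checkpoint as-is, while the second and third differ in tokenizer and in which weights are fine-tuned on the Indic QA data.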
