Title
MuLD: The Multitask Long Document Benchmark
Authors
Abstract
The impressive progress in NLP techniques has been driven by the development of multi-task benchmarks such as GLUE and SuperGLUE. While these benchmarks focus on tasks with one or two input sentences, there has been exciting work in designing efficient techniques for processing much longer inputs. In this paper, we present MuLD: a new long document benchmark consisting only of documents over 10,000 tokens. By modifying existing NLP tasks, we create a diverse benchmark which requires models to successfully capture long-term dependencies in the text. We evaluate how existing models perform, and find that our benchmark is much more challenging than its "short document" equivalents. Furthermore, by evaluating both regular and efficient transformers, we show that models with increased context length are better able to solve the tasks presented, suggesting that future improvements in these models are vital for solving similar long document problems. We release the data and code for baselines to encourage further research on efficient NLP models.