Evaluation of Large-Scale Synthetic Data for Grammar Error Correction

Paper title

Evaluation of large-scale synthetic data for Grammar Error Correction

Author

Kumar, Vanya Bannihatti

Abstract

Grammar Error Correction (GEC) relies mainly on the availability of large amounts of high-quality synthetic parallel data consisting of pairs of grammatically correct and erroneous sentences. The quality of the synthetic data is evaluated by how well a GEC system performs when pre-trained on it. However, this provides little insight into the factors that determine the quality of the data. This work therefore introduces three metrics (reliability, diversity, and distribution match) that give more insight into the quality of large-scale synthetic data generated for the GEC task and that can be evaluated automatically. Evaluating these three metrics automatically can also provide feedback to data generation systems and thereby improve the quality of dynamically generated synthetic data.
