Paper Title
Improve Sentence Alignment by Divide-and-conquer
Paper Authors
Paper Abstract
In this paper, we introduce a divide-and-conquer algorithm to improve sentence alignment speed. We use external bilingual sentence embeddings to find accurate hard delimiters for the parallel texts to be aligned. Using Monte Carlo simulation, we show experimentally that this divide-and-conquer algorithm turns any sentence alignment algorithm with quadratic time complexity into one with an average time complexity of O(N log N). On a standard OCR-generated dataset, our method improves the Bleualign baseline by 3 F1 points. In addition, when computational resources are restricted, our algorithm is faster than Vecalign in practice.
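The complexity claim in the abstract can be illustrated with a small Monte Carlo cost simulation. The sketch below is not the authors' experiment; it is a toy model under stated assumptions: each hard delimiter falls at a uniformly random position in the current segment, locating one delimiter costs work linear in the segment length, and segments of at most `leaf_size` sentences are passed to a quadratic base aligner. The names `simulated_cost`, `average_cost`, and `leaf_size` are hypothetical and chosen only for illustration.

```python
import math
import random


def simulated_cost(n, leaf_size=16, rng=random):
    """Simulated work units for aligning n sentence pairs.

    Uses an explicit stack instead of recursion so that very
    unbalanced splits cannot hit Python's recursion limit.
    """
    total = 0
    stack = [n]
    while stack:
        m = stack.pop()
        if m <= leaf_size:
            # Small segment: run the quadratic base aligner directly.
            total += m * m
            continue
        # Assumption: finding one hard delimiter costs O(m) work.
        total += m
        # Assumption: the delimiter position is uniformly random.
        cut = rng.randint(1, m - 1)
        stack.append(cut)
        stack.append(m - cut)
    return total


def average_cost(n, trials=200, seed=0):
    """Average simulated cost over several random trials."""
    rng = random.Random(seed)
    return sum(simulated_cost(n, rng=rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    # Under these assumptions, the first ratio should stay roughly flat
    # as n grows, while the second should shrink toward zero.
    for n in (1000, 2000, 4000, 8000):
        avg = average_cost(n)
        print(n, round(avg / (n * math.log2(n)), 3), round(avg / (n * n), 5))
```

Under this toy model the recurrence has the same shape as the average case of quicksort, which is why the simulated cost grows roughly like N log N rather than N squared. Real delimiters found from sentence embeddings need not be uniformly distributed, so the constants here carry no empirical weight.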