A Dataset for Plain Language Adaptation of Biomedical Abstracts

Paper Title

A Dataset for Plain Language Adaptation of Biomedical Abstracts

Authors

Kush Attal, Brian Ondov, Dina Demner-Fushman

Abstract

Though exponentially growing health-related literature has been made available to a broad audience online, the language of scientific articles can be difficult for the general public to understand. Therefore, adapting this expert-level language into plain language versions is necessary for the public to reliably comprehend the vast health-related literature. Deep Learning algorithms for automatic adaptation are a possible solution; however, gold standard datasets are needed for proper evaluation. Proposed datasets thus far consist of either pairs of comparable professional- and general public-facing documents or pairs of semantically similar sentences mined from such documents. This leads to a trade-off between imperfect alignments and small test sets. To address this issue, we created the Plain Language Adaptation of Biomedical Abstracts dataset. This dataset is the first manually adapted dataset that is both document- and sentence-aligned. The dataset contains 750 adapted abstracts, totaling 7643 sentence pairs. Along with describing the dataset, we benchmark automatic adaptation on the dataset with state-of-the-art Deep Learning approaches, setting baselines for future research.
