Title
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information
Authors
Abstract
Transformer-based language models usually treat texts as linear sequences. However, most texts also have an inherent hierarchical structure, i.e., parts of a text can be identified using their position in this hierarchy. In addition, section titles usually indicate the common topic of their respective sentences. We propose a novel approach to formulate, extract, encode and inject hierarchical structure information explicitly into an extractive summarization model based on a pre-trained, encoder-only Transformer language model (HiStruct+ model), which substantially improves the SOTA ROUGE scores for extractive summarization on PubMed and arXiv. Using various experimental settings on three datasets (i.e., CNN/DailyMail, PubMed and arXiv), our HiStruct+ model collectively outperforms a strong baseline that differs from our model only in that the hierarchical structure information is not injected. We also observe that the more conspicuous the hierarchical structure of a dataset, the larger the improvement our method gains. The ablation study demonstrates that the hierarchical position information is the main contributor to our model's SOTA performance.
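The core idea of the abstract, injecting each sentence's hierarchical position (its section index and its position within that section) into the sentence representation, can be sketched as below. This is a minimal illustration, not the authors' implementation; the function names, the use of plain NumPy lookup tables in place of learned embeddings, and the simple summation variant for combining the embeddings are all assumptions for illustration.

```python
import numpy as np


def hierarchical_positions(sections):
    """Map each sentence of a sectioned document to its hierarchical
    position: (section_index, sentence_index_within_section).

    `sections` is a list of sections, each a list of sentences.
    """
    positions = []
    for sec_idx, sentences in enumerate(sections):
        for sent_idx in range(len(sentences)):
            positions.append((sec_idx, sent_idx))
    return positions


def inject_hierarchy(sent_embs, positions, sec_table, pos_table):
    """Add hierarchical position embeddings to the sentence embeddings
    (a simple summation variant; hypothetical, for illustration only).

    sent_embs : (num_sentences, dim) array of sentence representations
    sec_table : (max_sections, dim) embedding table for section indices
    pos_table : (max_sent_per_sec, dim) table for within-section indices
    """
    out = sent_embs.copy()
    for i, (sec, pos) in enumerate(positions):
        out[i] = out[i] + sec_table[sec] + pos_table[pos]
    return out


# Toy document: two sections with 2 and 1 sentences respectively.
doc = [["s1", "s2"], ["s3"]]
positions = hierarchical_positions(doc)  # [(0, 0), (0, 1), (1, 0)]

rng = np.random.default_rng(0)
dim = 8
sent_embs = rng.standard_normal((3, dim))
sec_table = rng.standard_normal((4, dim))
pos_table = rng.standard_normal((16, dim))

enriched = inject_hierarchy(sent_embs, positions, sec_table, pos_table)
```

In the paper's setting these tables would be trainable embeddings inside the Transformer encoder; the sketch only shows where the hierarchical position signal attaches to each sentence vector.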