Paper Title
Structural Text Segmentation of Legal Documents
Paper Authors
Paper Abstract
The growing complexity of legal cases has led to an increasing interest in legal information retrieval systems that can effectively satisfy user-specific information needs. However, such downstream systems typically require documents to be properly formatted and segmented, which is often done with relatively simple pre-processing steps that disregard the topical coherence of segments. Systems generally rely either on representations of individual sentences or paragraphs, which may lack crucial context, or on document-level representations, which are too long for meaningful search results. To address this issue, we propose a segmentation system that can predict the topical coherence of sequential text segments spanning several paragraphs, effectively segmenting a document and providing a more balanced representation for downstream applications. We build our model on top of popular transformer networks and formulate structural text segmentation as topical change detection, performing a series of independent classifications that allow for efficient fine-tuning on task-specific data. We crawl a novel dataset consisting of roughly 74,000 online Terms-of-Service documents, including hierarchical topic annotations, which we use for training. Results show that our proposed system significantly outperforms baselines and adapts well to the structural peculiarities of legal documents. We release both data and trained models to the research community for future work: https://github.com/dennlinger/TopicalChange
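
The formulation of segmentation as topical change detection can be illustrated with a minimal sketch. It assumes only what the abstract states: each pair of adjacent paragraphs is classified independently, and a segment boundary is placed wherever a topical change is predicted. The function names and the word-overlap classifier below are hypothetical stand-ins chosen for illustration; the paper's system uses a fine-tuned transformer for the pairwise decision, which is not reproduced here.

```python
from typing import Callable, List


def segment_by_topical_change(
    paragraphs: List[str],
    same_topic: Callable[[str, str], bool],
) -> List[List[str]]:
    """Group consecutive paragraphs into topically coherent segments.

    Every adjacent pair is classified independently. A new segment starts
    whenever the classifier reports a topical change between two paragraphs.
    """
    if not paragraphs:
        return []
    segments = [[paragraphs[0]]]
    for previous, current in zip(paragraphs, paragraphs[1:]):
        if same_topic(previous, current):
            segments[-1].append(current)   # same topic: extend current segment
        else:
            segments.append([current])     # topical change: open a new segment
    return segments


def word_overlap_same_topic(a: str, b: str, threshold: float = 0.2) -> bool:
    """Hypothetical stand-in classifier: Jaccard overlap of lowercased words.

    This is NOT the paper's model; it only makes the sketch runnable.
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return False
    return len(words_a & words_b) / len(words_a | words_b) >= threshold


if __name__ == "__main__":
    document = [
        "we collect personal data from users",
        "personal data we collect is stored securely",
        "you may cancel your subscription at any time",
    ]
    for segment in segment_by_topical_change(document, word_overlap_same_topic):
        print(segment)
```

Because each pairwise decision is independent of the others, the stand-in classifier could in principle be replaced by any binary model over paragraph pairs without changing the segmentation loop.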