Paper Title
A Multi-Task Benchmark for Korean Legal Language Understanding and Judgement Prediction
Paper Authors
Paper Abstract
Recent advances in deep learning have dramatically changed how machine learning, especially in the domain of natural language processing, can be applied to the legal domain. However, this shift to data-driven approaches calls for larger and more diverse datasets, which are nevertheless still few in number, especially in non-English languages. Here we present the first large-scale benchmark of Korean legal AI datasets, LBOX OPEN, which consists of one legal corpus, two classification tasks, two legal judgement prediction (LJP) tasks, and one summarization task. The legal corpus consists of 147k Korean precedents (259M tokens), of which 63k were sentenced in the last 4 years and 96k are from the first- and second-level courts, where factual issues are reviewed. The two classification tasks are case name (11.3k) and statute (2.8k) prediction from the factual description of individual cases. The LJP tasks consist of (1) 10.5k criminal examples, where the model is asked to predict fine amount, imprisonment with labor, and imprisonment without labor ranges for the given facts, and (2) 4.7k civil examples, where the inputs are facts and a claim for relief and the outputs are the degrees of claim acceptance. The summarization task consists of Supreme Court precedents and the corresponding summaries (20k). We also release realistic variants of the datasets by extending the domain (1) to infrequent case categories in the case name (31k examples) and statute (17.7k) classification tasks, and (2) to long input sequences in the summarization task (51k). Finally, we release LCUBE, the first Korean legal language model, trained on the legal corpus from this study. Given the uniqueness of the Law of South Korea and the diversity of the legal tasks covered in this work, we believe that LBOX OPEN contributes to the multilinguality of global legal research. LBOX OPEN and LCUBE will be publicly available.
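To make the task structure concrete, the following is a minimal sketch of how the examples described in the abstract could be represented as record schemas. The field names (e.g. `facts`, `claim_for_relief`, `imprisonment_with_labor`) are hypothetical and derived only from the abstract's wording; the released LBOX OPEN datasets may use different keys and formats.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassificationExample:
    """Case name or statute classification: predict a label from the facts."""
    facts: str   # factual description of an individual case
    label: str   # case name, or a statute identifier

@dataclass
class CriminalLJPExample:
    """Criminal LJP: predict punishment ranges for the given facts."""
    facts: str
    fine_amount: Optional[str]               # fine range, if a fine was imposed
    imprisonment_with_labor: Optional[str]   # range, if applicable
    imprisonment_without_labor: Optional[str]

@dataclass
class CivilLJPExample:
    """Civil LJP: predict the degree of claim acceptance."""
    facts: str
    claim_for_relief: str
    claim_acceptance_degree: str   # e.g. full / partial / rejected (illustrative)

@dataclass
class SummarizationExample:
    """Summarization: a Supreme Court precedent paired with its summary."""
    precedent: str
    summary: str
```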