Paper Title
Data-Centric Machine Learning in the Legal Domain
Paper Authors
Abstract
Machine learning research typically starts with a fixed data set created early in the process. The focus of the experiments is finding a model and training procedure that result in the best possible performance in terms of some selected evaluation metric. This paper explores how changes in a data set influence the measured performance of a model. Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the human labelling accuracy impact the performance of a trained deep learning classifier. We assess the overall performance (weighted average) as well as the per-class performance. The observed effects are surprisingly pronounced, especially when the per-class performance is considered. We investigate how "semantic homogeneity" of a class, i.e., the proximity of sentences in a semantic embedding space, influences the difficulty of its classification. The presented results have far-reaching implications for efforts related to data collection and curation in the field of AI & Law. The results also indicate that enhancements to a data set could be considered, alongside the advancement of the ML models, as an additional path for increasing classification performance on various tasks in AI & Law. Finally, we discuss the need for an established methodology to assess the potential effects of data set properties.