Paper Title

Statistical Dataset Evaluation: Reliability, Difficulty, and Validity

Paper Authors

Chengwen Wang, Qingxiu Dong, Xiaochen Wang, Haitao Wang, Zhifang Sui

Paper Abstract

Datasets serve as crucial training resources and model performance trackers. However, existing datasets have exposed a plethora of problems, inducing biased models and unreliable evaluation results. In this paper, we propose a model-agnostic dataset evaluation framework for automatic dataset quality evaluation. We seek the statistical properties of the datasets and address three fundamental dimensions: reliability, difficulty, and validity, following classical test theory. Taking the Named Entity Recognition (NER) datasets as a case study, we introduce $9$ statistical metrics for a statistical dataset evaluation framework. Experimental results and human evaluation validate that our evaluation framework effectively assesses various aspects of the dataset quality. Furthermore, we study how the dataset scores on our statistical metrics affect the model performance, and appeal for dataset quality evaluation or targeted dataset improvement before training or testing models.
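
The nine NER metrics themselves are defined in the paper; as a rough illustration of the kind of model-agnostic statistic such a framework computes directly from a dataset, the sketch below derives two toy quantities from a BIO-tagged corpus: an entity-density score (a crude difficulty proxy) and the entropy of the entity-type distribution (a crude label-balance proxy). The toy dataset, the metric names `entity_density` and `type_entropy`, and their interpretations are assumptions made for this illustration, not the paper's actual definitions.

```python
import math
from collections import Counter

# Toy BIO-tagged NER dataset: each sentence is a list of (token, tag) pairs.
# Illustrative only -- these are NOT the nine metrics defined in the paper.
dataset = [
    [("Barack", "B-PER"), ("Obama", "I-PER"), ("visited", "O"), ("Paris", "B-LOC")],
    [("Apple", "B-ORG"), ("opened", "O"), ("a", "O"), ("store", "O")],
    [("No", "O"), ("entities", "O"), ("here", "O")],
]

def entity_density(sentences):
    """Fraction of tokens inside an entity span (a rough difficulty proxy:
    sparser entity annotation tends to make the task harder)."""
    tags = [tag for sent in sentences for _, tag in sent]
    return sum(t != "O" for t in tags) / len(tags)

def type_entropy(sentences):
    """Shannon entropy (bits) of the entity-type frequencies (a rough
    balance proxy: low entropy signals a skewed type distribution)."""
    types = Counter(tag.split("-", 1)[1]
                    for sent in sentences for _, tag in sent if tag != "O")
    total = sum(types.values())
    return -sum((c / total) * math.log2(c / total) for c in types.values())

print(f"entity density: {entity_density(dataset):.3f}")
print(f"type entropy:   {type_entropy(dataset):.3f} bits")
```

Note that both quantities are computed from the annotations alone, with no model in the loop, which is what "model-agnostic" means in the abstract: the dataset can be scored before any training or testing takes place.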
