Paper Title
The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets
Paper Authors
Paper Abstract
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing. However, undesired patterns in the collected data can make such tests incorrect. For example, if the feminine subset of a gender-bias-measuring coreference resolution dataset contains sentences with a longer average distance between the pronoun and the correct candidate, an RNN-based model may perform worse on this subset due to long-term dependencies. In this work, we introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data. We demonstrate the method on the GAP dataset for coreference resolution. We annotate GAP with spans of all personal names and show that examples in the female subset contain more personal names and a longer distance between pronouns and their referents, potentially affecting the bias score in an undesired way. Using our weighting method, we find the set of weights on the test instances that should be used for coping with these correlations, and we re-evaluate 16 recently released coreference models.
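The core idea of reweighting test instances so that a confounding feature (here, pronoun-to-referent distance) is distributed the same way across the gendered subsets can be sketched with a simple density-ratio estimate. The sketch below is not the paper's actual method; it uses synthetic distances and histogram-based importance weights purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pronoun-to-referent token distances for the two subsets
# (synthetic; the paper derives these from its name-span annotations of GAP).
dist_f = rng.poisson(12, 500)  # feminine subset: longer distances on average
dist_m = rng.poisson(9, 500)   # masculine subset

# Estimate each subset's distance distribution on a shared binning.
bins = np.arange(0, 31)
p_f, _ = np.histogram(dist_f, bins=bins, density=True)
p_m, _ = np.histogram(dist_m, bins=bins, density=True)

# Weight each feminine example by the masculine/feminine density ratio, so the
# weighted feminine distance distribution matches the masculine one and the
# distance confound no longer drives the measured performance gap.
idx = np.clip(np.digitize(dist_f, bins) - 1, 0, len(p_f) - 1)
w = np.where(p_f[idx] > 0, p_m[idx] / np.maximum(p_f[idx], 1e-12), 0.0)
w /= w.sum()

# The weighted feminine mean distance now sits near the masculine mean.
weighted_mean = float(w @ dist_f)
```

A model's score on the feminine subset would then be computed as a weighted average using `w`, so that any remaining gap to the masculine subset is less attributable to the distance pattern.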