Paper Title

Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

Paper Authors

Yejin Bang, Tiezheng Yu, Andrea Madotto, Zhaojiang Lin, Mona Diab, Pascale Fung

Paper Abstract

Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values. Yet, human values can vary under diverse cultural conditions. Therefore, we introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command. Along with the task, we propose a practical approach that distills value-aligned knowledge from large-scale language models (LLMs) to construct value-aligned classifiers in two steps. First, we generate value-aligned training data from LLMs by prompt-based few-shot learning. Next, we fine-tune smaller classification models with the generated data for the task. Empirical results show that our VA-Models surpass multiple baselines by at least 15.56% on the F1-score, including few-shot learning with OPT-175B and existing text augmentation methods. We suggest that using classifiers with explicit human value input improves both inclusivity & explainability in AI.
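The abstract outlines a two-step recipe: prompt an LLM with an explicitly written human value plus a few labeled examples to synthesize value-aligned training data, then fine-tune a smaller classifier on that data. The sketch below illustrates that shape using Hugging Face transformers and datasets; the checkpoints (facebook/opt-125m standing in for OPT-175B, roberta-base as the small classifier), the prompt wording, the label heuristic, and the hyper-parameters are all illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch of the two-step distillation pipeline, under the assumptions stated above.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)
from datasets import Dataset

# Step 1: prompt-based few-shot generation of value-aligned training data.
# A small OPT checkpoint stands in for the OPT-175B model mentioned in the abstract.
generator = pipeline("text-generation", model="facebook/opt-125m")

# The explicitly written human value the classifier should align with
# (hypothetical wording, for illustration only).
value_statement = "Value: statements that demean people based on gender are unacceptable."
few_shot_block = (
    'Text: "Women belong in the kitchen." Label: sexist\n'
    'Text: "She gave a great talk at the conference." Label: not sexist\n'
)

def llm_label(seed_text: str) -> str:
    """Ask the LLM for a label on seed_text, conditioned on the value statement."""
    prompt = f'{value_statement}\n{few_shot_block}Text: "{seed_text}" Label:'
    full = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    return full[len(prompt):].strip()

def to_class(label_text: str) -> int:
    # Illustrative heuristic mapping the LLM's text label back to an integer class.
    return 0 if "not sexist" in label_text.lower() else 1

seed_texts = ["Girls are bad at math.", "The weather was lovely today."]
generated = [{"text": t, "label": to_class(llm_label(t))} for t in seed_texts]

# Step 2: fine-tune a smaller classifier on the generated, value-aligned data.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

train_ds = Dataset.from_list(generated).map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=64)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="va-classifier",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=train_ds,
)
trainer.train()
```

In this setup the value statement is the steerable part: under a different explicitly written value, the LLM produces different labels, and the distilled classifier inherits that alignment rather than a single fixed notion of, say, sexism or toxicity.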
