Paper Title

Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

Paper Authors

Yejin Bang, Tiezheng Yu, Andrea Madotto, Zhaojiang Lin, Mona Diab, Pascale Fung

Paper Abstract

Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values. Yet, human values can vary under diverse cultural conditions. Therefore, we introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command. Along with the task, we propose a practical approach that distills value-aligned knowledge from large-scale language models (LLMs) to construct value-aligned classifiers in two steps. First, we generate value-aligned training data from LLMs by prompt-based few-shot learning. Next, we fine-tune smaller classification models with the generated data for the task. Empirical results show that our VA-Models surpass multiple baselines by at least 15.56% on the F1-score, including few-shot learning with OPT-175B and existing text augmentation methods. We suggest that using classifiers with explicit human value input improves both inclusivity & explainability in AI.
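The abstract outlines a two-step recipe: prompt an LLM with an explicitly written human value plus a few labeled examples to synthesize value-aligned training data, then fine-tune a smaller classifier on that data. The sketch below illustrates that shape using Hugging Face transformers and datasets; the checkpoints (facebook/opt-125m standing in for OPT-175B, roberta-base as the small classifier), the prompt wording, the label heuristic, and the hyper-parameters are all illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch of the two-step distillation pipeline, under the assumptions stated above.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)
from datasets import Dataset

# Step 1: prompt-based few-shot generation of value-aligned training data.
# A small OPT checkpoint stands in for the OPT-175B model mentioned in the abstract.
generator = pipeline("text-generation", model="facebook/opt-125m")

# The explicitly written human value the classifier should align with
# (hypothetical wording, for illustration only).
value_statement = "Value: statements that demean people based on gender are unacceptable."
few_shot_block = (
    'Text: "Women belong in the kitchen." Label: sexist\n'
    'Text: "She gave a great talk at the conference." Label: not sexist\n'
)

def llm_label(seed_text: str) -> str:
    """Ask the LLM for a label on seed_text, conditioned on the value statement."""
    prompt = f'{value_statement}\n{few_shot_block}Text: "{seed_text}" Label:'
    full = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    return full[len(prompt):].strip()

def to_class(label_text: str) -> int:
    # Illustrative heuristic mapping the LLM's text label back to an integer class.
    return 0 if "not sexist" in label_text.lower() else 1

seed_texts = ["Girls are bad at math.", "The weather was lovely today."]
generated = [{"text": t, "label": to_class(llm_label(t))} for t in seed_texts]

# Step 2: fine-tune a smaller classifier on the generated, value-aligned data.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

train_ds = Dataset.from_list(generated).map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=64)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="va-classifier",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=train_ds,
)
trainer.train()
```

In this setup the value statement is the steerable part: under a different explicitly written value, the LLM produces different labels, and the distilled classifier inherits that alignment rather than a single fixed notion of, say, sexism or toxicity.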
