Crowd Score: A Method for the Evaluation of Jokes using Large Language Model AI Voters as Judges

Paper title

Crowd Score: A Method for the Evaluation of Jokes using Large Language Model AI Voters as Judges

Paper authors

Goes, Fabricio; Zhou, Zisen; Sawicki, Piotr; Grzes, Marek; Brown, Daniel G.

Paper abstract

This paper presents the Crowd Score, a novel method to assess the funniness of jokes using large language models (LLMs) as AI judges. Our method relies on inducing different personalities into the LLM and aggregating the votes of the AI judges into a single score to rate jokes. We validate the votes using an auditing technique in which the LLM checks whether the explanation given for a particular vote is reasonable. We tested our methodology on 52 jokes with a crowd of four AI voters, each with a different humour type: affiliative, self-enhancing, aggressive and self-defeating. Our results show that few-shot prompting leads to better results than zero-shot prompting for the voting question. Personality induction showed that, on a set of aggressive/self-defeating jokes, aggressive and self-defeating voters are significantly more inclined to find jokes funny than affiliative and self-enhancing voters. The Crowd Score follows the same trend as human judges, assigning higher scores to jokes that human judges also consider funnier. We believe that our methodology could be applied to other creative domains such as stories, poetry and slogans. It could help the computational creativity (CC) community adopt a flexible and accurate standard approach for comparing different work under a common metric. By minimizing human participation in assessing creative artefacts, it could also accelerate their prototyping and reduce the cost of hiring human participants to rate them.
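The aggregation step described in the abstract can be sketched roughly as follows. This is only an illustration: the abstract does not give the exact scoring formula, so the sketch assumes the score is the fraction of "funny" votes among the votes whose explanation passed the audit. The names `Vote`, `audited_ok` and `crowd_score` are hypothetical and do not come from the paper, and the LLM calls that would produce the votes and audit results are left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# The four humour types named in the abstract, one AI voter per type.
HUMOUR_TYPES = ("affiliative", "self-enhancing", "aggressive", "self-defeating")


@dataclass(frozen=True)
class Vote:
    """One AI voter's verdict on a joke (hypothetical structure)."""
    voter: str        # humour type induced into the LLM
    funny: bool       # True if the voter judged the joke funny
    audited_ok: bool  # True if the vote's explanation passed the audit


def crowd_score(votes: Iterable[Vote]) -> Optional[float]:
    """Assumed aggregation: share of 'funny' votes among audited votes.

    Returns None when no vote survives the audit, since no score
    can be computed in that case.
    """
    valid = [v for v in votes if v.audited_ok]
    if not valid:
        return None
    return sum(v.funny for v in valid) / len(valid)


# Made-up votes for a single joke, for illustration only.
example_votes = [
    Vote("affiliative", funny=False, audited_ok=True),
    Vote("self-enhancing", funny=False, audited_ok=True),
    Vote("aggressive", funny=True, audited_ok=True),
    Vote("self-defeating", funny=True, audited_ok=True),
]

print(crowd_score(example_votes))
```

Dropping votes that fail the audit, rather than counting them as "not funny", is a design choice of this sketch and not something the abstract specifies.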
