Paper Title
Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
Paper Authors
Paper Abstract
We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as a GPT model, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models reproduce classic economic, psycholinguistic, and social psychology experiments: the Ultimatum Game, Garden Path Sentences, the Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.
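To make the contrast with the Turing Test concrete, here is a minimal Python sketch of the general shape a TE might take: the same scenario is posed once per simulated participant, and only the aggregate is examined. Everything in it is an illustrative assumption rather than the paper's method: the participant names, the prompt wording, the helper names `build_prompt` and `run_te`, and above all `stub_model`, a deterministic placeholder for a real language model call whose acceptance threshold is arbitrary and is not a finding.

```python
import re
from typing import Callable, List

# Hypothetical participant names; a real TE would use a representative sample.
PARTICIPANTS: List[str] = ["Ms. Olson", "Mr. Nguyen", "Ms. Garcia", "Mr. Patel"]


def build_prompt(name: str, offer: int, total: int = 10) -> str:
    """Build one Ultimatum Game prompt for a named participant (assumed wording)."""
    return (
        f"{name} is offered ${offer} out of ${total} in an ultimatum game. "
        f"{name} decides to"
    )


def stub_model(prompt: str) -> str:
    """Placeholder for a language model: accepts any offer of $3 or more.

    The threshold is arbitrary and exists only so the sketch runs offline.
    """
    match = re.search(r"offered \$(\d+)", prompt)
    offer = int(match.group(1)) if match else 0
    return "accept" if offer >= 3 else "reject"


def run_te(model: Callable[[str], str], names: List[str], offer: int) -> float:
    """Query the model once per simulated participant; return the acceptance rate."""
    decisions = [model(build_prompt(name, offer)) for name in names]
    return sum(d == "accept" for d in decisions) / len(decisions)


if __name__ == "__main__":
    for offer in (1, 3, 5):
        rate = run_te(stub_model, PARTICIPANTS, offer)
        print(f"offer=${offer}: acceptance rate {rate:.2f}")
```

Replacing `stub_model` with a call to an actual language model would allow the acceptance rate at each offer level to be compared against the rates reported in human studies, which is the kind of comparison the abstract describes.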