Title

Deep Lexical Hypothesis: Identifying personality structure in natural language

Authors

Andrew Cutler and David M. Condon

Abstract

Recent advances in natural language processing (NLP) have produced general models that can perform complex tasks such as summarizing long passages and translating across languages. Here, we introduce a method to extract adjective similarities from language models as done with survey-based ratings in traditional psycholexical studies but using millions of times more text in a natural setting. The correlational structure produced through this method is highly similar to that of self- and other-ratings of 435 terms reported by Saucier and Goldberg (1996a). The first three unrotated factors produced using NLP are congruent with those in survey data, with coefficients of 0.89, 0.79, and 0.79. This structure is robust to many modeling decisions: adjective set, including those with 1,710 terms (Goldberg, 1982) and 18,000 terms (Allport & Odbert, 1936); the query used to extract correlations; and language model. Notably, Neuroticism and Openness are only weakly and inconsistently recovered. This is a new source of signal that is closer to the original (semantic) vision of the Lexical Hypothesis. The method can be applied where surveys cannot: in dozens of languages simultaneously, with tens of thousands of items, on historical text, and at extremely large scale for little cost. The code is made public to facilitate reproduction and fast iteration in new directions of research.
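The pipeline the abstract describes can be sketched in a few lines: treat each adjective's language-model representation as a vector, build the inter-adjective correlation matrix (the analogue of inter-item correlations in survey-based psycholexical studies), extract unrotated factors, and compare factor loadings with the Tucker congruence coefficient used to report the 0.89/0.79/0.79 figures. This is an illustrative sketch, not the authors' released code; the random embedding matrix stands in for real adjective representations.

```python
import numpy as np

# Placeholder for language-model representations of 435 adjectives
# (in the paper these would come from a real model, not random noise).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(435, 300))

# Correlation matrix across adjectives (rows), analogous to the
# correlational structure of self- and other-ratings in survey data.
corr = np.corrcoef(embeddings)

# Unrotated factors: eigenvectors of the correlation matrix scaled by the
# square root of their eigenvalues (principal-axis-style loadings).
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order[:3]] * np.sqrt(eigvals[order[:3]])

def congruence(x, y):
    """Tucker congruence coefficient between two loading vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# A factor is perfectly congruent with itself; comparing NLP-derived and
# survey-derived loadings of the same factor would use the same function.
self_congruence = congruence(loadings[:, 0], loadings[:, 0])
```

In the paper the congruence would be computed between an NLP-derived loading vector and the corresponding survey-derived loading vector over the shared adjective set, after matching factor order and sign.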
