Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models

Paper Title

Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models

Authors

Chu, Mark; Desikan, Bhargav Srinivasa; Nadler, Ethan O.; Sardo, D. Ruggiero Lo; Darragh-Ford, Elise; Guilbeault, Douglas

Abstract

Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with meaning. We propose that $n$-grams composed of random character sequences, or $garble$, provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly generated character $n$-grams lack meaning but contain primitive information based on the distribution of characters they contain. By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of $n$-grams. Furthermore, we show that this axis relates to structure within extant language, including word part-of-speech, morphology, and concept concreteness. Thus, in contrast to studies that are mainly limited to extant language, our work reveals that meaning and primitive information are intrinsically linked.
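
The idea of an axis that separates garble from extant language can be illustrated with a toy sketch. This is not the authors' pipeline: it swaps CharacterBERT embeddings for simple letter-frequency vectors and uses a difference-of-class-means direction as the axis, both of which are assumptions made only so the example runs with the standard library. The word list and all function names (`make_garble`, `char_features`, `separating_axis`, `project`) are hypothetical.

```python
import random
import string


def make_garble(n_items, min_len=3, max_len=10, seed=0):
    """Generate random lowercase character sequences ("garble")."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(string.ascii_lowercase)
                for _ in range(rng.randint(min_len, max_len)))
        for _ in range(n_items)
    ]


def char_features(token):
    """Stand-in 'embedding': normalized letter frequencies (26 dimensions)."""
    counts = [0.0] * 26
    for ch in token:
        if ch in string.ascii_lowercase:
            counts[ord(ch) - ord("a")] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def separating_axis(class_a, class_b):
    """Unit vector pointing from the mean of class_b to the mean of class_a."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def project(vector, axis):
    """Scalar position of a vector along the axis (dot product)."""
    return sum(v * a for v, a in zip(vector, axis))


# Hypothetical toy data: a handful of English words versus random n-grams.
words = ["signal", "noise", "meaning", "language", "model",
         "context", "random", "character", "sequence", "embedding"]
garble = make_garble(200)

word_vecs = [char_features(w) for w in words]
garble_vecs = [char_features(g) for g in garble]

axis = separating_axis(word_vecs, garble_vecs)
mean_word_score = sum(project(v, axis) for v in word_vecs) / len(word_vecs)
mean_garble_score = sum(project(v, axis) for v in garble_vecs) / len(garble_vecs)

print("mean projection, words: ", mean_word_score)
print("mean projection, garble:", mean_garble_score)
```

By construction, the mean projection of the word class exceeds that of the garble class, but individual items may still overlap. Replacing `char_features` with embeddings from a character-aware model would bring the sketch closer to the setting the abstract describes.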
