Paper Title
Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models
Paper Authors
Paper Abstract
Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with meaning. We propose that $n$-grams composed of random character sequences, or $garble$, provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly generated character $n$-grams lack meaning but contain primitive information based on the distribution of characters they contain. By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of $n$-grams. Furthermore, we show that this axis relates to structure within extant language, including word part-of-speech, morphology, and concept concreteness. Thus, in contrast to studies that are mainly limited to extant language, our work reveals that meaning and primitive information are intrinsically linked.