Paper Title
Logographic Information Aids Learning Better Representations for Natural Language Inference
Paper Authors
Paper Abstract
Statistical language models conventionally implement representation learning based on the contextual distribution of words or other formal units, while any information related to the logographic features of written text is often ignored, on the assumption that it can be retrieved from co-occurrence statistics. On the other hand, as language models become larger and require more data to learn reliable representations, this assumption may begin to break down, especially under conditions of data sparsity. Many languages, including Chinese and Vietnamese, use logographic writing systems in which surface forms are represented as a visual organization of smaller graphemic units, which often carry rich semantic cues. In this paper, we present a novel study that explores the benefits of providing language models with logographic information for learning better semantic representations. We test our hypothesis on the natural language inference (NLI) task by evaluating the benefit of computing multi-modal representations that combine contextual information with glyph information. Our evaluation results on six languages with different typologies and writing systems suggest significant benefits of using multi-modal embeddings in languages with logographic systems, especially for words with sparser occurrence statistics.
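As one minimal illustration of the multi-modal fusion the abstract describes, a contextual word vector can be combined with a glyph-derived vector by normalizing each modality and concatenating them. This is only a sketch: the dimensions, the normalize-and-concatenate scheme, and the function names are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 300-d contextual embedding and a
# 64-d glyph embedding (e.g. from a CNN over the character image).
D_CTX, D_GLYPH = 300, 64

def multimodal_embedding(ctx_vec, glyph_vec):
    """Fuse a contextual vector with a glyph vector by L2-normalizing
    each modality and concatenating them (one simple fusion scheme;
    the paper does not specify its fusion method in this abstract)."""
    ctx = ctx_vec / (np.linalg.norm(ctx_vec) + 1e-8)
    glyph = glyph_vec / (np.linalg.norm(glyph_vec) + 1e-8)
    return np.concatenate([ctx, glyph])

ctx = rng.normal(size=D_CTX)      # stand-in for a contextual embedding
glyph = rng.normal(size=D_GLYPH)  # stand-in for a glyph embedding
vec = multimodal_embedding(ctx, glyph)
print(vec.shape)  # (364,)
```

Because each modality is unit-normalized before concatenation, neither the contextual nor the glyph signal dominates the fused vector purely by scale, which matters most for the low-frequency words the abstract highlights.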