Paper Title
P-SIF: Document Embeddings Using Partition Averaging
Paper Authors
Paper Abstract
Simple weighted averaging of word vectors often yields effective representations for sentences which outperform sophisticated seq2seq neural models in many tasks. While it is desirable to use the same method to represent documents as well, unfortunately, the effectiveness is lost when representing long documents involving multiple sentences. One of the key reasons is that a longer document is likely to contain words from many different topics; hence, creating a single vector while ignoring all the topical structure is unlikely to yield an effective document representation. This problem is less acute in single sentences and other short text fragments where the presence of a single topic is most likely. To alleviate this problem, we present P-SIF, a partitioned word averaging model to represent long documents. P-SIF retains the simplicity of simple weighted word averaging while taking a document's topical structure into account. In particular, P-SIF learns topic-specific vectors from a document and finally concatenates them all to represent the overall document. We provide theoretical justifications on the correctness of P-SIF. Through a comprehensive set of experiments, we demonstrate P-SIF's effectiveness compared to simple weighted averaging and many other baselines.
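Below is a minimal sketch of the partition-averaging idea described in the abstract: partition a document's words into topical groups, compute a weighted average of word vectors within each partition, and concatenate the partition averages into a single document vector. The function and parameter names (`embed_document`, `n_partitions`, `sif_a`) are illustrative assumptions, and k-means clustering over word vectors is used here as a stand-in for P-SIF's learned topic assignments; this is not the paper's exact algorithm.

```python
# Hedged sketch: partition-based weighted word averaging in the spirit of P-SIF.
# Assumptions: k-means approximates topic partitioning; SIF-style weights
# a / (a + p(w)) down-weight frequent words. Names are illustrative only.
import numpy as np
from sklearn.cluster import KMeans


def embed_document(doc_tokens, word_vectors, word_freq, n_partitions=5, sif_a=1e-3):
    """Embed a document as the concatenation of per-partition weighted averages.

    doc_tokens   : list of word strings in the document
    word_vectors : dict mapping word -> d-dimensional numpy vector
    word_freq    : dict mapping word -> unigram probability p(w)
    """
    dim = len(next(iter(word_vectors.values())))
    tokens = [w for w in doc_tokens if w in word_vectors]
    if not tokens:
        return np.zeros(n_partitions * dim)

    vecs = np.stack([word_vectors[w] for w in tokens])  # shape (n, d)
    # SIF-style weights: frequent words contribute less to the average.
    weights = np.array([sif_a / (sif_a + word_freq.get(w, 0.0)) for w in tokens])

    # Approximate the document's topical structure with hard clusters over
    # its word vectors (an assumption for illustration, not P-SIF's method).
    k = min(n_partitions, len(tokens))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(vecs)

    parts = np.zeros((n_partitions, dim))
    for p in range(k):
        mask = labels == p
        if mask.any():
            parts[p] = np.average(vecs[mask], axis=0, weights=weights[mask])

    # Concatenating per-partition averages keeps words from different topics
    # in separate sub-vectors instead of collapsing them into one average.
    return parts.reshape(-1)
```

The resulting embedding has dimensionality `n_partitions * d`, so documents with different dominant topics occupy different sub-spaces of the concatenated vector, which is the intuition the abstract contrasts with a single global average.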