Paper Title
TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
Paper Authors
Paper Abstract
Recent years have witnessed the burgeoning of pretrained language models (LMs) for text-based natural language (NL) understanding tasks. Such models are typically trained on free-form NL text, and hence may not be suitable for tasks like semantic parsing over structured data, which require reasoning over both free-form NL questions and structured tabular data (e.g., database tables). In this paper, we present TaBERT, a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables. TaBERT is trained on a large corpus of 26 million tables and their English contexts. In experiments, neural semantic parsers using TaBERT as feature representation layers achieve new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider. An implementation of the model will be available at http://fburl.com/TaBERT.
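Since the abstract points to a public implementation, the following is a minimal usage sketch of how TaBERT might serve as a feature representation layer: jointly encoding an NL utterance and a database table into per-token and per-column vectors. The class names (`TableBertModel`, `Table`, `Column`), the checkpoint path, and the `encode` signature are assumptions modeled on the released `table_bert` package, not details confirmed by the abstract.

```python
# Hypothetical usage sketch, assuming an API shaped like the released
# `table_bert` package; names, paths, and signatures are assumptions.
from table_bert import Table, Column, TableBertModel

# Load a pretrained TaBERT checkpoint (path is a placeholder).
model = TableBertModel.from_pretrained('path/to/tabert_checkpoint.bin')

# Describe a (semi-)structured table: column names, types, sample values,
# and content rows, then tokenize it with the model's tokenizer.
table = Table(
    id='List of countries by GDP (PPP)',
    header=[
        Column('Nation', 'text', sample_value='United States'),
        Column('Gross Domestic Product', 'real', sample_value='21,439,453'),
    ],
    data=[
        ['United States', '21,439,453'],
        ['China', '27,308,857'],
    ],
).tokenize(model.tokenizer)

# The NL question whose representation is learned jointly with the table.
context = model.tokenizer.tokenize('show me countries ranked by GDP')

# Joint encoding: per-token vectors for the utterance and per-column
# vectors for the table, which a downstream semantic parser can consume
# as its feature layer in place of randomly initialized embeddings.
context_encoding, column_encoding, info_dict = model.encode(
    contexts=[context],
    tables=[table],
)
```

A semantic parser such as those evaluated on WikiTableQuestions or Spider would feed `context_encoding` and `column_encoding` into its decoder when scoring candidate logical forms or SQL queries.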