Paper Title


TUTA: Tree-based Transformers for Generally Structured Table Pre-training

Paper Authors

Zhiruo Wang, Haoyu Dong, Ran Jia, Jia Li, Zhiyi Fu, Shi Han, Dongmei Zhang

Paper Abstract


Tables are widely used with various structures to organize and present data. Recent attempts at table understanding mainly focus on relational tables, yet overlook other common table structures. In this paper, we propose TUTA, a unified pre-training architecture for understanding generally structured tables. Noticing that understanding a table requires spatial, hierarchical, and semantic information, we enhance transformers with three novel structure-aware mechanisms. First, we devise a unified tree-based structure, called a bi-dimensional coordinate tree, to describe both the spatial and hierarchical information of generally structured tables. Building on this, we propose tree-based attention and position embedding to better capture the spatial and hierarchical information. Moreover, we devise three progressive pre-training objectives to enable representations at the token, cell, and table levels. We pre-train TUTA on a wide range of unlabeled web and spreadsheet tables and fine-tune it on two critical tasks in the field of table structure understanding: cell type classification and table type classification. Experiments show that TUTA is highly effective, achieving state-of-the-art results on five widely studied datasets.
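To make the "bi-dimensional coordinate tree" idea concrete, here is a minimal sketch of one of its two dimensions: header cells form a tree, each cell gets a coordinate (its path from the root), and the tree distance between two cells is the number of steps through their lowest common ancestor. All names here (`TreeNode`, `coordinate`, `tree_distance`, the sample headers) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """One header node; children are finer-grained headers or data cells."""
    name: str
    children: list = field(default_factory=list)

def coordinate(root, name, path=()):
    """Return the tree coordinate of `name` as a path of child indices, or None."""
    if root.name == name:
        return path
    for i, child in enumerate(root.children):
        found = coordinate(child, name, path + (i,))
        if found is not None:
            return found
    return None

def tree_distance(coord_a, coord_b):
    """Number of edges from a to b through their lowest common ancestor."""
    common = 0
    for x, y in zip(coord_a, coord_b):
        if x != y:
            break
        common += 1
    return (len(coord_a) - common) + (len(coord_b) - common)

# Hypothetical column-header tree: years at the top, quarters beneath them.
top_tree = TreeNode("root", [
    TreeNode("2019", [TreeNode("2019-Q1"), TreeNode("2019-Q2")]),
    TreeNode("2020", [TreeNode("2020-Q1")]),
])
```

Under this sketch, sibling headers (e.g. `2019-Q1` and `2019-Q2`) are at tree distance 2, while headers under different years (e.g. `2019-Q1` and `2020-Q1`) are at distance 4 — the kind of hierarchy-aware distance the paper's tree-based attention is built to exploit, as opposed to flat row/column offsets.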
