Paper Title

Named Entity Recognition in Industrial Tables using Tabular Language Models

Authors

Aneta Koleva, Martin Ringsquandl, Mark Buckley, Rakebul Hasan, Volker Tresp

Abstract

Specialized transformer-based models for encoding tabular data have gained interest in academia. Although tabular data is omnipresent in industry, applications of table transformers are still missing. In this paper, we study how these models can be applied to an industrial Named Entity Recognition (NER) problem where the entities are mentioned in tabular-structured spreadsheets. The highly technical nature of spreadsheets as well as the lack of labeled data present major challenges for fine-tuning transformer-based models. Therefore, we develop a dedicated table data augmentation strategy based on available domain-specific knowledge graphs. We show that this boosts performance in our low-resource scenario considerably. Further, we investigate the benefits of tabular structure as inductive bias compared to tables as linearized sequences. Our experiments confirm that a table transformer outperforms other baselines and that its tabular inductive bias is vital for convergence of transformer-based models.
