AnaMeta: A Table Understanding Dataset of Field Metadata Knowledge Shared by Multi-dimensional Data Analysis Tasks

Paper title

AnaMeta: A Table Understanding Dataset of Field Metadata Knowledge Shared by Multi-dimensional Data Analysis Tasks

Authors

Xinyi He, Mengyu Zhou, Mingjie Zhou, Jialiang Xu, Xiao Lv, Tianle Li, Yijia Shao, Shi Han, Zejian Yuan, Dongmei Zhang

Abstract

Tabular data analysis is performed every day across various domains. It requires an accurate understanding of field semantics to correctly operate on table fields and find common patterns in daily analysis. In this paper, we introduce the AnaMeta dataset, a collection of 467k tables with derived supervision labels for four types of commonly used field metadata: measure/dimension dichotomy, common field roles, semantic field type, and default aggregation function. We evaluate a wide range of models for inferring metadata as the benchmark. We also propose a multi-encoder framework, called KDF, which improves the metadata understanding capability of tabular models by incorporating distribution and knowledge information. Furthermore, we propose four interfaces for incorporating field metadata into downstream analysis tasks.
