Paper Title
CorDEL: A Contrastive Deep Learning Approach for Entity Linkage
Authors
Abstract
Entity linkage (EL) is a critical problem in data cleaning and integration. In the past several decades, EL has typically been done by rule-based systems or traditional machine learning models with hand-curated features, both of which depend heavily on manual human input. With the ever-increasing growth of new data, deep learning (DL) based approaches have been proposed to alleviate the high cost of EL associated with the traditional models. Existing exploration of DL models for EL strictly follows the well-known twin-network architecture. However, we argue that the twin-network architecture is sub-optimal for EL, leading to inherent drawbacks in existing models. To address these drawbacks, we propose a novel and generic contrastive DL framework for EL. The proposed framework is able to capture both syntactic and semantic matching signals and pays attention to subtle but critical differences. Based on the framework, we develop a contrastive DL approach for EL, called CorDEL, with three powerful variants. We evaluate CorDEL with extensive experiments conducted on both public benchmark datasets and a real-world dataset. CorDEL outperforms previous state-of-the-art models by 5.2% on public benchmark datasets. Moreover, CorDEL yields a 2.4% improvement over the current best DL model on the real-world dataset, while reducing the number of training parameters by 97.6%.