Paper Title
DOM-LM: Learning Generalizable Representations for HTML Documents
Paper Authors
Paper Abstract
HTML documents are an important medium for disseminating information on the Web for human consumption. An HTML document presents information in multiple text formats including unstructured text, structured key-value pairs, and tables. Effective representation of these documents is essential for machine understanding to enable a wide range of applications, such as Question Answering, Web Search, and Personalization. Existing work has either represented these documents using visual features extracted by rendering them in a browser, which is typically computationally expensive, or has simply treated them as plain text documents, thereby failing to capture useful information presented in their HTML structure. We argue that the text and HTML structure together convey important semantics of the content and therefore warrant special treatment in their representation learning. In this paper, we introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches by encoding both text and DOM tree structure with a transformer-based encoder and learning generalizable representations for HTML documents via self-supervised pre-training. We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering. Our extensive experiments show that DOM-LM consistently outperforms all baselines designed for these tasks. In particular, DOM-LM demonstrates better generalization performance in both few-shot and zero-shot settings, making it attractive for real-world application settings with limited labeled data.
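The abstract's central idea, treating text and DOM tree structure as a paired input rather than flattening a page to plain text, can be illustrated with a minimal sketch. This is not the paper's actual featurization (DOM-LM uses a transformer encoder with positional and tree-based embeddings); it only shows, using Python's standard-library `html.parser`, how each text node can be extracted together with its DOM path so that structure is preserved alongside content:

```python
from html.parser import HTMLParser

class DomTextExtractor(HTMLParser):
    """Collects (dom_path, text) pairs while walking an HTML document.

    Illustrative only: a real DOM-LM-style pipeline would feed richer
    node features (tag, depth, sibling index, etc.) to the encoder.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags = DOM path to this point
        self.pairs = []   # (path, text) pairs that keep structure + content

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))

html_doc = (
    "<html><body>"
    "<h1>DOM-LM</h1>"
    "<p>Representations for <b>HTML</b></p>"
    "</body></html>"
)
parser = DomTextExtractor()
parser.feed(html_doc)
for path, text in parser.pairs:
    print(path, "->", text)
```

A plain-text baseline would see only "DOM-LM Representations for HTML", whereas the paired form records that "DOM-LM" sits in a heading and "HTML" inside emphasis within a paragraph, exactly the kind of structural signal the abstract argues carries semantics.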