Paper Title

PLAtE: A Large-scale Dataset for List Page Web Extraction

Authors

Aidan San, Yuan Zhuang, Jan Bakus, Colin Lockard, David Ciemiewicz, Sandeep Atluri, Yangfeng Ji, Kevin Small, Heba Elfardy

Abstract

Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier for continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) benchmark dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items encompassing the tasks of: (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 52,898 items collected from 6,694 pages and 156,014 attributes, making it the first large-scale list page web extraction dataset. We use a multi-stage approach to collect and annotate the dataset and adapt three state-of-the-art web extraction models to the two tasks, comparing their strengths and weaknesses both quantitatively and qualitatively.
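
To make the two tasks in the abstract concrete, the following is a minimal sketch on a toy page, not the dataset's actual format or any of the adapted models. It assumes a page has already been flattened into a sequence of text nodes; the `Node` class, the `starts_item` flag, the attribute names `name` and `review`, and the `group_items` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """One text node of a list page (hypothetical representation)."""
    text: str
    starts_item: bool          # task 1: does a new product segment begin here?
    attribute: Optional[str]   # task 2: attribute label, or None if not an attribute


def group_items(nodes: List[Node]) -> List[Dict[str, List[str]]]:
    """Split nodes into items at segmentation boundaries and collect the
    labeled attribute values of each item."""
    items: List[Dict[str, List[str]]] = []
    for node in nodes:
        if node.starts_item:
            items.append({})   # open a new product segment
        # Nodes before the first boundary (e.g. a page header) belong to no item.
        if node.attribute is not None and items:
            items[-1].setdefault(node.attribute, []).append(node.text)
    return items


# Toy page with two products; the labels stand in for model predictions.
page = [
    Node("Best blenders of the year", False, None),
    Node("Blender A", True, "name"),
    Node("Quiet and powerful.", False, "review"),
    Node("Blender B", True, "name"),
    Node("Cheap but loud.", False, "review"),
]
items = group_items(page)
```

In this sketch the boundary flags decide which product an attribute value belongs to, so a segmentation error would attach a correct attribute to the wrong product. This is one way to see why the abstract treats segmentation and attribute extraction as separate tasks.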
