Paper title
Graph integration of structured, semistructured and unstructured data for data journalism
Paper authors
Paper abstract
Nowadays, journalism is facilitated by the existence of large amounts of digital data sources, including many Open Data ones. Such data sources are extremely heterogeneous, ranging from highly structured (relational databases), semi-structured (JSON, XML, HTML), graphs (e.g., RDF), and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations, or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows. These are difficult to set up not only for arbitrary heterogeneous inputs, but also given that users may want to add (or remove) datasets to (from) the corpus. We describe a complete approach for integrating dynamic sets of heterogeneous data sources along the lines described above: the challenges we faced to make such graphs useful, allow their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.