Paper title
Detecting Quality Problems in Research Data: A Model-Driven Approach
Authors
Abstract
Because scientific progress depends heavily on the quality of research data, the scientific community places strict requirements on data quality. A major challenge in data quality assurance is to localise quality problems that are inherent to the data. Owing to the dynamic digitalisation of specific scientific fields, especially the humanities, different database technologies and data formats may be used for only short periods in order to gain experience. We present a model-driven approach to analysing the quality of research data that abstracts from the underlying database technology. Based on the observation that many quality problems manifest as anti-patterns, a data engineer formulates analysis patterns that are generic with respect to database format and technology. A domain expert chooses a pattern that has been adapted to a specific database technology and concretises it for a domain-specific database format. Data analysts then use the resulting concrete patterns to locate quality problems in their databases. As a proof of concept, we implemented tool support that realises this approach for XML databases. We evaluated the approach with respect to expressiveness and performance in the domain of cultural heritage, based on a qualitative study of quality problems occurring in cultural heritage data.
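The three-stage workflow described in the abstract (generic pattern, adaptation to a database technology, concretisation for a domain format) can be illustrated with a minimal sketch. This is not the authors' tool: the class names `GenericPattern`, `XmlPattern` and `ConcretePattern`, the "missing required field" anti-pattern, and the sample `collection`/`object`/`date` elements are all hypothetical, chosen only to show how the roles might divide the work.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class GenericPattern:
    """Data engineer's view: a technology-independent anti-pattern.

    Hypothetical example: a record whose required field is missing or empty.
    """
    name: str

    def adapt_to_xml(self) -> "XmlPattern":
        # Bind the generic pattern to one database technology (here XML).
        return XmlPattern(self.name)


@dataclass
class XmlPattern:
    """Pattern adapted to XML, but not yet to any particular schema."""
    name: str

    def concretise(self, record_path: str, field_path: str) -> "ConcretePattern":
        # Domain expert's step: fill in the paths of a domain-specific format.
        return ConcretePattern(self.name, record_path, field_path)


@dataclass
class ConcretePattern:
    """Data analyst's view: a pattern that can be run against a database."""
    name: str
    record_path: str
    field_path: str

    def find_problems(self, root: ET.Element) -> List[ET.Element]:
        problems = []
        for record in root.findall(self.record_path):
            field = record.find(self.field_path)
            # Compare with None explicitly: childless elements are falsy.
            if field is None or not (field.text or "").strip():
                problems.append(record)
        return problems


# Hypothetical sample data, not taken from any real cultural heritage format.
SAMPLE = """<collection>
  <object id="a1"><title>Vase</title><date>1850</date></object>
  <object id="a2"><title>Bowl</title><date></date></object>
  <object id="a3"><title>Plate</title></object>
</collection>"""


def demo() -> List[str]:
    root = ET.fromstring(SAMPLE)
    generic = GenericPattern("missing-required-field")
    concrete = generic.adapt_to_xml().concretise("./object", "date")
    return [record.get("id") for record in concrete.find_problems(root)]


if __name__ == "__main__":
    print(demo())
```

In this sketch, only the `concretise` call knows about the domain format, so the same generic pattern could in principle be reused for another schema by supplying different paths.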