Title
Data Lakes for Digital Humanities
Authors
Abstract
Traditional data in Digital Humanities projects come in various formats (structured, semi-structured, textual) and need substantial transformations (encoding and tagging, stemming, lemmatization, etc.) before they can be managed and analyzed. To fully master this process, we propose the use of data lakes as a solution to data siloing and big data variety problems. We describe data lake projects that we currently run in close collaboration with researchers in the humanities and social sciences, and discuss the lessons learned from running these projects.