Paper title

Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)

Authors

Arlene Casey, Mike Bennett, Richard Tobin, Claire Grover, Iona Walker, Lukas Engelmann, Beatrice Alex

Abstract

The design of models that govern diseases in populations is commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated by narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of consistent structure or strong conventions, which prohibits their formal analysis in larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952), evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project, how we enhance optical character recognition (OCR) methods to improve text capture, and our approach to structuring the narratives and identifying relevant entities in the reports. The structured corpus is made available via Solr, enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation and differences with respect to gender as a result of syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most significant concepts that were used to understand the epidemiology of the third plague pandemic around the globe. The corpus enables researchers to analyse the reports collectively, allowing for deep insights into the global epidemiological consideration of plague in the early twentieth century.
