Paper Title

Digitizing Historical Balance Sheet Data: A Practitioner's Guide

Authors

Sergio Correia, Stephan Luck

Abstract

This paper discusses how to successfully digitize large-scale historical micro-data by augmenting optical character recognition (OCR) engines with pre- and post-processing methods. Although OCR software has improved dramatically in recent years thanks to advances in machine learning, off-the-shelf OCR applications still have high error rates, which limits their usefulness for accurately extracting structured information. Complementing OCR with additional methods can, however, dramatically increase its success rate, making it a powerful and cost-efficient tool for economic historians. This paper showcases these methods and explains why they are useful. We apply them to two large balance sheet datasets and introduce quipucamayoc, a Python package that brings these methods together in a unified framework.
