Paper Title

DEXTER: An end-to-end system to extract table contents from electronic medical health documents

Authors

PR, Nandhinee; Krishnamoorthy, Harinath; Srivatsan, Koushik; Goyal, Anil; Santhiappan, Sudarsun

Abstract

In this paper, we propose DEXTER, an end-to-end system to extract information from tables present in medical health documents, such as electronic health records (EHR) and explanation of benefits (EOB). DEXTER consists of four sub-system stages: i) table detection; ii) table type classification; iii) cell detection; and iv) cell content extraction. We propose a two-stage transfer learning-based approach using the CDeC-Net architecture along with non-maximum suppression for table detection. We design a conventional computer vision-based approach for table type classification and cell detection, using kernels parameterized by image size to detect rows and columns. Finally, we extract the text from the detected cells using the pre-existing OCR engine Tesseract. To evaluate our system, we manually annotated a sample of a real-world medical dataset (referred to as Meddata) consisting of documents that vary widely in appearance and cover different table structures, such as bordered, partially bordered, borderless, or coloured tables. We experimentally show that DEXTER outperforms the commercially available Amazon Textract and Microsoft Azure Form Recognizer systems on the annotated real-world medical dataset.
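
The abstract names non-maximum suppression as the step that cleans up table detections but gives no details. The following is a minimal sketch of the standard greedy form of that technique, not the authors' implementation. The box format `(x1, y1, x2, y2)`, the function names `iou` and `non_max_suppression`, and the default `iou_threshold` of 0.5 are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: List[Box],
    scores: List[float],
    iou_threshold: float = 0.5,  # hypothetical default, not from the paper
) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    # Visit candidate boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

For example, given two heavily overlapping table candidates and one separate candidate, `non_max_suppression([(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)], [0.9, 0.8, 0.7])` keeps the first and third boxes and drops the lower-scoring overlap.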