Paper Title
A Conglomerate of Multiple OCR Table Detection and Extraction
Paper Authors
Paper Abstract
Representing information as tables is a compact and concise method that eases searching, indexing, and storage requirements. Extracting and cloning tables from parsable documents is comparatively easy and widely practiced; however, industry still faces challenges in detecting and extracting tables from OCR documents or images. This paper proposes an algorithm that detects and extracts multiple tables from an OCR document. The algorithm uses a combination of image processing techniques, text recognition, and procedural coding to identify distinct tables in the same image and map the text to the appropriate cell in a dataframe, which can then be stored as comma-separated values, in a database, as an Excel file, or in other usable formats.
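The abstract does not give the algorithm's details, so the following is only an illustrative sketch of the final step it mentions: mapping recognized text to table cells and storing the result as comma-separated values. It assumes that an OCR engine has already returned each word with the (x, y) position of its box, and that a single table has already been isolated. The function names (`cluster_1d`, `boxes_to_grid`, `grid_to_csv`), the tolerance values, and the sample data are all hypothetical and are not taken from the paper.

```python
import csv
import io


def cluster_1d(values, tol):
    """Group sorted coordinates whose consecutive gaps are <= tol; return group means."""
    centers, group = [], []
    for v in sorted(values):
        if group and v - group[-1] > tol:
            centers.append(sum(group) / len(group))
            group = []
        group.append(v)
    if group:
        centers.append(sum(group) / len(group))
    return centers


def nearest(centers, v):
    """Index of the cluster center closest to coordinate v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def boxes_to_grid(boxes, row_tol=10, col_tol=30):
    """Place (x, y, text) word boxes into a row/column grid of strings."""
    rows = cluster_1d([y for _, y, _ in boxes], row_tol)
    cols = cluster_1d([x for x, _, _ in boxes], col_tol)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in boxes:
        r, c = nearest(rows, y), nearest(cols, x)
        # Words that land in the same cell are joined in input order.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


def grid_to_csv(grid):
    """Serialize the grid as comma-separated values."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


# Hypothetical OCR output for one small table: (x, y, text).
boxes = [
    (12, 10, "Item"), (118, 11, "Qty"),
    (10, 52, "Bolt"), (121, 50, "40"),
    (11, 90, "Nut"), (120, 91, "75"),
]
grid = boxes_to_grid(boxes)
csv_text = grid_to_csv(grid)
print(csv_text)
```

Clustering coordinates by gap is one simple way to recover rows and columns when no ruling lines are available. A method that relies on image processing, as the abstract describes, would more likely derive cell boundaries from detected table lines, and would first have to separate the distinct tables in the image.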