This paper describes a fully automated approach to analyzing scientific articles in PDF and parsing them into correctly ordered sections plus extra metadata. The result is the body text of the work, in correct reading order, along with a table of contents drawn from titles and subtitles.
The physical and logical layout analysis and the extraction of the body text and table of contents are done in an unsupervised and model-free manner, using only the information provided by the document. This approach differs from several others where some pre-training is required. The other paper in the literature describing analysis of scientific documents (Reference [26]) [1] requires a formal model description for each journal’s layout scheme. Treating ordered text extraction as a noisy problem (slightly incorrect character height/width information, inaccurate font description, varying layout formats by journal), the authors use clustering combined with merge/split steps to iteratively build the body text in a bottom-up manner, starting with individual characters. In particular, k-means and hierarchical agglomerative clustering (HAC) are used.
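To make the bottom-up idea concrete, here is a minimal sketch of agglomerative merging on a single text line: character boxes are repeatedly joined with their closest neighbor until the smallest remaining gap exceeds a threshold. This is an illustration only, not the authors' implementation. The `Box` type, the `merge_closest` function, and the fixed `max_gap` threshold are all assumptions made for the example; the paper derives its decisions from the document's own statistics rather than from a hand-set constant.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A horizontal extent on one text line (hypothetical type for this sketch)."""
    x0: float  # left edge
    x1: float  # right edge
    text: str


def merge_closest(boxes, max_gap):
    """Single-linkage agglomeration along one line.

    Repeatedly merges the two adjacent clusters separated by the smallest
    horizontal gap, stopping once that gap is larger than max_gap.
    """
    clusters = sorted(boxes, key=lambda b: b.x0)
    while len(clusters) > 1:
        # Gap between each cluster and its right-hand neighbor.
        gaps = [clusters[i + 1].x0 - clusters[i].x1
                for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_gap:
            break  # remaining clusters are too far apart to be one word
        left, right = clusters[i], clusters[i + 1]
        clusters[i:i + 2] = [Box(left.x0, right.x1, left.text + right.text)]
    return clusters
```

Applying the same step again with a larger threshold and a vertical distance would group words into lines and lines into blocks, which is the general shape of a bottom-up pipeline.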
Constructing the reading order from blocks is accomplished with a modified version of the method described by Berg et al. [2], using both geometric position on the page and order within the PDF document to determine placement of graphics or tables. A nearest neighbor search (left, right, above, below) is performed for each block of text, yielding a directed neighborhood graph of blocks on the page. This graph can then be used to characterize blocks as headers, footers, graphs, tables, section headers, and so on.
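The four-direction neighbor search can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the `Block` type and `neighbor_graph` function are hypothetical, blocks are axis-aligned rectangles with y growing downward, and a block only counts as a neighbor in a direction if it overlaps the source block along the perpendicular axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """An axis-aligned text block (hypothetical type); y grows downward."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def _overlaps(a0, a1, b0, b1):
    """True if the intervals [a0, a1] and [b0, b1] share positive length."""
    return min(a1, b1) - max(a0, b0) > 0


def neighbor_graph(blocks):
    """Map each block name to its nearest neighbor in each direction.

    The result is a directed graph: graph[a]["right"] == b does not imply
    graph[b]["left"] == a.
    """
    graph = {}
    for b in blocks:
        found = {"left": [], "right": [], "above": [], "below": []}
        for o in blocks:
            if o is b:
                continue
            if _overlaps(b.y0, b.y1, o.y0, o.y1):  # same horizontal band
                if o.x1 <= b.x0:
                    found["left"].append((b.x0 - o.x1, o.name))
                elif o.x0 >= b.x1:
                    found["right"].append((o.x0 - b.x1, o.name))
            if _overlaps(b.x0, b.x1, o.x0, o.x1):  # same vertical band
                if o.y1 <= b.y0:
                    found["above"].append((b.y0 - o.y1, o.name))
                elif o.y0 >= b.y1:
                    found["below"].append((o.y0 - b.y1, o.name))
        # Keep only the closest candidate per direction, or None.
        graph[b.name] = {d: (min(c)[1] if c else None)
                         for d, c in found.items()}
    return graph
```

On a two-column page, a block with no neighbor above and a wide extent is a plausible header or title candidate, which hints at how such a graph could feed block classification.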
A thorough evaluation was done. The block categorization used the GROTAP dataset (zones from scientific articles known to be correctly labeled); the unsupervised algorithm produced labels equivalent to those of GROTAP’s supervised algorithm. To evaluate the quality of body and heading extraction, articles were selected from the PubMed dataset, which includes a structured Extensible Markup Language (XML) file for each article. The algorithm’s results are compared to the output from the ParsCit package, and the authors’ algorithm more closely approaches “ground truth.” Table of contents (TOC) evaluation used the same dataset; quality was based on minimal tree edit distance as compared to the PubMed headings. About half of the document TOCs were extracted correctly, and most “came close,” or at least closer than the ParsCit approach.
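For readers unfamiliar with the TOC metric, the following sketch computes an ordered tree edit distance with unit costs for inserting, deleting, and relabeling a heading. It uses the standard rightmost-root forest recursion with memoization, which is adequate for trees the size of a table of contents. The tree encoding and the `node`, `forest_distance`, and `tree_edit_distance` names are assumptions for this example; the paper's exact cost model and implementation may differ.

```python
from functools import lru_cache


def node(label, *children):
    """Build a hashable tree node: (label, tuple_of_child_nodes)."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of nodes)."""
    if not f and not g:
        return 0
    if not f:
        _, g_kids = g[-1]
        return forest_distance(f, g[:-1] + g_kids) + 1  # insert g's root
    if not g:
        _, f_kids = f[-1]
        return forest_distance(f[:-1] + f_kids, g) + 1  # delete f's root
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    # Delete the rightmost root of f; its children move up one level.
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Match the two rightmost roots, relabeling if their labels differ.
    match = (forest_distance(f_kids, g_kids)
             + forest_distance(f[:-1], g[:-1])
             + (f_label != g_label))
    return min(delete, insert, match)


def tree_edit_distance(a, b):
    """Minimal number of unit-cost edits turning tree a into tree b."""
    return forest_distance((a,), (b,))
```

A distance of 0 corresponds to a TOC that was "extracted correctly," while small positive values correspond to extractions that "came close."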
The authors leave the door open in their design for future inclusion of optical character recognition (OCR)-based approaches, as described by Berg et al. [2]. They also propose table, image, and graphic extraction for future work, as described by Chao and Fan [3].
The paper is well written, and the algorithm evaluation is one of the best (most convincing, repeatable) I have seen in this area of work. It would be even better to know whether the improved scores (for example, 0.79 as compared to 0.76) were statistically significant, but the results certainly make a convincing case for “at least as good as.”