Things aren’t always what they seem: The PDF challenge (accepted)

Image CC-BY

Sometimes text miners struggle to get hold of the textual data they want to mine in the first place. One problem for us is that most scientific publications, especially in the social sciences and humanities, are only available in PDF format, which is not well suited to being read and processed by computers. The OpenMinTeD social sciences work group accepted the challenge to work on this problem.

PDF is not very suitable for extracting and processing text. When you use your favorite PDF viewer, you might not suspect this, because you have no difficulty reading documents and highlighting pieces of text (except in those rare cases where you got a very badly scanned copy from a friend or colleague). That is because the original purpose of PDF is to present documents in a uniform manner, irrespective of hardware and software configuration. But when you try to read and process the text from PDF files using software, you will face a lot of problems.

First of all, the document might not actually contain text at all, only an image. This happens when someone uses a scanner to digitize a document and simply saves the scanned image as a PDF. There are tools that can help through a process called Optical Character Recognition (OCR). Those tools take the image and recognize the characters contained therein, to detect the “real” text that is hidden in the image. The result is rarely perfect and depends on image quality and other features of the document, although for some use cases it’s good enough to work with.

Let us ignore this case for now. Even if the actual characters of the text are in the PDF (we say it is “born digital”), there are still some major problems to face. To see why, let us have a look at the PDF specification. In a PDF document, everything is an object. Those objects include bookmarks, colors, forms, figures, information about fonts and encodings, and so on. So the actual text is hidden among a lot of objects. In fact, if you open the file in a regular text editor, you won’t see the text that your PDF reader shows you; it just looks like gibberish. Why? Because the file does not specify the actual characters, but their respective glyphs, that is, the appearance of each character in a specific font and style. This way, a PDF file always looks exactly like the creator of the file wanted it to, including the styles, colors and spacing used.
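To make this a bit more concrete, here is a minimal sketch in Python. It does not parse a real PDF: the toy content stream, the glyph codes, the TO_UNICODE table and the helper decode_glyphs are all invented for illustration. It also imitates the fact that such streams are commonly stored compressed, which is one more reason a text editor shows noise.

```python
import re
import zlib

# Toy page content: "show glyphs 01 02 03 03 04 using font F1".
# The codes are font-specific glyph codes, not character codes (hypothetical values).
content_stream = b"BT /F1 12 Tf <0102030304> Tj ET"

# Hypothetical glyph-code-to-character table, playing the role of a font's mapping.
TO_UNICODE = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}

# Streams are commonly stored compressed, so the file on disk holds binary data.
stored_bytes = zlib.compress(content_stream)


def decode_glyphs(stream: bytes, table: dict) -> str:
    """Map the hex glyph codes of every Tj (show text) operator back to characters."""
    pieces = []
    for hex_codes in re.findall(rb"<([0-9A-Fa-f]+)>\s*Tj", stream):
        codes = bytes.fromhex(hex_codes.decode("ascii"))
        # Unknown glyph codes become "?" instead of raising an error.
        pieces.append("".join(table.get(code, "?") for code in codes))
    return "".join(pieces)


recovered = decode_glyphs(zlib.decompress(stored_bytes), TO_UNICODE)
print(b"Hello" in stored_bytes)  # is the readable word present in the stored bytes?
print(recovered)                 # text recovered via the glyph table
```

Without the right table for the font, the glyph codes cannot be turned back into characters, which is roughly the situation an extraction tool is in when a PDF lacks usable encoding information.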

So this is a real problem: how to get the actual text out of this file format. Fortunately, there are already quite a few tools that deal with that and produce more or less usable results. Some of them are even specifically tailored to scientific publications, so that they can deal with headings, two-column text, footnotes and the references section, all of which are prevalent in scientific articles.

Our workflow for dealing with the PDF challenge

So we start off with a lot of PDF files of scientific publications, and we want to end up with a format that is usable for our text mining use cases. We chose the PDFX (PDF to XML converter) service for extracting text from our articles, because it is specifically tailored to scientific publications and generates higher-quality output than the other PDF readers we tested. It takes a PDF file as input and outputs an XML file with markup for various text elements in the document, for instance paper title, authors, section titles, and main text. The service imposes some restrictions though, for example on document length and file size, and on the language of the textual content: it uses dictionaries internally, and only a few languages are supported. There is also an option to run a sentence splitter in the process, which performs better on scientific text than other sentence splitter components.
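As a rough illustration of what working with such marked-up output looks like, here is a minimal Python sketch that reads selected elements from an XML document. The sample XML and its element names (title, abstract, paragraph, caption) are hypothetical simplifications and not the actual PDFX schema; read_elements is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a converter's XML output.
SAMPLE_XML = """
<document>
  <title>Things are not always what they seem</title>
  <abstract>We study text extraction from PDF.</abstract>
  <section>
    <heading>Introduction</heading>
    <paragraph>PDF is hard to mine.</paragraph>
    <caption>Figure 1: A scanned page.</caption>
    <paragraph>Tools exist to help.</paragraph>
  </section>
</document>
"""


def read_elements(xml_text: str, wanted: list) -> dict:
    """Collect the text of the wanted element types, in document order."""
    root = ET.fromstring(xml_text.strip())
    collected = {tag: [] for tag in wanted}
    for element in root.iter():
        if element.tag in collected:
            # itertext() also gathers text nested in child elements.
            collected[element.tag].append("".join(element.itertext()).strip())
    return collected


parts = read_elements(SAMPLE_XML, ["title", "abstract", "paragraph"])
print(parts["title"])
print(parts["paragraph"])
```

Because the markup labels each piece of text, a reader can keep the parts it needs and silently skip the rest (here, the caption and the heading).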

Even though PDFX performs well on our data, it still has many drawbacks, as do the other PDF readers we tested. This is due to the existence of many different layouts for texts in various printed forms (e.g. conference proceedings, journals, reports, theses). Since these PDF readers are tested against a small set of papers, it is to be expected that their output contains some errors. Apart from this, it is also challenging to distinguish elements like table contents, captions, footnotes and sidenotes from the main text.

We implemented a component that sends requests to the PDFX HTTP interface to convert PDF documents to XML format. The XML output of PDFX is then further processed by another component that converts it to XMI. The XML Metadata Interchange (XMI) format is a standard for expressing object graphs in XML and is used as a serialization format for the UIMA Common Analysis System (CAS). UIMA is a well-known framework in text mining, and since we plan to make use of components from the DKPro software repository in our use case implementation, we decided to use this format for seamless integration.
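The request step could look roughly like the following Python sketch. It is not our actual component: the endpoint URL is a placeholder, and sending the raw PDF bytes in a POST request with an application/pdf content type is an assumption about how such a conversion service is called. The function names are invented for this example.

```python
import urllib.request

# Placeholder endpoint (assumption): replace with the address of the service you use.
PDFX_URL = "https://pdfx.example.org/"


def build_conversion_request(pdf_bytes: bytes, url: str = PDFX_URL) -> urllib.request.Request:
    """Prepare a POST request that carries the PDF as the request body."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def convert_pdf_to_xml(pdf_path: str, url: str = PDFX_URL, timeout: float = 120.0) -> str:
    """Send one PDF file to the service and return the XML response as text."""
    with open(pdf_path, "rb") as handle:
        request = build_conversion_request(handle.read(), url)
    # Network call: raises urllib.error.URLError or HTTPError on failure.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Separating the construction of the request from the network call keeps the first part easy to check without contacting the service. A real component would also have to respect the service's limits on file size and document length mentioned above.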

Our PDFX XML reader component stores the document title, abstract, main text and footnotes, and ignores everything else (like captions and page headers). The reader component tries to detect paragraph boundaries as well, since this helps a lot in further text analysis tasks. During the conversion process, we also need to fix some hyphenation errors that were introduced by the PDF to text conversion. For example, if a word like “substantial” is broken across two lines, so that “sub-” appears at the end of one line and “stantial” at the beginning of the next, the result will be something like “sub- stantial”. Luckily, there is a component in the DKPro Core component collection that deals with hyphenation removal, so we make use of that to clean our text. Additionally, we try our best to omit tables and the captions of tables and figures from the output.
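The idea behind hyphenation removal can be sketched in a few lines of Python. This is not the DKPro Core component, only a simplified illustration under the assumption that a word list is available: the two halves are joined only if the result is a known word. The tiny VOCABULARY and the function remove_hyphenation are made up for this example.

```python
import re

# Tiny stand-in vocabulary (assumption); a real component would load a full word list.
VOCABULARY = {"substantial", "publication", "mining"}

# A word fragment ending in a hyphen, one space, then the next fragment.
HYPHEN_BREAK = re.compile(r"(\w+)- (\w+)")


def remove_hyphenation(text: str, vocabulary: set) -> str:
    """Join 'sub- stantial' into 'substantial' when the joined form is a known word."""

    def join_if_word(match: re.Match) -> str:
        joined = match.group(1) + match.group(2)
        if joined.lower() in vocabulary:
            return joined
        # Leave everything else untouched, e.g. "pre- and post-processing".
        return match.group(0)

    return HYPHEN_BREAK.sub(join_if_word, text)


print(remove_hyphenation("a sub- stantial improvement", VOCABULARY))
print(remove_hyphenation("pre- and post-processing", VOCABULARY))
```

Checking against a word list avoids wrongly gluing together constructions where the hyphen followed by a space is intentional.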

The final XMI files can now be fed into other UIMA-compliant components. In our workflow, the next step is the annotation process: our annotation tool WebAnno supports XMI files as input. The annotated documents can be exported to XMI again, or to some other CAS serialization format. They can also be fed directly into another UIMA (DKPro) component in a pipeline, for example a Named Entity Recognizer that uses the annotations as training data and can then automatically produce annotations on new data.


In this blog post, we have laid out a general problem that arises for every text miner who has to deal with input in PDF format. We then explained how we deal with this problem in the development of our use case applications. In the OpenMinTeD project, we will make our preprocessing component available, so that everyone who has their own text mining use case and has to start from PDF documents can easily get machine-readable text out of them and feed it into any other component for further processing.

This blogpost was written by Mandy Neumann (GESIS) and Masoud Kiaeeha (TU Darmstadt). Image by Martine Oudenhoven (LIBER).