You probably take PDFs for granted. Having a document appear the same on any computer, in different programs, is now expected. We don't think about how it's done, and we tend to assume it's pretty straightforward-- certainly not an impossibly difficult problem. With all of the amazing things our computers can do these days, how hard could managing text and maybe a few pictures really be?
The answer is, incredibly.
To explain, I have to go back to 1993, when Adobe first developed the format. Few people under 40 realize just how revolutionary it was at the time. Files never looked the same on different computers-- often they couldn't even be opened! And at the time, nearly everything had to be printed out, so you crossed your fingers and hoped the result on paper would look the same as it did on your screen. Clearly, solutions designed for text-only computing were no longer acceptable. People wanted to save their work to a floppy disk and open it on a different computer. But there were dozens of document editing programs, with wildly different internals, plus all kinds of printers compatible with only one specific type of computer and operating system. How on earth could you develop a file format that produced identical results no matter what your setup was?
The answer was pretty ingenious. What is the only way you can be sure a specific word appears at a specific place on the page? You use an xy-coordinate system! Every PDF file encodes "objects" formed out of letters (or small groups of them) and any pictures or figures in the document. These objects are saved completely independently, with no reference to one another, which lets the program store them in any order and optimize for file size (remember, it's 1993, and these files have to fit on a 1.44 MB floppy disk). When the page has to be rendered on your screen (or for printing), the program takes a "stream" of these objects and "paints" them onto the page at their recorded coordinates.
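To make that "painting" model concrete, here is a minimal sketch in Python using the reportlab library (my choice purely for illustration; any PDF-writing library would do, and the file name and coordinates are arbitrary). Each string is painted at an explicit xy-position; nothing in the file says the two strings belong to the same sentence.

```python
# A minimal sketch of the PDF "painting" model, using reportlab
# (an illustrative choice; any PDF-writing library would work).
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=(612, 792))  # US Letter, in points

# Each drawString call records a piece of text painted at an (x, y)
# position measured from the bottom-left corner of the page.
c.drawString(72, 720, "Hello,")
c.drawString(120, 720, "world!")  # a separate object; the file does not
                                  # record that it continues the sentence

c.showPage()
c.save()
```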
As I alluded to before, this format was really optimized for printing; it was meant for finished documents, not for anything that might need further editing. Nor was the format meant to be a two-way street with other document formats (or even plain text): documents were converted to PDFs, but PDFs were never converted to anything else.
The same flexibility that lets PDF describe a page perfectly, no matter what produced it, also causes significant drawbacks. Since the "objects" can consist of a single character, a fragment of a word, or some combination, no information about the layout or larger blocks of text is encoded. All of the structure that we as human readers perceive comes from the whitespace, which is not directly encoded at all. One of the biggest hurdles in this area is maintaining reading order across columns of text. Rendering characters is another major problem. The text objects are actually made up of one or more "glyphs," the shapes that represent characters, and the font is what maps those glyph shapes to the characters displayed. Fonts can be embedded within the PDF, but often they are not. This is fine for very standard fonts, but can cause a huge number of problems for others.
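You can see this glyphs-plus-coordinates view directly with an extraction library. The sketch below uses pdfminer.six (my choice of tool, not anything prescribed by the format; "paper.pdf" is a placeholder) to print every character it finds along with its font name and bounding box. What comes back is individual positioned glyphs; words, lines, and columns all have to be reconstructed from the geometry.

```python
# A sketch of what a PDF actually stores: positioned glyphs, not words.
# Uses pdfminer.six; "paper.pdf" is a placeholder file name.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

for page_layout in extract_pages("paper.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for line in element:                  # text lines pdfminer inferred
                for obj in line:
                    if isinstance(obj, LTChar):   # skip inferred spaces (LTAnno)
                        print(f"{obj.get_text()!r} "
                              f"font={obj.fontname} "
                              f"bbox={tuple(round(v, 1) for v in obj.bbox)}")
```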
PDFs can also pose major accessibility issues. While Adobe claims accessibility for disabled users as one of the benefits of the PDF format, in practice this is often not the case. Ahmet Çakir, in his paper "Usability and Accessibility of Portable Document Format," examines the case of an important legal document from the government of Germany. The document was created by a publishing company in 2004; in 2012 it was slightly altered and again converted into a PDF. Although the only difference was the version of Adobe's software used, there were hundreds of differences in the resulting document, some of which rendered it unreadable by screen readers. This is a perfect example of the accessibility problem with PDFs: if the text cannot be accurately extracted, blind users may receive inaccurate or confusing information, or may be unable to access the document at all.
PDFs are also meant to ensure "preservation of document fidelity." While this was certainly true when measured against other tools and formats of the time (that is, the early 1990s), the problems above mean it cannot always be assured. What is more, there is often no error message or any other way to tell whether characters or whitespace have been altered or omitted, short of a human checking the result carefully. This "silent failure" therefore often goes completely unnoticed.
Many of these problems affect the ease and accuracy of text extraction from PDFs. Of course, if a character did not even make it into the PDF from the original source document, it will not make it into the extracted text. But even when the PDF looks fine visually, the way fonts and glyphs are saved in the file can prevent text extraction. Some PDF generators use custom font encodings to make the resulting file more compact; this can turn the extracted text into "gibberish."
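Some of this gibberish is at least detectable after the fact. Extraction tools tend to leave telltale artifacts when they cannot map a glyph back to a character; pdfminer.six, for example, falls back to markers like "(cid:42)", and other tools emit the Unicode replacement character. Here is a rough sanity check over already-extracted text (the threshold is an arbitrary assumption of mine, not a recommendation from any tool):

```python
import re

# Telltale signs that glyphs could not be mapped back to characters:
# pdfminer-style "(cid:NN)" markers and the Unicode replacement character.
CID_MARKER = re.compile(r"\(cid:\d+\)")
REPLACEMENT_CHAR = "\ufffd"

def looks_garbled(text: str, max_bad_ratio: float = 0.01) -> bool:
    """Flag extracted text that is probably unreliable.

    The 1% threshold is purely illustrative.
    """
    if not text:
        return True
    bad = len(CID_MARKER.findall(text)) + text.count(REPLACEMENT_CHAR)
    return bad / len(text) > max_bad_ratio

# Better to warn than to fail silently.
sample = "The re(cid:64)ults sh(cid:64)w a 12% improvement."
print(looks_garbled(sample))  # True
```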
Additionally, the spacing between words and characters adds more problems, since word segmentation algorithms have to guess where one word ends and the next begins, and they sometimes guess wrong. Columns are among the biggest culprits, particularly for reading order errors. Tables can be even more difficult, as they are inherently column-based; they also break up the flow of the surrounding text, making paragraph boundary detection harder.
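Extraction tools expose knobs for exactly these whitespace judgments. The sketch below uses pdfminer.six's LAParams (again my choice of tool; the file name and values are placeholders) to show that word grouping and the ordering of text boxes, and therefore column reading order, depend on tunable thresholds rather than on anything stored in the file.

```python
# Word segmentation and reading order are inferred, not stored.
# Uses pdfminer.six; "twocol.pdf" is a placeholder for a two-column paper.
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Default layout analysis: gaps wider than word_margin become spaces, and
# boxes_flow weighs horizontal vs. vertical position when ordering the text
# boxes, which is what decides how columns are serialized.
default_text = extract_text("twocol.pdf", laparams=LAParams())

# Different guesses about whitespace produce different text from the same file.
alt_text = extract_text(
    "twocol.pdf",
    laparams=LAParams(word_margin=0.05,  # insert spaces at smaller gaps
                      boxes_flow=None),  # order boxes purely by position
)

print(default_text[:300])
print("---")
print(alt_text[:300])
```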
Finally, one of the major problems in text extraction that is particularly troublesome in my field (Computational Linguistics) and in other STEM fields is the inability to handle mathematical formulæ. Not only are the characters themselves challenging to deal with, but formulæ are often drawn with vector operations, so a symbol's vertical position can vary, and vertical position (it is what distinguishes a subscript from a superscript) can obviously change an equation a great deal! Most text extraction programs-- even those specifically designed for use with scientific papers-- do not even attempt to include formulæ, either skipping them completely or inserting "[formula]". Leaving out formulæ entirely is not a good solution, but accuracy is also of paramount importance, and we want to avoid that silent failure.
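There is no good general fix here, but at minimum we can avoid the silent part of the failure. The sketch below is a crude heuristic of my own (not something the extraction tools mentioned above actually do): it flags extracted lines containing formula-like characters so that a human or a downstream filter can treat them with suspicion instead of trusting them blindly.

```python
import unicodedata

# Characters that rarely appear in running English text but are common in
# extracted formula residue. Purely illustrative; tune for your own corpus.
SUSPICIOUS_CATEGORIES = {"Sm"}      # Unicode "Symbol, math" (e.g. ∑, ≤, ∀)
SUSPICIOUS_CHARS = set("^_{}\\")    # leftovers from sub/superscripts and TeX

def maybe_formula(line: str, threshold: int = 2) -> bool:
    """Flag a line of extracted text that may be mangled mathematics.

    The threshold of two suspicious characters is arbitrary, for illustration.
    """
    hits = sum(
        1
        for ch in line
        if ch in SUSPICIOUS_CHARS
        or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
    )
    return hits >= threshold

for line in ["The model minimizes a loss function.",
             "L(θ) = ∑ i=1 n (y i − f(x i ))²"]:
    print(maybe_formula(line), line)
```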
These problems are unlikely to be solved any time soon, as backwards compatibility is so crucial that it prevents changing the format significantly. Adobe did release "tagged PDF" in 2001, which adds the ability to encode document structure information in the file. Tagging allows much better text extraction, but it is an optional feature, and retrofitting tags onto older documents is rarely practical. This means that fixing the problem is largely up to us in NLP. One partial solution is being developed in my old department at the University of Washington, using a sort of extended XML to mark up extracted text. There are also many machine-learning-based solutions in development, such as the newer versions of the Tesseract OCR engine.
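Since tagging is optional, it can be worth checking up front whether a given file is tagged at all before trusting any structure-aware extraction. Here is a rough sketch assuming the pypdf library (my choice, not mentioned above); it relies on the /MarkInfo and /Marked entries that the PDF specification defines for tagged PDFs, and "paper.pdf" is a placeholder.

```python
# Rough check for a tagged PDF: the document catalog of a tagged file
# carries /MarkInfo << /Marked true >>. Uses pypdf as an illustration.
from pypdf import PdfReader

def is_tagged(path: str) -> bool:
    reader = PdfReader(path)
    catalog = reader.trailer["/Root"]          # the document catalog
    mark_info = catalog.get("/MarkInfo")
    if mark_info is None:
        return False
    mark_info = mark_info.get_object()         # resolve an indirect reference
    marked = mark_info.get("/Marked", False)
    return bool(getattr(marked, "value", marked))

print(is_tagged("paper.pdf"))
```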
Next week there will be a part 2 of this PDF journey, on the beauty and pitfalls of Unicode. I like Unicode.
NB: This post was based on research I did for my master's project at the University of Washington. You can find the whole project and the full paper in my GitHub repository.
Sources:
Adobe Systems Incorporated. PDF Reference, version 1.7. 6th edition, 2006.
Ahmet Çakir. "Usability and Accessibility of Portable Document Format." Behaviour and Information Technology, 2016.
Jörg Tiedemann. "Improved Text Extraction from PDF Documents for Large-Scale Natural Language Processing." Lecture Notes in Computer Science, 2014.