As mentioned in my last post, many problems with PDFs (specifically with extracting text from them) are due to the way that characters and fonts are used and stored (or not) in the file. As a refresher, characters are not saved directly; instead, text objects are saved in the content string as glyphs. A glyph is a specific rendering of a character, and a font defines the glyphs for each character. This distinction is actually a central tenet of Unicode and not unique to the PDF format. However, the relationship between character and glyph is not one to one. Some characters have two (or potentially more) corresponding glyphs, such as characters in Arabic or Hebrew scripts, where a character takes different glyphs depending on its position within the word. And in the case of ligatures, for example, one glyph actually represents two characters.
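To make the many-to-one case concrete, here is a minimal Python sketch (standard library only) of what extracted text can look like when a ligature glyph survives into the output, and how Unicode's compatibility normalization splits it back into the two characters it represents:

import unicodedata

# Extracted text sometimes contains the single ligature code point rather than "f" + "i".
extracted = "\ufb01nd"   # U+FB01 LATIN SMALL LIGATURE FI followed by "nd"

print(extracted)                      # renders as "find"
print(len(extracted))                 # but it is only 3 code points
print("find" in extracted)            # False: a plain search for "find" misses it

# NFKC (compatibility) normalization replaces the ligature with the two characters it stands for.
normalized = unicodedata.normalize("NFKC", extracted)
print(normalized, len(normalized))    # "find" 4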
I want to explain a little bit about how PDF encodes fonts, in order to show why they can cause so many problems. There are four main types of fonts used in PDFs. Type 1 and TrueType fonts are the most common and generally pose no problem, particularly the former, which is Adobe's own font format. Type 0 fonts are also called composite fonts, as they reference multiple other fonts. They are commonly used with non-Latin scripts and generally do not cause many errors. Problems occur much more frequently with Type 3 fonts. These fonts can be generated on the fly by the program producing the PDF and can use any and all of the available graphical operators. They don't necessarily (and usually don't) contain any metadata about the glyphs or their mapping to Unicode characters. This mapping from glyph to Unicode character is an incredibly difficult problem and may be considered the very foundation of PDF text extraction. Some of the many sources of problems are diacritics, homoglyphs, legacy compatibility, and the related issue of canonical equivalence.
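As a rough illustration of what this looks like inside a file, here is a sketch that lists the fonts used on each page of a PDF and whether each one carries a ToUnicode CMap (the table that maps glyph codes back to Unicode characters). It assumes the third-party pypdf library and a hypothetical file name; real PDFs vary quite a bit in how resources are structured, so treat this as a starting point rather than a robust tool:

from pypdf import PdfReader  # third-party library; assumed to be installed

reader = PdfReader("example.pdf")  # hypothetical file name

for page_number, page in enumerate(reader.pages, start=1):
    resources = page.get("/Resources")
    if resources is None:
        continue
    fonts = resources.get_object().get("/Font")
    if fonts is None:
        continue
    for name, ref in fonts.get_object().items():
        font = ref.get_object()
        subtype = font.get("/Subtype")    # /Type1, /TrueType, /Type0, or /Type3
        has_map = "/ToUnicode" in font    # glyph-code-to-Unicode CMap, if the producer included one
        print(f"page {page_number}: {name} {subtype} "
              f"{'has ToUnicode' if has_map else 'no ToUnicode'}")

A Type 3 font with no ToUnicode entry is exactly the case where an extractor has to guess what character each glyph was meant to be.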
These can lead to unexpected results, particularly for linguists, who often work with multiple languages, potentially with multiple scripts, as well as characters from the International Phonetic Alphabet (IPA).
One major problem is homoglyphs, or different characters that are rendered as identical (or nearly identical) glyphs. For example, the Latin character <A> appears identical to the Cyrillic character <А>. This is extremely widespread among IPA characters; one example is that the alveolar click <ǃ> actually has a different Unicode code point (and is therefore a different character) than the exclamation point <!>, despite looking identical. Different fonts can exacerbate this problem.
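A quick way to see that homoglyphs really are distinct characters is to compare their code points directly; here is a small standard-library sketch:

import unicodedata

pairs = [
    ("A", "\u0410"),   # Latin capital A vs. Cyrillic capital A
    ("!", "\u01C3"),   # exclamation point vs. the IPA click letter
]

for left, right in pairs:
    print(left == right)   # False both times: identical appearance, different characters
    for ch in (left, right):
        # Note that official Unicode character names do not always match current
        # IPA terminology (U+01C3 is named after an older term for this click).
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")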
A similar, though distinct, phenomenon is canonical equivalence. This is when a character can exist either as a single code point or as a combination of a base character and a combining character. This is mainly present for legacy compatibility (which is ultimately the root of many issues with Unicode encoding). Canonical equivalence differs slightly from ordinary homoglyphs: canonically equivalent sequences do not merely look identical, they should be treated as exactly the same by any program. One example is the character <ñ>. This can (and should) be represented as a single precomposed character with code point U+00F1. However, the sequence of <n> (U+006E) followed by the combining tilde (U+0303) can also be used. That form is intended for legacy compatibility, but it still shows up in extracted text today.
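The practical consequence is that two byte-for-byte different strings can be "the same" text, which is why extracted text usually needs to be normalized before searching or comparison. A minimal sketch with Python's standard library:

import unicodedata

precomposed = "espa\u00F1ol"     # ñ as the single code point U+00F1
decomposed = "espan\u0303ol"     # n (U+006E) followed by COMBINING TILDE (U+0303)

print(precomposed == decomposed)                                  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)    # True: canonically equivalent
print(unicodedata.normalize("NFD", precomposed) == decomposed)    # True in the other direction as well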
To be extra confusing, there are a few cases that look like they should be canonically equivalent but explicitly are not: the cedilla, the ogonek (a reversed cedilla), and the horn.
You can see why diacritics are a major source of problems for text extraction. Even as common a character as <ü> is often converted into two separate glyphs, one for the <u> and another for the umlaut diacritic. The use of multiple diacritics can lead to even more problems, especially if the diacritics are supposed to be stacked above or below each other (as is common in IPA transcription). The more Unicode blocks a text draws on, and the more obscure those blocks are, the more likely errors become; characters from the Basic Latin block almost never cause any errors. This is a major problem for linguists: while there is a block named "IPA Extensions", it does not include characters that are already present in previously assigned blocks, which leaves the full set of IPA characters spread across 12 different blocks. Non-phoneticians are not exempt from these issues, however; even Spanish requires two additional blocks beyond Basic Latin.
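To see what this decomposition looks like in practice, here is a small standard-library sketch that counts code points before and after recomposition and then lists which code points in an IPA-style token are combining marks (the particular diacritics are just illustrative):

import unicodedata

# Extracted text frequently stores <ü> as <u> plus a separate combining diaeresis.
extracted = "u\u0308ber"                              # "über" as 5 code points
print(len(extracted))                                 # 5
print(len(unicodedata.normalize("NFC", extracted)))   # 4 once <u> and the diaeresis recombine

# Stacked diacritics, as in IPA transcription, often have no single precomposed form.
token = "e\u0303\u031E"   # e + combining tilde (above) + combining down tack (below)
for ch in token:
    kind = "combining mark" if unicodedata.combining(ch) else "base character"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({kind})")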
There's a great book on the problems with Unicode that linguists in particular may run into, called The Unicode Cookbook for Linguists by Moran and Cysouw, from Language Science Press. (Incidentally, Language Science Press has some pretty awesome books, all available for free online, and they also make the GitHub repository of LaTeX files for each book available. I used some of these in my PDF text extraction research, and they are a great resource.)