How Does a PDF Store Text

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), how is text stored by the PDF? I would hope that the resulting document isn't OCR'ed in this particular scenario the way it would be if the original PDF document had been created from an image.

To get a bit more detailed, I am trying to understand how text extractors for PDFs work. My initial understanding of PDF was that it stored (PostScript) instructions on how to draw the "image" of the document to a page or a printer, and that there was no actual text contained within the document itself. Subsequently, I was thinking that a text extractor might reverse-engineer such instructions to generate the text that the PDF would otherwise generate. I am not confident of this, though.

Agnomen answered 25/3, 2013 at 19:0 Comment(1)
I'm curious: how exactly is this "off topic"? - Agnomen

A PDF contains several different types of objects, not only vector or raster drawing instructions. Text in particular is represented by text elements. Each of these includes a string of characters that should be drawn at a certain position using a specific font.
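To make that concrete, here is a minimal sketch of how an extractor might read text elements out of a page's content stream. The stream fragment is hand-written for illustration (it is not from a real file), the `text_elements` helper is hypothetical, and the tokenizer is deliberately simplified: it assumes uncompressed input and literal strings without escapes or nested parentheses. Only the standard operators `BT`/`ET`, `Tf`, `Td` and `Tj` are handled.

```python
import re

# Hand-written fragment of a page content stream (illustrative only).
CONTENT_STREAM = """
BT
/F1 12 Tf
72 720 Td
(Hello, ) Tj
/F2 12 Tf
(world) Tj
0 -14 Td
(Second line) Tj
ET
"""

# Simplified tokenizer: literal strings without escapes or nested
# parentheses, names, numbers, and operators. A real parser needs more.
TOKEN = re.compile(
    r"\((?P<string>[^()\\]*)\)"
    r"|(?P<name>/[^\s()/]+)"
    r"|(?P<number>[-+]?\d*\.?\d+)"
    r"|(?P<op>[A-Za-z]+)"
)


def text_elements(stream):
    """Return (font, size, line_origin, text) for each Tj operator."""
    operands = []
    font, size = None, None
    x = y = 0.0
    found = []
    for match in TOKEN.finditer(stream):
        kind = match.lastgroup
        value = match.group(kind)
        if kind != "op":
            operands.append(value)  # operands precede their operator
            continue
        if value == "BT":      # begin text object: reset the position
            x = y = 0.0
        elif value == "Tf":    # select a font resource and a size
            font, size = operands[0], float(operands[1])
        elif value == "Td":    # move to the start of a new line
            x += float(operands[0])
            y += float(operands[1])
        elif value == "Tj":    # show a string using the current font
            found.append((font, size, (x, y), operands[0]))
        operands = []
    return found


for element in text_elements(CONTENT_STREAM):
    print(element)
```

The position reported here is only the origin of the current line. Working out where "world" actually starts would require the glyph widths of the font, which this sketch does not model.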

Text extraction from PDFs can be a complicated affair because the file format is oriented toward page layout rather than logical structure. A text element may be an entire paragraph or a single character. Even a single word may consist of several text elements if different typefaces are mixed. Also, the characters are not necessarily encoded in a standard encoding such as Unicode. They may be encoded in a way specific to a particular font.
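The font-specific encoding problem can be sketched as follows. The tables below are made up for illustration and stand in for the role a font's ToUnicode CMap plays in a real file. The `decode` helper and the font names `/F1`, `/F2` and `/F3` are hypothetical, and single-byte character codes are assumed for simplicity.

```python
# Hypothetical code-to-Unicode tables, one per font resource, standing in
# for the ToUnicode CMap that a real PDF font may (or may not) carry.
TO_UNICODE = {
    "/F1": {1: "P", 2: "D", 3: "F"},
    "/F2": {1: "t", 2: "e", 3: "x"},
}

REPLACEMENT = "\ufffd"  # emitted when a character code cannot be mapped


def decode(font, codes):
    """Map the character codes of one text element to Unicode."""
    table = TO_UNICODE.get(font, {})
    return "".join(table.get(code, REPLACEMENT) for code in codes)


print(decode("/F1", b"\x01\x02\x03"))      # PDF
print(decode("/F2", b"\x01\x02\x03\x01"))  # text
print(decode("/F3", b"\x01\x02\x03"))      # three replacement characters
```

The same byte values decode to different characters depending on the font. When a font has no usable mapping (the `/F3` case), the codes cannot be turned back into text by this approach at all.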

If you are lucky enough to deal with Tagged PDF files, such as PDF/UA or the level-A conformance variants of PDF/A (for example PDF/A-1a), text extraction can be a lot easier because text spans are identified as such, and a mapping to Unicode characters is defined.

Wikipedia doesn't have the complete specification but does serve as an introduction: http://en.wikipedia.org/wiki/Portable_Document_Format#Text

Excel answered 25/3, 2013 at 19:6 Comment(6)
So is it safe to say that, because a text element merely tells the rendering engine what to draw and where, there is no context when you extract text from a PDF? - Agnomen
You can say that. PDF says "here's a block of text" but it doesn't tell you if it's a paragraph, a title, or a table. This makes extracting pure text from PDF complicated. - Excel
@Joni, it can get worse than that: you may have a PDF with reduced font information, in which case you cannot even tell which Unicode or ANSI character corresponds to a particular PDF character. It can also get better: you may have a tagged PDF, which may contain paragraph/title/line information. In a general-purpose app, though, you cannot assume anything. - Pimply
Thanks @yms, I'll make a note of that. - Excel
It might be worth looking at the Text section of the PDF Reference too, if you really want to get deep into how it works and is stored. - Crepe
@LyndonArmitage I did begin to read the Text section of the spec. I was really only trying to confirm something I had been spouting off at the office (regarding a PDF not storing text, but rather the instructions for drawing something that would end up resembling text). I have since confirmed that I was mistaken :) When I searched for articles describing how PDFs store text, I didn't find anything that was straight to the point (like Mark Stephens' articles). My initial search for the spec turned up the ISO website and a cost of $250. The answer I sought wasn't that important! - Agnomen
