Extracting text from garbled PDF [closed]

I have a PDF file with valuable textual information.

The problem is that I cannot extract the text; all I get is a bunch of garbled symbols. The same happens if I copy and paste the text from the PDF reader into a text file. Even File -> Save as text in Acrobat Reader fails.

I have used every tool I could get my hands on, and the result is the same. I believe this has something to do with font embedding, but I don't know what exactly.

My questions:

  • What is causing this text garbling?
  • How can I extract the text content from the PDF (programmatically, with a tool, by manipulating the bits directly, etc.)?
  • How can I fix the PDF so that the text does not garble on copy?
Yeta answered 29/8, 2012 at 18:30 Comment(1)
I reworked the question, as it can perfectly fit SO; indeed, PDF files are a common file format for automated text extraction, and the existing answers already explain how to programmatically check for this issue and fix it (I can also add an answer with a code snippet to do OCR). I vote to reopen the question, as it may prove useful to other developers.Selfrespect

I asked a lot of people for help, and OCR is the only solution to this problem.
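
For anyone who wants to automate this route, here is a minimal OCR sketch using the pdf2image and pytesseract Python packages (my own choice of libraries, not something prescribed in this thread). It assumes the Tesseract engine and Poppler are installed on the system:

    # Minimal OCR sketch: rasterize each PDF page, then run Tesseract on the image.
    # Assumes the Tesseract engine and Poppler are installed, plus the pdf2image
    # and pytesseract packages (the library choice is an assumption, not from the thread).
    from pdf2image import convert_from_path
    import pytesseract

    def ocr_pdf(path: str, dpi: int = 300) -> str:
        """Return the OCR'd text of every page, pages separated by form feeds."""
        pages = convert_from_path(path, dpi=dpi)  # one PIL image per page
        return "\f".join(pytesseract.image_to_string(page) for page in pages)

    if __name__ == "__main__":
        print(ocr_pdf("garbled.pdf"))

Accuracy depends heavily on the rendering resolution; 300 dpi is a common starting point.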

Yeta answered 31/8, 2012 at 17:27 Comment(4)
If you use Microsoft Office, OneNote has very decent OCR; it worked with 100% accuracy for me on a PDF document exhibiting the above-mentioned problem.Morning
I love how crazy that solution is.. :)))))Quinonoid
What is OCR? Could you please explain a bit? I have the same problem.Roseroseann
@Sodhisaab Optical character recognition. I used github.com/tesseract-ocr/tesseractYeta

Some PDF files are produced without special information that is crucial for successfully extracting text from them, even by Adobe's own tools. Basically, such files do not contain glyph-to-character mapping information.

Such files will be displayed and printed just fine (because the shapes of the characters are properly defined), but the text in them can't be properly copied / extracted (because there is no information about the meaning of the glyphs/shapes used).

For example, Distiller produces such files when the "Smallest File Size" preset is used.

Other than OCR, there is no way to retrieve text from such files, I'm afraid. We recently published a guide on how to OCR PDFs in .NET.

We also have sample code that shows how to perform OCR on unmapped characters and then replace them with the correct Unicode values.


Supplementing the original answer

The original answer mentioned the "information about meaning of used glyphs/shapes". This information should be contained in a PDF structure called a /ToUnicode table. Such a table is required for each and every font which is embedded as a subset and uses non-standard (Custom) encoding.

In order to quickly evaluate the chances of extracting the text contents, you can use the pdffonts command line utility. It prints, in tabular form, a series of items about each font used by the PDF. The presence of a /ToUnicode table is indicated by the column headed uni.

A few example outputs:

$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-good.pdf

    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes yes     13  0


$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad1.pdf

    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes no      12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0


$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad2.pdf

    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0

The good.pdf lets you extract the text contents for both fonts correctly, because both fonts have an accompanying /ToUnicode table.

For the bad1.pdf and the bad2.pdf the text extraction succeeds only for one of the two fonts, and fails for the other, because only one font has a /ToUnicode table.

I, Kurt Pfeifle, have recently created a series of hand-coded PDF files to demonstrate the influence of existing, buggy, manipulated or missing /ToUnicode tables in the PDF source code. These PDFs are extensively commented and suitable for exploring with the help of a text editor. The pdffonts output examples above were created with the help of these hand-coded files. (There are a few more PDFs showing different results, which an interested reader may want to explore...)
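
For a programmatic version of the same check, here is a rough sketch that walks the font dictionaries of each page and reports whether a /ToUnicode entry is present. It uses the pypdf library, which is an assumption on my part; any PDF library with low-level dictionary access would do:

    # Rough sketch: report, per page and per font, whether a /ToUnicode CMap exists.
    # Uses pypdf (an assumption); fonts without /ToUnicode are text-extraction risks.
    from pypdf import PdfReader

    def report_tounicode(path: str) -> None:
        reader = PdfReader(path)
        for page_no, page in enumerate(reader.pages, start=1):
            resources = page.get("/Resources")
            if resources is None:
                continue
            fonts = resources.get_object().get("/Font")
            if fonts is None:
                continue
            for name, ref in fonts.get_object().items():
                font = ref.get_object()  # resolve the indirect reference
                has_map = "/ToUnicode" in font
                print(f"page {page_no}  {name}  {font.get('/BaseFont')}  "
                      f"ToUnicode: {'yes' if has_map else 'NO'}")

    if __name__ == "__main__":
        report_tounicode("textextract-bad1.pdf")

This mirrors what the uni column of pdffonts tells you, only from inside your own code.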

Alegre answered 30/8, 2012 at 5:7 Comment(5)
@Yeta Basically, such files do not contain glyph-to-character mapping information and at the same time use non-standard (non-ASCII-ish) encodings; in the absence of proper glyph-to-character mapping information, many text extractors assume some standard encoding and try extracting anyway. Whenever this assumption fails, garbage is the result.Genitalia
I've upvoted your answer as well as supplemented it with some info. I hope this is acceptable to you :-)Leila
I've also voted to re-open the OP (which was closed for some obscure reason).Leila
@KurtPfeifle Sure, thanks for the supplement.Alegre
How, in this case, would you extract the content or extract fonts and then apply them to content extracted without fonts?Rooftop

I had the same problem. Uploading the file to Google Drive, opening it with Google Docs and copying the text from there worked for me.
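
If you need to script this workaround instead of doing it by hand, one possible sketch uses the Google Drive API: uploading the PDF while requesting a Google Docs target MIME type makes Drive run its own conversion/OCR, and the resulting document can then be exported as plain text. The client library, MIME types and credential handling below are assumptions on my part, not something this answer relies on:

    # Sketch of automating the Drive/Docs workaround via the Google Drive API v3.
    # Assumes the google-api-python-client package and valid OAuth credentials;
    # obtaining `creds` is out of scope here and left as a placeholder.
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    def extract_via_drive(creds, pdf_path: str) -> str:
        service = build("drive", "v3", credentials=creds)
        # Requesting a Google Docs MIME type makes Drive convert (and OCR) the upload.
        metadata = {"name": "garbled-pdf",
                    "mimeType": "application/vnd.google-apps.document"}
        media = MediaFileUpload(pdf_path, mimetype="application/pdf")
        doc = service.files().create(body=metadata, media_body=media,
                                     fields="id").execute()
        # Export the converted document as plain text.
        data = service.files().export(fileId=doc["id"],
                                      mimeType="text/plain").execute()
        return data.decode("utf-8")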

Antimalarial answered 19/11, 2014 at 9:46 Comment(7)
A simpler solution: drag the PDF into a Chrome window. You can copy out the text - at least I couldAfterimage
Worked for me. This answer seems more practical than the OCR answers (except when building some sort of automation). (The Chrome method from gsziszi did not work for me.)Gamecock
@Afterimage Could you please make your comment an answer? It works and is obviously more practical than using OCR. Thanks!Knowle
As this question is closed, it is not possible to add more answersAfterimage
Not working for me the way you described. I tried both opening the file in a Chrome window and uploading it to Google Drive and opening it from thereBeestings
Unfortunately that's not the same problem as the original question. The real problem is described here: forums.adobe.com/thread/915012Vandusen
Google Drive + Google Docs worked for me!Disject
