Where can I find a mapping of Identity-H encoded characters to ASCII or Unicode characters?

I have a PDF generated by a third party. I am trying to get the text out of it, but neither pdf2text nor copying and pasting results in readable text. After a little digging in the output (of either of the two) I found that each character on the screen is made up of three bytes. For example, "A" is the bytes EF, 81, and 81. Looking at the metadata, the PDF claims to be encoded in Identity-H, so I assume what I am seeing is a set of characters encoded in Identity-H. I have a partial mapping based on the documents I already have, but I want to make a more complete mapping. To do that I need something like an ASCII table for Identity-H.
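For what it's worth, the bytes EF 81 81 decode as UTF-8 to U+F041, which is 0xF000 + 0x41, i.e. the codepoint for "A" shifted into the Private Use Area. If that pattern holds for the rest of the text (an assumption based on this single character), the remapping is mechanical:

def decode_pua_text(raw_bytes):
    # Interpret the extractor's output as UTF-8; each visible character
    # then comes out as a single Private Use Area codepoint.
    text = raw_bytes.decode("utf-8")
    chars = []
    for ch in text:
        cp = ord(ch)
        if 0xF000 <= cp <= 0xF0FF:
            # Shift PUA codepoints back down to their ASCII/Latin-1 values
            chars.append(chr(cp - 0xF000))
        else:
            chars.append(ch)
    return "".join(chars)

print(decode_pua_text(b"\xef\x81\x81"))  # -> "A"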

Weathered answered 19/6, 2013 at 14:23 Comment(3)
Not an immediate solution, but take a look at the CID (Identity-H) documents partners.adobe.com/public/developer/en/font/… and adobe.com/content/dam/Adobe/en/devnet/font/pdfs/… – Tsarism
If you have a PDF with a font using Identity-H, a /ToUnicode map is required in the PDF for text extraction. Cf. Section 9.10.2, Mapping Character Codes to Unicode Values, of ISO 32000-1:2008 (a quick check for this entry is sketched after these comments). – Representative
Hi, check out my question about this: #22431715 – Deerstalker
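A quick way to check mkl's point about the /ToUnicode map is to walk a page's font resources, for example with PyPDF2. This is a rough sketch: the filename is a placeholder, only the first page is inspected, and inherited or nested resources are not handled.

from PyPDF2 import PdfReader

reader = PdfReader("input.pdf")  # placeholder path
fonts = reader.pages[0]["/Resources"]["/Font"]
for name, ref in fonts.items():
    font = ref.get_object()  # resolve the indirect reference
    print(name, font.get("/Encoding"), "/ToUnicode" in font)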

It is not always possible to extract text from a PDF, especially when the /ToUnicode map is missing, as pointed out by mkl.

If it is not possible to cut and paste the correct text from Acrobat, then you will have very little chance of extracting the text yourself. If Acrobat cannot extract it, it is very unlikely that any other tool can extract the text correctly.

If you manually create an encoding table, you could use it to remap the extracted characters to their correct values, but this will most likely only work for this one document.
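As a sketch of that remapping (the table entries below are illustrative placeholders, not values from any real document), Python's str.translate can apply a hand-built table to the extracted text:

# Hypothetical hand-built table: extracted codepoint -> real character
manual_map = {
    0xF041: "A",  # fill these in from documents you can verify
    0xF042: "B",
}
table = str.maketrans(manual_map)
extracted = "\uf041\uf042"
print(extracted.translate(table))  # -> "AB"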

Often this is done on purpose. I have seen documents that randomly remap characters differently for each font in the document. It is used as a form of obfuscation, and the only real way to extract text from these PDFs is to resort to OCR. Many financial reports use this type of trick to stop people from extracting their data.

Also, Identity-H is just a 1:1 character mapping for all characters from 0x0000 to 0xFFFF, i.e. Identity-H really is an identity mapping.
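A sketch of what that means in practice: the 2-byte, big-endian character codes in the content stream pass straight through as CIDs (the example bytes are illustrative):

def identity_h_cids(string_bytes):
    # Identity-H: each character code is two bytes, big-endian, and the
    # resulting CID is numerically identical to the code.
    return [int.from_bytes(string_bytes[i:i + 2], "big")
            for i in range(0, len(string_bytes), 2)]

print(identity_h_cids(b"\x00\x41\x00\x42"))  # -> [65, 66]

Note that these CIDs identify glyphs, not characters; without a /ToUnicode map there is no reliable way back from a CID to Unicode.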

Your real problem is the missing /ToUnicode entry in this PDF. I suspect there is also an embedded CMap in your PDF, which would explain why there can be three bytes per character.

Refugiorefulgence answered 15/7, 2013 at 8:7 Comment(1)
So, basically I have to do what I have already done: create the mapping myself. Luckily all of the PDFs this agency is producing seem to use the same setup, so I doubt it is intentional (or if it is, they aren't very good at being obscure). – Weathered

This isn't a direct answer to the question, but rather an OCR-based workaround that might be useful to people who come across this problem and cannot rectify it any other way:

from pdf2image import convert_from_path
from pytesseract import image_to_pdf_or_hocr
from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO

def pdf_ocr_fix(input_pdf_path, output_pdf_path):
    # Convert PDF file to a list of images
    images = convert_from_path(input_pdf_path)
    
    # Prepare a PDF writer to combine PDF pages
    writer = PdfWriter()

    # Process each image with Tesseract and save as PDF
    for img in images:
        # Convert image to a PDF byte stream
        pdf_bytes = image_to_pdf_or_hocr(img, extension='pdf')
        # Use BytesIO to treat bytes as a file
        pdf_stream = BytesIO(pdf_bytes) # type: ignore
        reader = PdfReader(pdf_stream)

        # Append pages from this PDF to the output PDF
        for page in reader.pages:
            writer.add_page(page)

    # Write the combined pages to the output file
    with open(output_pdf_path, 'wb') as f_out:
        writer.write(f_out)
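Assuming Tesseract and Poppler are installed (pytesseract and pdf2image shell out to them), usage is simply (paths are placeholders):

pdf_ocr_fix("scanned_input.pdf", "searchable_output.pdf")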

Expectancy answered 26/6 at 5:32 Comment(0)