Extract images from PDF, how to handle JBIG2 encoded

I have a bunch of PDF files. Some of them are pure text, but some are fully or partially saved as "one image per page" because they were generated by a scanner.

I need to extract all images contained in the PDF and then examine each image separately.

I was able to extract most of the images with a Python script found here on SO; see the question:

Extract images from PDF without resampling, in python?

Some of the included images were encoded using JBIG2, and I could not find any Python or other tool to convert JBIG2 into something that can be easily opened with a generic graphics tool.

Eupepsia answered 25/3, 2020 at 14:40 Comment(0)

Well, I have been struggling with this for many weeks. Many answers from SO helped me along, but there was always something missing; apparently no one here has ever had problems with JBIG2-encoded images.

In the bunch of PDFs that I have to scan, JBIG2-encoded images are very common.

As far as I understand, many copy/scan machines scan paper and produce PDF files full of JBIG2-encoded images.

So after many days of tests I decided to go with the answer proposed here by dkagedal a long time ago.

Here is my step-by-step on Linux (if you have another OS, I suggest using a Linux Docker container; it's going to be much easier):

First step:

apt-get install poppler-utils

Then I was able to run a command-line tool called pdfimages like this:

pdfimages -all myfile.pdf ./images_found/

With the above command you will extract all the images contained in myfile.pdf and save them inside images_found (you have to create images_found beforehand).
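If you want to drive this step from Python, here is a minimal sketch of the same call. It assumes pdfimages is installed and on the PATH; the function names and the optional `prefix` argument are my own, not part of poppler. With the default empty prefix the image root is `images_found/`, exactly as in the command above.

```python
import os
import subprocess


def build_pdfimages_cmd(pdf_path, out_dir, prefix=""):
    """Return the argv list for `pdfimages -all`.

    `prefix` is a hypothetical convenience: pdfimages names its output
    <image-root>-NNN.<ext>, so a non-empty prefix such as "img" avoids
    file names that begin with a dash.
    """
    image_root = os.path.join(str(out_dir), prefix)
    return ["pdfimages", "-all", str(pdf_path), image_root]


def extract_images(pdf_path, out_dir, prefix=""):
    """Create out_dir if needed, then run pdfimages (must be installed)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_pdfimages_cmd(pdf_path, out_dir, prefix), check=True)
```

Calling `extract_images("myfile.pdf", "images_found")` should behave like the command line above, with the directory created for you.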

In the list you could find several types of images (depending on your PDF), like png, jpg, tiff; all of these are easily readable with any graphics tool.

Then you will have some files named like -145.jb2e and -145.jb2g.

These 2 files contain ONE IMAGE encoded in JBIG2, saved in 2 different files: the .jb2g file holds the global (header-like) data and the .jb2e file holds the embedded image data.
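When there are many of these, matching each .jb2e file with its .jb2g partner by hand gets tedious. A small sketch that pairs them by file name follows; the function name is hypothetical, and it assumes a .jb2e file may turn up without a matching .jb2g, in which case the first element of the pair is None.

```python
from pathlib import Path


def pair_jbig2_files(directory):
    """Return a list of (globals_file_or_None, embedded_file) tuples.

    Files are matched purely by name: -145.jb2e goes with -145.jb2g
    when the latter exists in the same directory.
    """
    pairs = []
    for embedded in sorted(Path(directory).glob("*.jb2e")):
        globals_file = embedded.with_suffix(".jb2g")
        pairs.append((globals_file if globals_file.exists() else None, embedded))
    return pairs
```

Each returned pair is then one image to decode.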

Again, I lost many days trying to find out how to convert those files into something readable, and finally I came across a tool called jbig2dec.

So first you need to install this magic tool:

apt-get install jbig2dec

Then you can run (the ./ prefix keeps file names that start with a dash from being parsed as options):

jbig2dec -t png ./-145.jb2g ./-145.jb2e
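The same call can be built from Python. This is only a sketch under a few assumptions: jbig2dec is installed, both the globals and the embedded file exist, and the helper names are mine. It passes `-o` to name the output file explicitly and guards against file names that start with a dash.

```python
import subprocess
from pathlib import Path


def dash_safe(path):
    """Prefix ./ to a path that would otherwise look like an option."""
    text = str(path)
    return "./" + text if text.startswith("-") else text


def build_jbig2dec_cmd(globals_file, embedded_file, fmt="png"):
    """Return the argv list for decoding one globals/embedded pair."""
    # Output goes next to the embedded file, with the extension swapped.
    out_file = Path(embedded_file).with_suffix("." + fmt)
    return [
        "jbig2dec",
        "-t", fmt,
        "-o", dash_safe(out_file),
        dash_safe(globals_file),
        dash_safe(embedded_file),
    ]


def decode_pair(globals_file, embedded_file, fmt="png"):
    """Run jbig2dec (must be installed) on one pair of files."""
    subprocess.run(build_jbig2dec_cmd(globals_file, embedded_file, fmt), check=True)
```

If your jbig2dec build does not write PNG, passing `fmt="pbm"` asks for PBM output instead, which most image tools can still open.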

You will finally get all the extracted images converted into something useful.

Good luck!

Eupepsia answered 25/3, 2020 at 14:40 Comment(1)
Does -t png really work? With the file utility I get "Netpbm image data, size = 902 x 1523, rawbits, bitmap". That is far more usable, but it seems that type png isn't emitted; I get -145.pbm. - Inroad

You can try this: https://github.com/Charltsing/JBIG2Viewer

It can load and save JBIG2 images.

Theodolite answered 1/8, 2023 at 3:36 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Natalienatalina
