Programmatically divide scanned images into separate images

Asked 1/2, 2018 at 7:39 Answered 1/2, 2018 at 18:3

Solved image-processing imagemagick ocr image-preprocessing

To improve OCR quality, I need to preprocess my scanned images. Sometimes I need to OCR an image that contains several pictures (separate components on the page at different angles, such as several paper documents scanned at once), for example:

Is it possible to programmatically split such images into separate images, one per logical document? For example, with a tool like ImageMagick or something else? Are there any existing solutions or techniques for this problem?

Coquillage answered 1/2, 2018 at 7:39 Comment(4)

If you have > 50 images and a lot of different combinations of separate images, you can try an ML-powered solution. Something like app.nanonets.com/ObjectCategorySelection – Calvillo 1/2, 2018 at 9:10

I use OpenCV for image processing. To separate the first image: erode, threshold, findContours, and rotate if necessary. Then I get these detections and these crops. But the text in the image is too small to do OCR. Your updated image is even worse for image processing. – Eolith 2/2, 2018 at 17:16

It is just a sample in order to describe the issue. The original images have a little bit better quality. – Coquillage 2/2, 2018 at 17:18

@Silencer could you please show the code? – Coquillage 2/2, 2018 at 17:19

In ImageMagick 6, you can blur the image enough that the text overlaps, then threshold it so that each text box becomes one large black region on a white background. Then you can use connected-components to find each separate black, gray(0), region and its bounding box. Finally, crop the original image for each such region using the bounding box values.

Input:

Unix Syntax (adjust the blur to be just large enough to keep the text regions solid black):

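infile="image.png"
# Base name of the input file (no directory or extension), used to name the crops.
inname=$(convert -ping "$infile" -format "%t" info:)
# Split the command output on newlines only, so each component line is one array element.
OLDIFS=$IFS
IFS=$'\n'
# Blur and threshold so each text block becomes one black blob (saved as tmp.png for inspection),
# then list the connected components; drop the header line and strip leading spaces.
arr=($(convert "$infile" -blur 0x5 -auto-level -threshold 99% -type bilevel +write tmp.png \
    -define connected-components:verbose=true \
    -connected-components 8 \
    null: | tail -n +2 | sed 's/^[ ]*//'))
num=${#arr[@]}
IFS=$OLDIFS
for ((i=0; i<num; i++)); do
    #echo "${arr[$i]}"
    # Field 2 is the bounding box (WxH+X+Y); field 5 is the region color.
    color=$(echo ${arr[$i]} | cut -d' ' -f5)
    bbox=$(echo ${arr[$i]} | cut -d' ' -f2)
    echo "color=$color; bbox=$bbox"
    # Crop the original image only for black (text) regions, then trim the border.
    if [ "$color" = "gray(0)" ]; then
        convert "$infile" -crop "$bbox" +repage -fuzz 10% -trim +repage "${inname}_$i.png"
    fi
done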

Textual Listing:

color=gray(255); bbox=892x1008+0+0
color=gray(0); bbox=337x430+36+13
color=gray(0); bbox=430x337+266+630
color=gray(0); bbox=202x147+506+252
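
The bbox values in the listing are ImageMagick geometry strings of the form WxH+X+Y. If a run also reports tiny regions (specks of noise), one option is to skip them by area before cropping. The following is only a sketch: the helper names bbox_area and is_large_region and the default threshold of 1000 pixels are assumptions, not part of the script above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the original script).

# Print the pixel area of an ImageMagick geometry string WxH+X+Y.
bbox_area() {
    echo "$1" | awk -F'[x+]' '{ print $1 * $2 }'
}

# Succeed when the region covers at least min_area pixels.
# The default of 1000 is an arbitrary guess; tune it for your scans.
is_large_region() {
    [ "$(bbox_area "$1")" -ge "${2:-1000}" ]
}
```

Inside the loop, the crop condition could then become: if [ "$color" = "gray(0)" ] && is_large_region "$bbox"; then ...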

tmp.png showing the blurred and thresholded regions:

Cropped Images:

Wamsley answered 1/2, 2018 at 8:35 Comment(8)

Thanks for your answer! I have installed ImageMagick 6.8.9-9 Q16 x86_64 2017-07-31 on my Ubuntu 16. Your scripts work without any errors but the "tmp.png" file contains only black background and that's it. What am I doing wrong? – Coquillage 1/2, 2018 at 9:0

this is the output:

color=; bbox=892x1008+0+0
color=; bbox=316x409+46+23
color=; bbox=409x316+277+640
color=; bbox=183x126+516+263
color=; bbox=8x16+154+222
color=; bbox=16x8+471+748
color=; bbox=8x9+680+376
color=; bbox=8x7+178+221
color=; bbox=7x8+481+772
color=; bbox=3x5+93+321
color=; bbox=5x3+383+687
color=; bbox=5x2+580+383
color=; bbox=5x2+565+383

– Coquillage 1/2, 2018 at 9:6

Do I need to install any additional libs/scripts for IM in order to get your script working? – Coquillage 1/2, 2018 at 10:19

I have added another image with scanning artifacts. Will this approach work on such images also? – Coquillage 1/2, 2018 at 10:45

It will only work if the sections of text are far enough apart that the blur does not merge them into the same object. You need ImageMagick 6.8.9.10 or higher to use connected components, so perhaps you need to upgrade; you may have a preliminary version that was not fully functioning. Try this command on its own: convert image.png -blur 0x5 -auto-level -threshold 99% -type bilevel tmp.png. Does the result look the same? Are you on Windows? If so, Windows needs % escaped as %%. Also, my looping code is for Unix only. So what is your platform? – Wamsley 1/2, 2018 at 17:44
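
The version requirement in the comment above can be checked from the shell. This is a minimal sketch, assuming a sort that supports -V (version sort, as in GNU coreutils); the helper name version_at_least is hypothetical, and reading the version from the third field of the first line of convert -version is an assumption about the banner layout.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Patch levels written with a dash (6.8.9-9) are normalised to dots first.
# Assumes `sort -V` is available.
version_at_least() {
    val_have=$(printf '%s' "$1" | tr '-' '.')
    val_need=$(printf '%s' "$2" | tr '-' '.')
    [ "$(printf '%s\n%s\n' "$val_need" "$val_have" | sort -V | head -n 1)" = "$val_need" ]
}

# Example (needs ImageMagick on PATH, so it is left commented out):
# ver=$(convert -version | awk 'NR==1 { print $3 }')
# version_at_least "$ver" 6.8.9-10 || echo "upgrade needed for -connected-components"
```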

I'm trying it on Ubuntu 16. I tried to install IM7 as described here gist.github.com/marcinwol/6c4a713de517fb2ae89f5dd5be0e0ca4 but after installation your script fails with the following error: convert: no decode delegate for this image format 'PNG' @ error/constitute.c/ReadImage/509 What am I doing wrong and how do I fix it? Thanks! – Coquillage 1/2, 2018 at 20:17

In ImageMagick 7, convert is replaced with magick as the name of the command. So try magick image.png -blur 0x5 -auto-level -threshold 99% -type bilevel tmp.png – Wamsley 1/2, 2018 at 20:59
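
To let one script run under either major version, the command name can be picked at run time. A minimal sketch; the helper name im_command is hypothetical and it only checks which executable is on PATH, not whether that build has PNG support.

```shell
#!/bin/sh
# Hypothetical helper: print the ImageMagick command available on this system,
# preferring the ImageMagick 7 name. Returns non-zero if neither is found.
im_command() {
    if command -v magick >/dev/null 2>&1; then
        echo magick
    elif command -v convert >/dev/null 2>&1; then
        echo convert
    else
        return 1
    fi
}

# Example (left commented out because it needs ImageMagick and image.png):
# im=$(im_command) && "$im" image.png -blur 0x5 -auto-level -threshold 99% -type bilevel tmp.png
```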

Finally, I got it working on my Ubuntu and IM7 with your original script! The first results are really impressive! Thank you very much! I'll continue to test it tomorrow. Thank you very much for your help !!! – Coquillage 1/2, 2018 at 21:4

alexanoid wrote: I have added another image with scanning artifacts. Will this approach work on such images also?

No, it will not work well, for several reasons. The second image you provided is much larger than the first, so it would need a much larger blur. It is a JPG and has artifacts in it. JPG is not a good format, since the 'constant' regions of the image are not really constant. The blur will pick up your artifacts, and you will need a different threshold to remove some of them. In your case, the top of the image has a good-sized artifact that will get caught as an object. Finally, the bounding boxes of your blurred and thresholded text regions overlap even though the regions themselves do not touch, so one crop may include text from other regions.

Here is my test command to blur and threshold your image:

convert image.jpg -blur 0x50 -auto-level -threshold 95% -type bilevel tmp.png
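
The overlap problem described above can be detected before cropping by comparing the reported bounding boxes pairwise. This is a sketch under assumptions: the helper name bboxes_overlap is hypothetical, and it expects geometry strings of the form WxH+X+Y with non-negative offsets, as printed by the connected-components listing.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when two WxH+X+Y bounding boxes share any pixels.
# Assumes non-negative X and Y offsets.
bboxes_overlap() {
    # Split both geometries into eight numbers: w1 h1 x1 y1 w2 h2 x2 y2.
    set -- $(echo "$1" | sed 's/[x+]/ /g') $(echo "$2" | sed 's/[x+]/ /g')
    w1=$1 h1=$2 x1=$3 y1=$4
    w2=$5 h2=$6 x2=$7 y2=$8
    # Two rectangles overlap only if their ranges intersect on both axes.
    [ "$x1" -lt $((x2 + w2)) ] && [ "$x2" -lt $((x1 + w1)) ] &&
    [ "$y1" -lt $((y2 + h2)) ] && [ "$y2" -lt $((y1 + h1)) ]
}
```

For the three text regions listed for the first image, no pair overlaps, which is consistent with the crops there coming out clean.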

Wamsley answered 1/2, 2018 at 18:3 Comment(2)

Thanks ! I really appreciate your help! What format should I use in order to prepare and then OCR the images ? Should I use for example PNG instead of JPG or what ? – Coquillage 1/2, 2018 at 20:21

Generally do not use any lossy compressed format such as JPG. PNG and TIFF are fine. But the main issue is that the file was scanned and picked up imperfections from the paper or glass. Also the text regions were too close together considering the resolution of the image and the large blur that was needed. – Wamsley 1/2, 2018 at 21:1
