Programmatically divide scanned images into separate images

To improve OCR quality, I need to preprocess my scanned images. Sometimes a single scan contains several components at different angles (for example, a few paper documents scanned at one time), like this:

enter image description here

Is it possible to automatically divide such an image into separate images, each containing one logical document, for example with a tool like ImageMagick? Are there any existing solutions or techniques for this problem?

Coquillage answered 1/2, 2018 at 7:39 Comment(4)
If you have more than 50 images and many different combinations of separate images, you can try an ML-powered solution, something like app.nanonets.com/ObjectCategorySelection - Calvillo
I use OpenCV for image processing. To separate the first image: erode, threshold, findContours, and rotate if necessary. Then I get these detections and these crops. But the text in the image is too small to do OCR. Your updated image is even worse for image processing. - Eolith
It is just a sample to describe the issue. The original images have slightly better quality. - Coquillage
@Silencer could you please show the code? - Coquillage

In ImageMagick 6, you can blur the image enough that the text overlaps and threshold so that the text boxes are each one large black region on a white background. Then you can use connected-components to find each separate black gray(0) region and its bounding box. Then crop the original image for each such region using the bounding box values.

Input:

enter image description here

Unix Syntax (adjust the blur to be just large enough to keep the text regions solid black):

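infile="image.png"
# Base name of the input, without directory or extension
inname=$(convert -ping "$infile" -format "%t" info:)
OLDIFS=$IFS
IFS=$'\n'
# Blur and threshold so each text block becomes one solid black region,
# save that mask as tmp.png, then list the connected components
# (one per line, with the header line and leading spaces removed)
arr=($(convert "$infile" -blur 0x5 -auto-level -threshold 99% -type bilevel +write tmp.png \
    -define connected-components:verbose=true \
    -connected-components 8 \
    null: | tail -n +2 | sed 's/^[ ]*//'))
num=${#arr[*]}
IFS=$OLDIFS
for ((i=0; i<num; i++)); do
    #echo "${arr[$i]}"
    # Field 2 is the bounding box, field 5 is the color
    color=$(echo ${arr[$i]} | cut -d' ' -f5)
    bbox=$(echo ${arr[$i]} | cut -d' ' -f2)
    echo "color=$color; bbox=$bbox"
    # Crop the original image only for black regions (the text blocks)
    if [ "$color" = "gray(0)" ]; then
        convert "$infile" -crop "$bbox" +repage -fuzz 10% -trim +repage "${inname}_$i.png"
    fi
done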
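
As noted in the comments below, ImageMagick 7 uses magick as the command name instead of convert. A minimal sketch for choosing whichever command is installed, assuming both accept the same arguments for this script; pick_im_cmd and IM_CMD are hypothetical names introduced here, not part of ImageMagick:

```shell
# Print the name of the ImageMagick command found on PATH,
# preferring the ImageMagick 7 name. Prints nothing if neither exists.
pick_im_cmd() {
    if command -v magick >/dev/null 2>&1; then
        echo "magick"
    elif command -v convert >/dev/null 2>&1; then
        echo "convert"
    fi
}

IM_CMD=$(pick_im_cmd)
if [ -z "$IM_CMD" ]; then
    echo "ImageMagick not found on PATH" >&2
else
    echo "Using: $IM_CMD"
fi
```

The script above could then call "$IM_CMD" wherever it currently calls convert.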


Textual Listing:

color=gray(255); bbox=892x1008+0+0
color=gray(0); bbox=337x430+36+13
color=gray(0); bbox=430x337+266+630
color=gray(0); bbox=202x147+506+252
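
On noisier scans the listing can also contain tiny regions (specks or scanner artifacts) that are not documents. One possible way to discard them, sketched here as an assumption rather than as part of the original script, is to keep only boxes above a minimum area. keep_large_boxes and the cutoff of 1000 pixels are hypothetical and would need tuning per image:

```shell
# Read geometries of the form WxH+X+Y (one per line) on stdin and print
# only those whose area W*H is at least the given minimum.
keep_large_boxes() (
    min_area=$1
    while IFS= read -r geom; do
        size=${geom%%+*}   # "WxH"
        w=${size%x*}
        h=${size#*x}
        # Skip lines whose width or height is not a plain number
        case "$w" in ''|*[!0-9]*) continue ;; esac
        case "$h" in ''|*[!0-9]*) continue ;; esac
        if [ $((w * h)) -ge "$min_area" ]; then
            printf '%s\n' "$geom"
        fi
    done
)

# Example with boxes taken from the listings on this page
printf '%s\n' 337x430+36+13 8x16+154+222 202x147+506+252 5x2+580+383 |
    keep_large_boxes 1000
```

With this cutoff the example keeps 337x430+36+13 and 202x147+506+252 and drops the two small boxes.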

tmp.png showing the blurred and thresholded regions:

enter image description here

Cropped Images:

enter image description here

enter image description here

enter image description here

Wamsley answered 1/2, 2018 at 8:35 Comment(8)
Thanks for your answer! I have installed ImageMagick 6.8.9-9 Q16 x86_64 2017-07-31 on my Ubuntu 16. Your script runs without any errors, but the "tmp.png" file contains only a black background and that's it. What am I doing wrong? - Coquillage
This is the output: color=; bbox=892x1008+0+0 color=; bbox=316x409+46+23 color=; bbox=409x316+277+640 color=; bbox=183x126+516+263 color=; bbox=8x16+154+222 color=; bbox=16x8+471+748 color=; bbox=8x9+680+376 color=; bbox=8x7+178+221 color=; bbox=7x8+481+772 color=; bbox=3x5+93+321 color=; bbox=5x3+383+687 color=; bbox=5x2+580+383 color=; bbox=5x2+565+383 - Coquillage
Do I need to install any additional libs/scripts for IM to get your script working? - Coquillage
I have added another image with scanning artifacts. Will this approach also work on such images? - Coquillage
It will only work if the sections of text are far enough apart that the blur does not merge them into the same object. You need ImageMagick 6.8.9.10 or higher to use connected components, so perhaps you need to upgrade. You may have a preliminary version that was not fully functioning. Try this command stand-alone: convert image.png -blur 0x5 -auto-level -threshold 99% -type bilevel tmp.png. Does that look the same? Are you on Windows? If so, Windows needs % escaped as %%. Also, my looping code is only for Unix. So what is your platform? - Wamsley
I'm trying it on Ubuntu 16. I tried to install IM7 as described here gist.github.com/marcinwol/6c4a713de517fb2ae89f5dd5be0e0ca4 but after installation your script fails with the following error: convert: no decode delegate for this image format 'PNG' @ error/constitute.c/ReadImage/509 What am I doing wrong and how do I fix it? Thanks! - Coquillage
In ImageMagick 7, convert is replaced with magick as the name of the command. So try magick image.png -blur 0x5 -auto-level -threshold 99% -type bilevel tmp.png - Wamsley
Finally, I got it working on my Ubuntu and IM7 with your original script! The first results are really impressive! Thank you very much! I'll continue to test it tomorrow. Thank you very much for your help! - Coquillage

alexanoid wrote: I have added another image with scanning artifacts. Will this approach work on such images also?

No, it will not work well, for several reasons. The second image you provided is much larger than the first, so it needs a much larger blur. It is a JPG and has artifacts in it. JPG is not a good format here, since 'constant' regions of the image are not really constant. The blur will pick up your artifacts, and a different threshold is needed to remove some of them. In your case, the top of the image has a good-sized artifact that will get caught as an object. Finally, the bounding boxes of your blurred and thresholded text regions overlap even though the regions themselves do not touch. Thus one crop may include text from other regions.
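
The bounding-box overlap can be checked before cropping. A minimal sketch, assuming well-formed WxH+X+Y geometries with non-negative offsets as printed by the script in the earlier answer; boxes_overlap is a hypothetical helper, not an ImageMagick feature:

```shell
# Return success (0) if two WxH+X+Y boxes intersect, non-zero otherwise.
boxes_overlap() (
    a=$1
    b=$2
    # Split each geometry into width, height, x offset and y offset
    w1=${a%%x*}; r=${a#*x}; h1=${r%%+*}; r=${r#*+}; x1=${r%%+*}; y1=${r#*+}
    w2=${b%%x*}; r=${b#*x}; h2=${r%%+*}; r=${r#*+}; x2=${r%%+*}; y2=${r#*+}
    # Two rectangles intersect when their extents overlap on both axes
    [ "$x1" -lt $((x2 + w2)) ] && [ "$x2" -lt $((x1 + w1)) ] &&
    [ "$y1" -lt $((y2 + h2)) ] && [ "$y2" -lt $((y1 + h1)) ]
)

# Two boxes from the first image's listing
if boxes_overlap 337x430+36+13 202x147+506+252; then
    echo "overlap"
else
    echo "no overlap"
fi
```

For the two boxes shown, taken from the first image's listing, this prints "no overlap". A pair that does report an overlap is a sign that a plain rectangular crop of one region may include text from the other.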

Here is my test command to blur and threshold your image:

convert image.jpg -blur 0x50 -auto-level -threshold 95% -type bilevel tmp.png

enter image description here

Wamsley answered 1/2, 2018 at 18:3 Comment(2)
Thanks! I really appreciate your help! What format should I use to prepare and then OCR the images? Should I use, for example, PNG instead of JPG? - Coquillage
Generally, do not use any lossy compressed format such as JPG. PNG and TIFF are fine. But the main issue is that the file was scanned and picked up imperfections from the paper or glass. Also, the text regions were too close together considering the resolution of the image and the large blur that was needed. - Wamsley
