Text layout recognition with Python

Asked 11/7, 2011 at 20:18 Answered 11/7, 2011 at 22:29

Solved python image-processing ocr document-layout-analysis

I'm trying to go through several thousand scanned files and sort them into folders based on type (i.e., if a file is a scanned copy of formA, it should go in the formA folder; if it's a scanned copy of formB, it should go in the formB folder, and so on). I feel like the best way to match files to types is by their text outlines, but I'm totally new to image processing, so if there's a better solution, I'm all ears.

I'm working in Python. Any ideas on the best way to do this? PIL? OpenCV? ImageMagick?

Thanks in advance...

Languorous answered 11/7, 2011 at 20:18 Comment(0)

This library is probably of interest to you:
http://code.google.com/p/ocropus/
It's made by Googlers and lets you do OCR and layout analysis from Python.
I had some trouble installing it, but that was quite a while back, so things may have been fixed by now.

Soundboard answered 11/7, 2011 at 20:23 Comment(1)

That may be an option! Thanks for the input! – Languorous 11/7, 2011 at 20:28

I don't know what format your scanned documents are in, but pdfminer can do layout analysis for PDFs. I guess it would fit the bill for your purpose, provided you have the documents in a reasonably decent PDF format (if you've just got "pure images", it won't do you any good).
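To make that a bit more concrete, here is a rough sketch of how it could look, assuming the `pdfminer.six` package and PDFs that actually contain a text layer. The helper names (`text_boxes`, `grid_signature`, `classify`, `sort_pdf`), the 8x8 grid, and the 0.6 threshold are all made up for illustration and would need tuning on real forms. The idea is to reduce the first page to the set of grid cells covered by text, then pick the known form whose cells overlap the most.

```python
import shutil
from pathlib import Path


def text_boxes(pdf_path):
    """Return text block boxes on page 1 as fractions of the page size."""
    # Imported lazily so the pure-Python helpers below work without pdfminer.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    page = next(iter(extract_pages(str(pdf_path), maxpages=1)))
    w, h = page.width, page.height
    return [
        (el.x0 / w, el.y0 / h, el.x1 / w, el.y1 / h)
        for el in page
        if isinstance(el, LTTextContainer)
    ]


def grid_signature(boxes, rows=8, cols=8):
    """Return the set of (row, col) grid cells touched by any text box."""
    def cell(value, n):
        # Clamp so coordinates on or past the page edge stay inside the grid.
        return max(0, min(int(value * n), n - 1))

    cells = set()
    for x0, y0, x1, y1 in boxes:
        for r in range(cell(y0, rows), cell(y1, rows) + 1):
            for c in range(cell(x0, cols), cell(x1, cols) + 1):
                cells.add((r, c))
    return frozenset(cells)


def jaccard(a, b):
    """Overlap of two cell sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(signature, templates, threshold=0.6):
    """Return the best-matching template name, or None if nothing is close."""
    best_name, best_score = None, 0.0
    for name, template in templates.items():
        score = jaccard(signature, template)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def sort_pdf(pdf_path, templates, out_dir):
    """Move one PDF into out_dir/<form name>, or out_dir/unknown."""
    name = classify(grid_signature(text_boxes(pdf_path)), templates)
    target = Path(out_dir) / (name or "unknown")
    target.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf_path), str(target / Path(pdf_path).name))
    return name
```

You would build `templates` once from one clean example of each form, e.g. `{"formA": grid_signature(text_boxes("formA_sample.pdf")), ...}` (file names hypothetical), and then call `sort_pdf` on every file. Anything that lands in `unknown` is worth checking by hand before trusting the threshold.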

Bung answered 11/7, 2011 at 22:29 Comment(0)
