How to convert/match a handwritten list of names? (HWR)

I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.

My idea was to use Tesseract to parse an image of the names, then use the Levenshtein distance to compare each line against the list of names in my database. If a line is a reasonably close match to a known name, I would accept that name.

Does this approach sound like a good one? If not, other ideas?

I tried using tesseract on a sample sheet (see below)

[Image: sample sign-in sheet]

I used:

tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

I am assuming it didn't like line 2 because I went below the line.

The results I got were:

1.. AM: (harm;

l. ’E (J 22 a 00k

2‘ wau \\) [HQ

4. KIM TAYLOE
5. LN] Davis

6‘ Mzflé! Ha K

Obviously not the greatest. My guess is that the distance matching would work for lines 4 and 5, but the rest are not even close.

I have control over my sign-in sheet, but not over the handwriting of the people signing in, so if there are any changes I can make to the sheet to help, please let me know.

Garganey answered 14/11, 2017 at 21:18 Comment(2)
Possible duplicate of #39556943 - Landlordism
Thanks Mike. I am going to make my description a bit more generic. Those links helped; I guess my general question is whether this is possible and, if so, how to do it. - Garganey

Since your goal is to get names only, I would suggest restricting tessedit_char_whitelist to uppercase English letters, digits, and the period ("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.") so that the output does not contain characters you don't expect, such as \\) [ .

Your initial approach of calculating the Levenshtein distance is fine, provided you succeed in extracting text from the handwritten image (which is a hard task for Tesseract).
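
As a rough illustration of the matching step, here is a minimal Python sketch using only the standard library. The function names, the letters-only normalization, and the 0.4 cutoff are my own assumptions for illustration, and the names in the usage note are hypothetical, not taken from your database.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Keep letters only, upper-cased, dropping list numbers and OCR noise."""
    return "".join(ch for ch in text.upper() if ch.isalpha())


def best_match(ocr_line, known_names, max_ratio=0.4):
    """Return the closest known name, or None if nothing is close enough.

    max_ratio is an assumed cutoff: edit distance divided by the length
    of the longer normalized string. Tune it on real sheets.
    """
    target = normalize(ocr_line)
    if not target:
        return None
    best_name, best_score = None, None
    for name in known_names:
        candidate = normalize(name)
        if not candidate:
            continue
        score = levenshtein(target, candidate) / max(len(target), len(candidate))
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    if best_score is not None and best_score <= max_ratio:
        return best_name
    return None
```

With a hypothetical roster such as ["Kim Taylor", "Lyn Davis"], a call like best_match("4. KIM TAYLOE", roster) picks "Kim Taylor", while a badly garbled line falls above the cutoff and comes back as None so it can be checked by hand.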

I would also suggest running some preprocessing on your image. For example, you can remove the horizontal lines and extract text ROIs (regions of interest) around them. In the best case you will be able to extract separate characters, but even if you can't, you will get better results and will be able to distinguish the resulting names line by line.
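
To make the line-removal idea concrete, here is a simplified Python sketch. It assumes the scan has already been binarized into a list of rows where 1 means ink and 0 means background; loading and thresholding a real image would need an image library, which is left out here. The function names and the 0.8 fill threshold are hypothetical choices, not part of Tesseract.

```python
def find_rule_rows(image, min_fill=0.8):
    """Return indices of rows that are mostly ink, i.e. likely ruled lines.

    image is a list of equal-length, non-empty rows of 0/1 values.
    """
    return [y for y, row in enumerate(image)
            if sum(row) / len(row) >= min_fill]


def split_into_bands(image, min_fill=0.8):
    """Ignore ruled lines and return (top, bottom) row ranges with handwriting.

    bottom is exclusive, so image[top:bottom] is one text band (ROI).
    """
    rule_rows = set(find_rule_rows(image, min_fill))
    bands, start = [], None
    for y, row in enumerate(image):
        has_ink = y not in rule_rows and any(row)
        if has_ink and start is None:
            start = y                 # a new band of handwriting begins
        elif not has_ink and start is not None:
            bands.append((start, y))  # band ends at a blank or ruled row
            start = None
    if start is not None:
        bands.append((start, len(image)))
    return bands
```

Each band could then be cropped and passed to the OCR step on its own. One limitation of this sketch: handwriting that dips below a ruled line shows up as a small extra band, so very thin bands may need to be merged with the band above.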

You should also try the other recommended steps for improving output quality, which you can find in the Tesseract OCR wiki (link).

Whitted answered 19/11, 2017 at 0:49 Comment(1)
Thanks, I tried: tesseract simple.png -psm 4 outtxt -c tessedit_char_whitelist=" ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789." It was a small bit better, but not much. I will try the quality improvements next. - Garganey
