Imagine you have several OCR tools to read text from images, but none of them gives you 100% accurate output. Combined, however, the results could come very close to the ground truth. What would be the best technique to "fuse" the outputs together to get a good result?
Example:
Actual text:
§ 5.1: The contractor is obliged to announce the delay by 01.01.2019 at the latest. The identification-number to be used is OZ-771LS.
OCR tool 1:
5 5.1 The contractor is obliged to announce the delay by O1.O1.2019 at the latest. The identification-number to be used is OZ77lLS.
OCR tool 2:
§5.1: The contract or is obliged to announce theedelay by 01.O1. 2O19 at the latest. The identification number to be used is O7-771LS
OCR tool 3:
§ 5.1: The contractor is oblige to do announced he delay by 01.01.2019 at the latest. T he identification-number ti be used is OZ-771LS.
What could be a promising algorithm to fuse the outputs of tools 1, 2 and 3 to recover the actual text?
My first idea was to create a "tumbling window" of arbitrary length, compare the words inside the window, and take, for every position, the word that at least 2 of the 3 tools predict (see the sketch after the example below).
For example, with window size 3:
[5 5.1 The]
[§5.1: The contract]
[§ 5.1: The]
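To make the idea concrete, here is a minimal sketch of that position-wise majority vote. This is a rough illustration only; `fuse_by_vote` is a made-up name, and I'm assuming plain whitespace tokenization:

```python
from collections import Counter

def fuse_by_vote(outputs):
    """Naive fusion: whitespace-tokenize each OCR output and, at every
    position, keep the token that at least 2 of the 3 tools agree on.
    Falls back to the first tool's token when there is no majority."""
    token_lists = [text.split() for text in outputs]
    fused = []
    # zip() silently stops at the shortest output, which already hints
    # at the core weakness: the token streams are assumed to be aligned.
    for candidates in zip(*token_lists):
        token, count = Counter(candidates).most_common(1)[0]
        fused.append(token if count >= 2 else candidates[0])
    return " ".join(fused)

print(fuse_by_vote([
    "5 5.1 The contractor is",
    "§5.1: The contract or is",
    "§ 5.1: The contractor is",
]))  # -> "5 5.1 The contractor is" (the positions never line up)
```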
As the windows show, the approach breaks down immediately: all three tools produce a different candidate for position one (5, §5.1: and §), so there is no majority to vote for.
Of course it would be possible to add tricks like comparing tokens by Levenshtein distance to tolerate small deviations, but I fear this will not be robust enough.
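To sketch that direction anyway: one could align the token streams before voting. The snippet below uses the standard-library difflib (Ratcliff/Obershelp matching) as a stand-in for a proper Levenshtein-based alignment; `align_to_anchor` and `fuse_aligned` are made-up names, and the whole thing only illustrates the idea, not a finished solution:

```python
import difflib
from collections import Counter

def align_to_anchor(anchor, other):
    """Map each token position of `anchor` to the token(s) that
    `other` produced at the corresponding place, using difflib's
    sequence alignment on the token lists."""
    mapping = {i: [] for i in range(len(anchor))}
    sm = difflib.SequenceMatcher(None, anchor, other, autojunk=False)
    for tag, a1, a2, b1, b2 in sm.get_opcodes():
        if tag in ("equal", "replace"):
            for off in range(a2 - a1):
                if b1 + off < b2:  # uneven 'replace' blocks: extra tokens drop
                    mapping[a1 + off].append(other[b1 + off])
    return mapping

def fuse_aligned(outputs):
    """Alignment-based fusion: take the first output as the anchor,
    align the other outputs to it, then majority-vote per position.
    Tokens the anchor missed entirely are lost -- a real solution
    would need a symmetric multi-sequence alignment."""
    tokens = [t.split() for t in outputs]
    anchor = tokens[0]
    votes = [[tok] for tok in anchor]  # the anchor votes for itself
    for other in tokens[1:]:
        for i, cands in align_to_anchor(anchor, other).items():
            votes[i].extend(cands)
    # Ties resolve in insertion order, i.e. in favor of the anchor token.
    return " ".join(Counter(v).most_common(1)[0][0] for v in votes)
```

Even with the alignment, positions like (5, §5.1:, §) stay ambiguous, and fusing at the character level instead of on whitespace tokens might handle cases like §5.1: vs. § 5.1: better, since the disagreement there is really about spacing. That is exactly the kind of case I'd like a robust algorithm to handle.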