How to combine the results of multiple OCR tools to get better text recognition [closed]
Imagine you have several OCR tools for reading text from images, but none of them gives you 100% accurate output. Combined, however, the results could come very close to the ground truth. What would be the best technique to "fuse" the texts together to get a good result?

Example:

Actual text

§ 5.1: The contractor is obliged to announce the delay by 01.01.2019 at the latest. The identification-number to be used is OZ-771LS.

OCR tool 1

5 5.1 The contractor is obliged to announce the delay by O1.O1.2019 at the latest. The identification-number to be used is OZ77lLS.

OCR tool 2

§5.1: The contract or is obliged to announce theedelay by 01.O1. 2O19 at the latest. The identification number to be used is O7-771LS

OCR tool 3

§ 5.1: The contractor is oblige to do announced he delay by 01.01.2019 at the latest. T he identification-number ti be used is OZ-771LS.

What could be a promising algorithm to fuse OCR 1, 2 and 3 to get the actual text?

My first idea was to create a "tumbling window" of arbitrary length, compare the words inside the window, and for every position take the word that two out of three tools predict.

For example with window size 3:

[5 5.1 The] 
[§5.1: The contract] 
[§ 5.1: The] 

As you can see, this doesn't work, because all three tools have different candidates for position one (5, §5.1: and §).

Of course, it would be possible to add tricks like the Levenshtein distance to tolerate some deviation, but I fear this will not be robust enough.
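For illustration, the fuzzy per-position vote could be sketched like this. This is only a rough sketch using difflib's SequenceMatcher ratio as a stand-in for the Levenshtein distance, and fuse_position is a hypothetical helper name:

```python
from difflib import SequenceMatcher


def fuse_position(candidates):
    """Pick the candidate most similar to all others
    (a fuzzy majority vote over one window position)."""
    def total_similarity(word):
        return sum(SequenceMatcher(None, word, other).ratio()
                   for other in candidates)
    return max(candidates, key=total_similarity)


# An exact majority behaves like a normal 2-out-of-3 vote ...
print(fuse_position(["contractor", "contract", "contractor"]))  # contractor
# ... and even the ambiguous first position resolves to one candidate:
print(fuse_position(["5", "§5.1:", "§"]))  # §5.1:
```

This only works once the words have been aligned position-wise, which is exactly the hard part described above.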

Trapper answered 26/3, 2019 at 23:28 Comment(1)
Might be helpful to view this as a merging problem. Not a trivial topic, though. – Soulless

To me this looks like a beautiful ensemble inference problem.

There is more than one approach to merging the predictions of multiple models. It is easiest for classification problems, where a model's prediction can intuitively be thought of as a vote. It is then up to you to decide how to process the votes: do you want to weight a specific model more heavily (for example, if it has superior performance)? Do you want to average the predictions (not very meaningful for your NLP use case)? Or do you want to opt for the class (here: the character) with the maximum number of votes?

The last option is called max voting, and I will show it as an example.

    from collections import Counter

    from sklearn.base import BaseEstimator, TransformerMixin

    class MaxVotingEnsemble(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            # Nothing to learn; defined for scikit-learn API compatibility.
            return self

        def transform(self, X):
            # Zip the predictions character-wise. Note that zip stops at the
            # shortest string, so trailing characters of longer OCR outputs
            # are silently dropped.
            zipped_predictions = zip(*X)

            # For each position, take the most frequent character. On a tie,
            # Counter.most_common returns the character encountered first,
            # i.e. the one from the first model.
            merged_predictions = []
            for predictions in zipped_predictions:
                mode_prediction = Counter(predictions).most_common(1)[0][0]
                merged_predictions.append(mode_prediction)

            return merged_predictions

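The same per-position vote can also be written compactly without the class wrapper, applied directly to the three example outputs from the question (again, zip stops at the shortest string, so trailing characters of longer outputs are dropped):

```python
from collections import Counter

# The three OCR outputs from the question.
ocr_outputs = [
    "5 5.1 The contractor is obliged to announce the delay by O1.O1.2019 "
    "at the latest. The identification-number to be used is OZ77lLS.",
    "§5.1: The contract or is obliged to announce theedelay by 01.O1. 2O19 "
    "at the latest. The identification number to be used is O7-771LS",
    "§ 5.1: The contractor is oblige to do announced he delay by 01.01.2019 "
    "at the latest. T he identification-number ti be used is OZ-771LS.",
]

# Character-wise majority vote over all outputs.
merged = "".join(
    Counter(chars).most_common(1)[0][0] for chars in zip(*ocr_outputs)
)
print(merged)
```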
I ran this in Python 3.11 and I get:

Merged Predictions:
§ 5.1: The contractor is obliged to announce the delay by 01.01.2019 at the latest. The identifiiation-nnmber to be used is OZ-771LS.

As you can see, this works nicely out of the box. However, this is mainly due to the fact that, when there is no majority, the prediction opts for the first string (which, on closer inspection, is already a pretty good approximation of the desired result).

This is where the low-hanging fruit has been harvested, and it gets more cumbersome to improve the result further. Here are some ideas for next steps:

  1. Adding more models will usually get you better results, as falling back to the first string becomes less likely.
  2. Weighting models based on their performance makes the prediction more robust as well.
  3. Currently, characters are compared naively by their index/position. We could use a sequence alignment algorithm to find the optimal alignment of the characters instead. One such algorithm is the Needleman-Wunsch algorithm, which is often used in bioinformatics. In Python, the pairwise2 module from Biopython has your back.

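To make idea 3 concrete without pulling in Biopython, here is a minimal pure-Python sketch of Needleman-Wunsch global alignment. The scoring parameters are illustrative defaults, not values tuned for OCR:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return both padded with '-' for gaps."""
    n, m = len(a), len(b)

    # Build the dynamic-programming score table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, inserting '-' for gaps.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


print(needleman_wunsch("GATT", "GCATT"))  # ('G-ATT', 'GCATT')
```

With each OCR output aligned against a common reference this way, the character-wise vote above compares characters that actually correspond, instead of characters that merely share an index.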
And this is where I leave you, providing a first step for setting up your solution.

Zenithal answered 7/12, 2023 at 9:9 Comment(0)
