Imagine you have several OCR tools to read text from images, but none of them gives you 100% accurate output. Combined, however, the results could come very close to the ground truth. What would be the best technique to "fuse" the outputs together to get a good result?
Example:
Actual text:
§ 5.1: The contractor is obliged to announce the delay by 01.01.2019 at the latest. The identification-number to be used is OZ-771LS.
OCR tool 1:
5 5.1 The contractor is obliged to announce the delay by O1.O1.2019 at the latest. The identification-number to be used is OZ77lLS.
OCR tool 2:
§5.1: The contract or is obliged to announce theedelay by 01.O1. 2O19 at the latest. The identification number to be used is O7-771LS
OCR tool 3:
§ 5.1: The contractor is oblige to do announced he delay by 01.01.2019 at the latest. T he identification-number ti be used is OZ-771LS.
What could be a promising algorithm to fuse the outputs of tools 1, 2 and 3 to recover the actual text?
My first idea was to create a "tumbling window" of arbitrary length, compare the words inside the window, and take, for every position, the word that at least 2 of the 3 tools predict (see the sketch after the example below).
For example, with window size 3:
[5 5.1 The]
[§5.1: The contract]
[§ 5.1: The]
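To make the idea concrete, here is a minimal sketch of that position-wise majority vote. This is a rough illustration only; `fuse_by_vote` is a made-up name, and I'm assuming plain whitespace tokenization:

```python
from collections import Counter

def fuse_by_vote(outputs):
    """Naive fusion: whitespace-tokenize each OCR output and, at every
    position, keep the token that at least 2 of the 3 tools agree on.
    Falls back to the first tool's token when there is no majority."""
    token_lists = [text.split() for text in outputs]
    fused = []
    # zip() silently stops at the shortest output, which already hints
    # at the core weakness: the token streams are assumed to be aligned.
    for candidates in zip(*token_lists):
        token, count = Counter(candidates).most_common(1)[0]
        fused.append(token if count >= 2 else candidates[0])
    return " ".join(fused)

print(fuse_by_vote([
    "5 5.1 The contractor is",
    "§5.1: The contract or is",
    "§ 5.1: The contractor is",
]))  # -> "5 5.1 The contractor is" (the positions never line up)
```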
As the windows show, the approach breaks down immediately: all three tools produce a different candidate for position one (5, §5.1: and §), so there is no majority to vote for.
Of course it would be possible to add tricks like comparing tokens by Levenshtein distance to tolerate small deviations, but I fear this will not be robust enough.
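To sketch that direction anyway: one could align the token streams before voting. The snippet below uses the standard-library difflib (Ratcliff/Obershelp matching) as a stand-in for a proper Levenshtein-based alignment; `align_to_anchor` and `fuse_aligned` are made-up names, and the whole thing only illustrates the idea, not a finished solution:

```python
import difflib
from collections import Counter

def align_to_anchor(anchor, other):
    """Map each token position of `anchor` to the token(s) that
    `other` produced at the corresponding place, using difflib's
    sequence alignment on the token lists."""
    mapping = {i: [] for i in range(len(anchor))}
    sm = difflib.SequenceMatcher(None, anchor, other, autojunk=False)
    for tag, a1, a2, b1, b2 in sm.get_opcodes():
        if tag in ("equal", "replace"):
            for off in range(a2 - a1):
                if b1 + off < b2:  # uneven 'replace' blocks: extra tokens drop
                    mapping[a1 + off].append(other[b1 + off])
    return mapping

def fuse_aligned(outputs):
    """Alignment-based fusion: take the first output as the anchor,
    align the other outputs to it, then majority-vote per position.
    Tokens the anchor missed entirely are lost -- a real solution
    would need a symmetric multi-sequence alignment."""
    tokens = [t.split() for t in outputs]
    anchor = tokens[0]
    votes = [[tok] for tok in anchor]  # the anchor votes for itself
    for other in tokens[1:]:
        for i, cands in align_to_anchor(anchor, other).items():
            votes[i].extend(cands)
    # Ties resolve in insertion order, i.e. in favor of the anchor token.
    return " ".join(Counter(v).most_common(1)[0][0] for v in votes)
```

Even with the alignment, positions like (5, §5.1:, §) stay ambiguous, and fusing at the character level instead of on whitespace tokens might handle cases like §5.1: vs. § 5.1: better, since the disagreement there is really about spacing. That is exactly the kind of case I'd like a robust algorithm to handle.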