Generate "fuzzy" difference of two files in Python, with approximate comparison of floats

I have a problem comparing two files. Basically, what I want to do is a UNIX-like diff between two files, for example:

$ diff -u left-file right-file

However, my two files contain floats, and because they were generated on different architectures (while computing the same things), the floating-point values are not exactly the same (they may differ by, say, 1e-10). What I want from 'diffing' the files is to find the differences I consider significant (for example, a difference of more than 1e-4); with the UNIX diff command, almost every line containing floating-point values is reported as different! That's my problem: how can I get a diff like the one 'diff -u' provides, but with a looser comparison of floats?

I thought I would write a Python script to do that, and found the difflib module, which provides diff-like comparison. But the documentation I found explains how to use it as-is (through a single method) and describes the inner objects; I cannot find anything on how to customize a difflib object to meet my needs (such as overriding only the comparison method)... I guess one solution would be to retrieve the unified diff and parse it 'manually' to remove my 'false' differences, but this is not elegant; I would prefer to use the existing framework.

So, does anybody know how to customize this library so that I can do what I want? Or at least point me in the right direction... If not in Python, maybe a shell script could do the job?

Any help would be greatly appreciated! Thanks in advance for your answers!

Hobbyhorse answered 24/6, 2010 at 8:23 Comment(5)
Maybe you'd also like: Good Python modules for fuzzy string comparison? - Kindergarten
A simpler alternative would be to pre-process the files and format the floats to the desired accuracy. - Strange
Please post a couple of corresponding lines from the sample input files. - Childbearing
(For the float lines, we cannot put them through difflib, so we will write a custom differ (ad hoc, or regex-based). I hope you don't have text and floats mixed on one line.) - Childbearing
I gave you a good answer five years ago... feel free to accept it. - Childbearing
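
The pre-processing idea suggested in the comments could look roughly like this: round every float in both files to a fixed number of decimals, then hand the normalized lines to difflib.unified_diff. This is only a sketch under assumptions: the helper names (normalize_line, fuzzy_unified_diff), the regex, and the choice of 4 decimals are illustrative and not from the thread. Note also that two values straddling a rounding boundary (say 0.00004999 and 0.00005001) will still show up as different.

```python
import difflib
import re

# Decimal numbers with an optional sign and exponent (integers are left alone).
FLOAT_RE = re.compile(r'[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?')

def normalize_line(line, ndigits=4):
    """Rewrite every float in `line` rounded to `ndigits` decimals (hypothetical helper)."""
    def _round(match):
        # Adding 0.0 turns a rounded -0.0 into 0.0 so both print identically.
        value = round(float(match.group()), ndigits) + 0.0
        return '%.*f' % (ndigits, value)
    return FLOAT_RE.sub(_round, line)

def fuzzy_unified_diff(left_lines, right_lines, ndigits=4):
    """Unified diff of two lists of lines after rounding their floats."""
    left = [normalize_line(line, ndigits) for line in left_lines]
    right = [normalize_line(line, ndigits) for line in right_lines]
    return list(difflib.unified_diff(left, right, 'left-file', 'right-file',
                                     lineterm=''))
```

The output shows the rounded values rather than the original text, which may or may not be acceptable depending on what you do with the diff.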

In your case we specialize the general case: before we pass things into difflib, we need to detect and separately handle lines containing floats. Here is a basic approach; if you want to generate the deltas, lines of context, etc., you can build on it. Note that it is easier to fuzzy-compare floats as actual floats rather than as strings (although you could code a column-by-column differ and ignore the digits beyond the 1e-4 place).

import re

# Decimal numbers with optional sign and exponent; integers without '.' are not matched.
float_pat = re.compile(r'[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?')

def fuzzydiffer(line1, line2):
    """Perform fuzzy-diff on floats, else normal diff. Return True if the lines match."""
    floats1 = list(float_pat.finditer(line1))
    if not floats1:
        return line1 == line2  # no floats: run your usual diff() here instead
    floats2 = list(float_pat.finditer(line2))
    match = True
    for m1, m2 in zip(floats1, floats2):
        # m1.start() is the column of this occurrence; str.index() would
        # return the first look-alike substring instead.
        if not fuzzy_float_cmp(m1.group(), m2.group()):
            match = False
            print("Lines mismatch at col %d:" % m1.start(), line1, line2)
    # or, without reporting columns:
    # all(fuzzy_float_cmp(f1, f2) for f1, f2 in zip(float_pat.findall(line1), float_pat.findall(line2)))
    return match

def fuzzy_float_cmp(f1, f2, epsilon=1e-4):
    """Fuzzy-compare two strings representing floats."""
    float1, float2 = float(f1), float(f2)
    return abs(float1 - float2) < epsilon

Some tests:

fuzzydiffer('text: 558.113509766 +23477547.6407 -0.867086648057 0.009291785451', 
'text: 558.11351 +23477547.6406 -0.86708665 0.009292000001')
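
The absolute threshold used above puts the pair 23477547.6407 / 23477547.6406 right at the 1e-4 limit, although the two values agree to about eleven significant digits. If that is not what you want, one possible variant (not part of the code above) combines a relative and an absolute tolerance via math.isclose from the standard library (Python 3.5+). The name fuzzy_float_cmp_rel and the default tolerances are made up for illustration.

```python
import math

def fuzzy_float_cmp_rel(f1, f2, rel_tol=1e-9, abs_tol=1e-4):
    """Compare two float strings; equal if within rel_tol OR within abs_tol.

    Hypothetical variant of fuzzy_float_cmp: math.isclose accepts the pair
    when abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol).
    """
    return math.isclose(float(f1), float(f2), rel_tol=rel_tol, abs_tol=abs_tol)
```

With these defaults large values are judged by their relative error, while values near zero fall back to the absolute 1e-4 threshold.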

As a bonus, here's a version that highlights the differing columns:

import re

# Decimal numbers with optional sign and exponent; integers without '.' are not matched.
float_pat = re.compile(r'[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?')

def fuzzydiffer(line1, line2):
    """Perform fuzzy-diff on floats, else normal diff. Return True if the lines match."""
    floats1 = list(float_pat.finditer(line1))
    if not floats1:
        return line1 == line2  # no floats: run your usual diff() here instead
    match = True
    coldiffs1 = ' ' * len(line1)
    coldiffs2 = ' ' * len(line2)
    floats2 = list(float_pat.finditer(line2))
    for m1, m2 in zip(floats1, floats2):
        # span() gives the columns of this occurrence, even if the same text appears earlier
        (col1s, col1e), (col2s, col2e) = m1.span(), m2.span()
        if not fuzzy_float_cmp(m1.group(), m2.group()):
            match = False
            coldiffs1 = coldiffs1[:col1s] + 'v' * (col1e - col1s) + coldiffs1[col1e:]
            coldiffs2 = coldiffs2[:col2s] + '^' * (col2e - col2s) + coldiffs2[col2e:]
            # break  # if you only need to highlight the first mismatch
    if not match:
        print('Lines mismatch:')
        print('  ', coldiffs1)
        print('< ', line1)
        print('> ', line2)
        print('  ', coldiffs2)
    return match

def fuzzy_float_cmp(f1, f2, epsilon=1e-4):
    """Fuzzy-compare two strings representing floats."""
    print("Comparing:", f1, f2)
    float1, float2 = float(f1), float(f2)
    return abs(float1 - float2) < epsilon
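
To build a whole-file diff on top of this, one option is to let difflib align the lines by their non-float text and apply the fuzzy comparison only to lines it pairs up. The sketch below masks every float with a placeholder before calling difflib.SequenceMatcher. The names mask_floats, floats_close and fuzzy_file_diff are hypothetical, and the output is a bare list of '-'/'+' lines without hunk headers or context.

```python
import difflib
import re

FLOAT_RE = re.compile(r'[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?')

def mask_floats(line):
    """Replace every float with a placeholder so only the other text is compared."""
    return FLOAT_RE.sub('<F>', line)

def floats_close(line1, line2, epsilon=1e-4):
    """True if both lines hold the same number of floats, pairwise within epsilon."""
    f1 = [float(s) for s in FLOAT_RE.findall(line1)]
    f2 = [float(s) for s in FLOAT_RE.findall(line2)]
    return len(f1) == len(f2) and all(abs(a - b) < epsilon for a, b in zip(f1, f2))

def fuzzy_file_diff(left_lines, right_lines, epsilon=1e-4):
    """Return '-'/'+' prefixed lines for the significant differences only."""
    matcher = difflib.SequenceMatcher(
        None,
        [mask_floats(line) for line in left_lines],
        [mask_floats(line) for line in right_lines],
        autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            # Same non-float text: report the pair only if a float differs too much.
            for left, right in zip(left_lines[i1:i2], right_lines[j1:j2]):
                if not floats_close(left, right, epsilon):
                    out += ['-' + left, '+' + right]
        else:
            # replace / delete / insert: the text itself differs, report as-is.
            out += ['-' + left for left in left_lines[i1:i2]]
            out += ['+' + right for right in right_lines[j1:j2]]
    return out
```

Read both files with readlines() (or splitlines()) and pass the two lists in; an empty result means no difference above epsilon was found.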
Childbearing answered 3/7, 2011 at 1:7 Comment(0)
