Python Difflib Deltas and Compare Ndiff

E

3

4

I was looking to do something like what I believe change control systems do, they compare two files, and save a small diff each time the file changes. I've been reading this page: http://docs.python.org/library/difflib.html and it's not sinking in to my head apparently.

I was trying to recreate this in a somewhat simple program shown below, but the thing that I seem to be missing is that the Delta's contain at least as much as the original file, and more.

Is it not possible to get to just the pure changes? The reason I ask is hopefully obvious - to save disk space.
I could just save the entire chunk of code each time, but it would be better to save current code once, then small diffs of the changes.

I'm also still trying to figure out why many difflib functions return a generator instead of a list, what's the advantage there?

Will difflib work for me - or I need to find a more professional package with more features?

# Python Difflib demo 
# Author: Neal Walters 
# loosely based on http://ahlawat.net/wordpress/?p=371
# 01/17/2011 

# build the files here - later we will just read the files probably 
file1Contents="""
for j = 1 to 10: 
   print "ABC"
   print "DEF" 
   print "HIJ"
   print "JKL"
   print "Hello World"
   print "j=" + j 
   print "XYZ"
"""

file2Contents = """
for j = 1 to 10: 
   print "ABC"
   print "DEF" 
   print "HIJ"
   print "JKL"
   print "Hello World"
   print "XYZ"
print "The end"
"""

filename1 = "diff_file1.txt" 
filename2 = "diff_file2.txt" 

file1 = open(filename1,"w") 
file2 = open(filename2,"w") 

file1.write(file1Contents) 
file2.write(file2Contents) 

file1.close()
file2.close() 
#end of file build 

lines1 = open(filename1, "r").readlines()
lines2 = open(filename2, "r").readlines()

import difflib

print "\n FILE 1 \n" 
for line in lines1:
  print line 

print "\n FILE 2 \n" 
for line in lines2: 
  print line 

diffSequence = difflib.ndiff(lines1, lines2) 

print "\n ----- SHOW DIFF ----- \n" 
for i, line in enumerate(diffSequence):
    print line

diffObj = difflib.Differ() 
deltaSequence = diffObj.compare(lines1, lines2) 
deltaList = list(deltaSequence) 

print "\n ----- SHOW DELTALIST ----- \n" 
for i, line in enumerate(deltaList):
    print line



#let's suppose we store just the diffSequence in the database 
#then we want to take the current file (file2) and recreate the original (file1) from it
#by backward applying the diff 

restoredFile1Lines = difflib.restore(diffSequence,1)  # 1 indicates file1 of 2 used to create the diff 

restoreFileList = list(restoredFile1Lines)

print "\n ----- SHOW REBUILD OF FILE1 ----- \n" 
# this is not showing anything! 
for i, line in enumerate(restoreFileList): 
    print line

Thanks!

UPDATE:

contextDiffSeq = difflib.context_diff(lines1, lines2) 
contextDiffList = list(contextDiffSeq) 

print "\n ----- SHOW CONTEXTDIFF ----- \n" 
for i, line in enumerate(contextDiffList):
    print line

----- SHOW CONTEXTDIFF -----

* 5,9 **
 print "HIJ"

 print "JKL"

 print "Hello World"
print "j=" + j

print "XYZ"

--- 5,9 ----
 print "HIJ"

 print "JKL"

 print "Hello World"

 print "XYZ"
print "The end"

Another update:

In the old days of Panvalet an Librarian, source management tools for the mainframe, you could create a changeset like this:

++ADD 9
   print "j=" + j

Which simply mean add a line (or lines) after line 9. Then there word words like ++REPLACE or ++UPDATE. http://www4.hawaii.gov/dags/icsd/ppmo/Stds_Web_Pages/pdf/it110401.pdf

Evangelin answered 20/1, 2011 at 4:10 Comment(0)

F

4

Diffs must contain enough information to make it possible to patch a version into another, so yes, for your experiment of a single-line change to a very small document, storing the whole documents could be cheaper.

Library functions return iterators to make it easier on clients that are tight on memory or only need to look at part of the resulting sequence. It's ok in Python because every iterator can be converted to a list with a very short list(an_iterator) expression.

Most differencing is done on lines of text, but it is possible to go down to the char-by-char, and difflib does it. Take a look at the Differ class of object in difflib.

The examples all over the place use human-friendly output, but the diffs are managed internally in a much more compact, computer-friendly way. Also, diffs usually contain redundant information (like the text of a line to delete) to make patching and merging changes safe. The redundancy can be removed by your own code, if you feel comfortable with that.

I just read that difflib opts for least-surprise in favor of optimality, which is something I won't argue against. There are well known algorithms that are fast at producing a minimum set of changes.

I once coded a generic diffing engine along with one of the optimum algorithms in about 1250 lines of Java (JRCS). It works for any sequence of elements that can be compared for equality. If you want to build your own solution, I think that a translation/reimplementation of JRCS should take no more than 300 lines of Python.

Processing the output produced by difflib to make it more compact is also an option. This is an example from a small files with three changes (an addition, a change, and a deletion):

---  
+++  
@@ -7,0 +7,1 @@
+aaaaa
@@ -9,1 +10,1 @@
-c= 0
+c= 1
@@ -15,1 +16,0 @@
-    m = re.match(code_re, text)

What the patch says can be easily condensed to:

+7,1 
aaaaa
-9,1 
+10,1
c= 1
-15,1

For your own example the condensed output would be:

-8,1
+9,1
print "The end"

For safety, leaving in a leading marker ('>') for lines that must be inserted might be a good idea.

-8,1
+9,1
>print "The end"

Is that closer to what you need?

This is a simple function to do the compacting. You'll have to write your own code to apply the patch in that format, but it should be straightforward.

def compact_a_unidiff(s):
    s = [l for l in s if l[0] in ('+','@')]
    result = []
    for l in s:
        if l.startswith('++'):
            continue
        elif l.startswith('+'):
            result.append('>'+ l[1:])
        else:
            del_cmd, add_cmd = l[3:-3].split()
            del_pair, add_pair = (c.split(',') for c in (del_cmd,add_cmd))
            if del_pair[1]  != '0':
                result.append(del_cmd)
            if add_pair[1] != '0':
                result.append(add_cmd)
    return result

Fallfish answered 20/1, 2011 at 5:4 Comment(6)

Thanks (see update and note to @Karl above). I'm probably going to be dealing with short code samples - 10-20 lines in general - so I guess I'm better off to abandon the diff concept. And how would you store the diff in a database blob? Serialize it, or do a ''.join – Evangelin 20/1, 2011 at 5:19

It all depends on the purpose. Why do you want to compare the small documents? How many documents are there? How many revisions to each? If you take control of the diffing engine, then the storage format can be made much more compact than any of the revisions on the average case (think quicksort). If what you want is to use/show the differences at certain points in time, then, for your small documents, you can compute them on-the-fly when you need them with difflib. Do let me know what you decide (I need an excuse to spend a day translating JRCS.Diff to Python). – Fallfish 20/1, 2011 at 5:38

Based on the info here, I'm just going to store the entire chunk of code for now. If my project blossoms, and I get a million users, then I can worry about my cloud-hosting disk-space cost, and then tweak it. If you translate to Python, try to make it not dependent on C so it will run Google App Engine (or two versions). – Evangelin 21/1, 2011 at 3:47

Added "another update" to original post. I woke up this morning think about how tool old mainframe tools used to do this... – Evangelin 21/1, 2011 at 14:28

@Evangelin The format used by RCS/CVS is also very compact, and can be used with many libraries. I think it would be good to play a little more with difflib. For example, you can try using the unified format with zero lines of context and process the output to make it more compact. See my latest edit for an example. – Fallfish 21/1, 2011 at 16:5

@Fallfish Yea, that's pretty much where I might head in the future. But like I said above, let's see if I can even get the site to be successful before I burn cycles on saving disk space. – Evangelin 21/1, 2011 at 16:35

O

6

I'm also still trying to figure out why many difflib functions return a generator instead of a list, what's the advantage there?

Well, think about it for a second - if you compare files, those files can in theory (and will be in practice) be quite large - returning the delta as a list, for exampe, means reading the complete data into memory, which is not a smart thing to do.

As for only returning the difference, well, there is another advantage in using a generator - just iterate over the delta and keep whatever lines you are interested in.

If you read the difflib documentation for Differ - style deltas, you will see a paragraph that reads:

Each line of a Differ delta begins with a two-letter code:
Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence

So, if you only want differences, you can easily filter those out by using str.startswith

You can also use difflib.context_diff to obtain a compact delta which shows only the changes.

Oblivion answered 20/1, 2011 at 4:33 Comment(0)

F

4

Diffs must contain enough information to make it possible to patch a version into another, so yes, for your experiment of a single-line change to a very small document, storing the whole documents could be cheaper.

Library functions return iterators to make it easier on clients that are tight on memory or only need to look at part of the resulting sequence. It's ok in Python because every iterator can be converted to a list with a very short list(an_iterator) expression.

Most differencing is done on lines of text, but it is possible to go down to the char-by-char, and difflib does it. Take a look at the Differ class of object in difflib.

The examples all over the place use human-friendly output, but the diffs are managed internally in a much more compact, computer-friendly way. Also, diffs usually contain redundant information (like the text of a line to delete) to make patching and merging changes safe. The redundancy can be removed by your own code, if you feel comfortable with that.

I just read that difflib opts for least-surprise in favor of optimality, which is something I won't argue against. There are well known algorithms that are fast at producing a minimum set of changes.

I once coded a generic diffing engine along with one of the optimum algorithms in about 1250 lines of Java (JRCS). It works for any sequence of elements that can be compared for equality. If you want to build your own solution, I think that a translation/reimplementation of JRCS should take no more than 300 lines of Python.

Processing the output produced by difflib to make it more compact is also an option. This is an example from a small files with three changes (an addition, a change, and a deletion):

---  
+++  
@@ -7,0 +7,1 @@
+aaaaa
@@ -9,1 +10,1 @@
-c= 0
+c= 1
@@ -15,1 +16,0 @@
-    m = re.match(code_re, text)

What the patch says can be easily condensed to:

+7,1 
aaaaa
-9,1 
+10,1
c= 1
-15,1

For your own example the condensed output would be:

-8,1
+9,1
print "The end"

For safety, leaving in a leading marker ('>') for lines that must be inserted might be a good idea.

-8,1
+9,1
>print "The end"

Is that closer to what you need?

This is a simple function to do the compacting. You'll have to write your own code to apply the patch in that format, but it should be straightforward.

def compact_a_unidiff(s):
    s = [l for l in s if l[0] in ('+','@')]
    result = []
    for l in s:
        if l.startswith('++'):
            continue
        elif l.startswith('+'):
            result.append('>'+ l[1:])
        else:
            del_cmd, add_cmd = l[3:-3].split()
            del_pair, add_pair = (c.split(',') for c in (del_cmd,add_cmd))
            if del_pair[1]  != '0':
                result.append(del_cmd)
            if add_pair[1] != '0':
                result.append(add_cmd)
    return result

Fallfish answered 20/1, 2011 at 5:4 Comment(6)

Thanks (see update and note to @Karl above). I'm probably going to be dealing with short code samples - 10-20 lines in general - so I guess I'm better off to abandon the diff concept. And how would you store the diff in a database blob? Serialize it, or do a ''.join – Evangelin 20/1, 2011 at 5:19

It all depends on the purpose. Why do you want to compare the small documents? How many documents are there? How many revisions to each? If you take control of the diffing engine, then the storage format can be made much more compact than any of the revisions on the average case (think quicksort). If what you want is to use/show the differences at certain points in time, then, for your small documents, you can compute them on-the-fly when you need them with difflib. Do let me know what you decide (I need an excuse to spend a day translating JRCS.Diff to Python). – Fallfish 20/1, 2011 at 5:38

Based on the info here, I'm just going to store the entire chunk of code for now. If my project blossoms, and I get a million users, then I can worry about my cloud-hosting disk-space cost, and then tweak it. If you translate to Python, try to make it not dependent on C so it will run Google App Engine (or two versions). – Evangelin 21/1, 2011 at 3:47

Added "another update" to original post. I woke up this morning think about how tool old mainframe tools used to do this... – Evangelin 21/1, 2011 at 14:28

@Evangelin The format used by RCS/CVS is also very compact, and can be used with many libraries. I think it would be good to play a little more with difflib. For example, you can try using the unified format with zero lines of context and process the output to make it more compact. See my latest edit for an example. – Fallfish 21/1, 2011 at 16:5

@Fallfish Yea, that's pretty much where I might head in the future. But like I said above, let's see if I can even get the site to be successful before I burn cycles on saving disk space. – Evangelin 21/1, 2011 at 16:35

S

1

You want to use the unified or context diff if you just want the changes. You're seeing bigger files because it includes the lines they have in common.

The advantage of returning a generator is that the entire thing doesn't need to be held in memory at once. This can be useful for diffing very large files.

Scanderbeg answered 20/1, 2011 at 4:28 Comment(1)

Thanks, I'm still shocked that the context_diff is larger than file2. It seems like the diff would say something like "add 'print "j=" + j ' at position 235" in whatever internal syntax it wanted to use. And can you do a .restore from a context_diff? – Evangelin 20/1, 2011 at 5:16

Recommended topics

Hot tags