How does a 'diff' algorithm work, e.g. in VCDIFF and DiffMerge? [closed]
Asked Answered
Q

5

183

I've been looking like crazy for an explanation of a diff algorithm that works and is efficient.

The closest I got is this link to RFC 3284 (from several Eric Sink blog posts), which describes in perfectly understandable terms the data format in which the diff results are stored. However, it has no mention whatsoever as to how a program would reach these results while doing a diff.

I'm trying to research this out of personal curiosity, because I'm sure there must be tradeoffs when implementing a diff algorithm, which are pretty clear sometimes when you look at diffs and wonder "why did the diff program chose this as a change instead of that?"...

Where can I find a description of an efficient algorithm that'd end up outputting VCDIFF?
By the way, if you happen to find a description of the actual algorithm used by SourceGear's DiffMerge, that'd be even better.

Note: longest common subsequence doesn't seem to be the algorithm used by VCDIFF. It looks like they're doing something smarter, given the data format they use.

Quits answered 30/4, 2009 at 6:37 Comment(8)
Nothing on wikipedia ? You can maybe try to find another implementation in a hight level langage like python, that might be easier to understand than a C implementation. Python is famous for being easily readable ? There's a difflib in python. Here's the url to the source. The source has tons of comments about diff algorithms. svn.python.org/view/python/trunk/Lib/…Hazelwood
RFCs are not meant to describe algorithms. They are meant to describe interfaces(/protocols).Actinotherapy
Actually, the core of the diff algorithm, the longest common sub-sequence problem, can be found on Wikipedia. This page gives an overview of the algorithm and sample code that I found helpful when I needed to write a custom diff: en.wikipedia.org/wiki/Longest_common_subsequence_problemFogy
Perhaps this will help: paulbutler.org/archives/a-simple-diff-algorithm-in-php It sure is awesome, and it's so small (only 29 lines altogether; it has 2 functions). It's similar to Stack Overflow's edit revision compare thing.Hardman
VCDIFF is not for human readable diffs. It employs add, copy and run instructions as opposed to the more human readable delete and insert instructions emitted by most plain text diff algorithms. For VCDIFF you need something like the xdelta algortihm described here xmailserver.org/xdfs.pdfLinolinocut
Here is C# implementation. It includes lots of unit tests. As a bonus it has a highlighter, so you can prepare data to show diff in the UI. It also has a web project that you can run and test how it works interactively.Swordsman
My daily reminded of when stackoverflow closes important and popular questions.Killerdiller
some popular ones are enumerated in this config for the popular software Beyond Compare: https://mcmap.net/q/137638/-how-does-the-beyond-compare-diff-algorithm-workGaze
N
199

An O(ND) Difference Algorithm and its Variations (1986, Eugene W. Myers) is a fantastic paper and you may want to start there. It includes pseudo-code and a nice visualization of the graph traversals involved in doing the diff.

Section 4 of the paper introduces some refinements to the algorithm that make it very effective.

Successfully implementing this will leave you with a very useful tool in your toolbox (and probably some excellent experience as well).

Generating the output format you need can sometimes be tricky, but if you have understanding of the algorithm internals, then you should be able to output anything you need. You can also introduce heuristics to affect the output and make certain tradeoffs.

Here is a page that includes a bit of documentation, full source code, and examples of a diff algorithm using the techniques in the aforementioned algorithm.

The source code appears to follow the basic algorithm closely and is easy to read.

There's also a bit on preparing the input, which you may find useful. There's a huge difference in output when you are diffing by character or token (word).

Noctambulous answered 21/8, 2009 at 17:23 Comment(2)
In case the link goes bad, this is Myers 1986; see e.g. citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.6927 -- it further includes a link to the Unix diff paper by Hunt and McIlroy.Thunderhead
The link did indeed go bad, so I've changed it to ^ (where I've verified the paper is still available for download as of August 2022) and added the author and date to the body of the answer as well.Faris
N
36

I would begin by looking at the actual source code for diff, which GNU makes available.

For an understanding of how that source code actually works, the docs in that package reference the papers that inspired it:

The basic algorithm is described in "An O(ND) Difference Algorithm and its Variations", Eugene W. Myers, 'Algorithmica' Vol. 1 No. 2, 1986, pp. 251-266; and in "A File Comparison Program", Webb Miller and Eugene W. Myers, 'Software--Practice and Experience' Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in "Algorithms for Approximate String Matching", E. Ukkonen, `Information and Control' Vol. 64, 1985, pp. 100-118.

Reading the papers then looking at the source code for an implementation should be more than enough to understand how it works.

North answered 30/4, 2009 at 6:41 Comment(4)
Hmmm, in short, sometimes figuring out the underlying algorithm from actual source code (especially if it's optimized to be efficient) can be quite complex. I will be able to understand what the program is doing step by step, but not exactly "why", or a high level overview about that... Example: You'd never understand how regular expressions work (or what they are) by looking at the implementation of Perl's Regexes. Or if you could do that, then I tip my hat, I definitely need a more explained, higher level overview to figure out what's going on.Quits
I never understand how the vast majority of Perl works :-), but there's a link in the package docs (see update) that will point you to the literature describing it (it being the diff algorithm, not Perl).North
Don't read the code. Read the paper.Halakah
Two other sources are mentioned in the docs of that package: - The algorithm was independently discovered as described by Esko Ukkonen in "Algorithms for Approximate String Matching", ‘Information and Control’ Vol. 64, 1985, pp. 100-118, <dx.doi.org/10.1016/S0019-9958(85)80046-2>. - Related algorithms are surveyed by Alfred V. Aho in section 6.3 of "Algorithms for Finding Patterns in Strings", ‘Handbook of Theoretical Computer Science’ (Jan Van Leeuwen, ed.), Vol. A, ‘Algorithms and Complexity’, Elsevier/MIT Press, 1990, pp. 255-300.Rosaliarosalie
P
31

See Diff Match and Patch

"The Diff Match and Patch libraries offer robust algorithms to perform the operations required for synchronizing plain text. ... Currently available in Java, JavaScript, C++, C# and Python"

Also see the Wikipedia Diff page and "Bram Cohen: The diff problem has been solved"

Percentage answered 27/8, 2009 at 4:7 Comment(1)
Just wanted to mention that Cohen's algorithm also seem's to be known as Patience Diff. It's the (default?) diff algorithm in bazaar and an optional one in git.Newmark
M
13

I came here looking for the diff algorithm and afterwards made my own implementation. Sorry I don't know about vcdiff.

Wikipedia: From a longest common subsequence, it's only a small step to get diff-like output: if an item is absent in the subsequence but present in the original, it must have been deleted. (The '–' marks, below.) If it is absent in the subsequence but present in the second sequence, it must have been added in. (The '+' marks.)

Nice animation of the LCS algorithm here.

Link to a fast LCS Ruby implementation here.

My slow and simple Ruby adaptation is below.

def lcs(xs, ys)
  if xs.count > 0 and ys.count > 0
    xe, *xb = xs
    ye, *yb = ys
    if xe == ye
      return [xe] + lcs(xb, yb)
    end
    a = lcs(xs, yb)
    b = lcs(xb, ys)
    return (a.length > b.length) ? a : b
  end
  return []
end

def find_diffs(original, modified, subsequence)
  result = []
  while subsequence.length > 0
    sfirst, *subsequence = subsequence
    while modified.length > 0
      mfirst, *modified = modified
      break if mfirst == sfirst
      result << "+#{mfirst}"
    end
    while original.length > 0
      ofirst, *original = original
      break if ofirst == sfirst
      result << "-#{ofirst}"
    end
    result << "#{sfirst}"
  end
  while modified.length > 0
    mfirst, *modified = modified
    result << "+#{mfirst}"
  end
  while original.length > 0
    ofirst, *original = original
    result << "-#{ofirst}"
  end
  return result
end

def pretty_diff(original, modified)
  subsequence = lcs(modified, original)
  diffs = find_diffs(original, modified, subsequence)

  puts 'ORIG      [' + original.join(', ') + ']'
  puts 'MODIFIED  [' + modified.join(', ') + ']'
  puts 'LCS       [' + subsequence.join(', ') + ']'
  puts 'DIFFS     [' + diffs.join(', ') + ']'
end

pretty_diff("human".scan(/./), "chimpanzee".scan(/./))
# ORIG      [h, u, m, a, n]
# MODIFIED  [c, h, i, m, p, a, n, z, e, e]
# LCS       [h, m, a, n]
# DIFFS     [+c, h, +i, -u, m, +p, a, n, +z, +e, +e]
Monogram answered 28/3, 2013 at 21:23 Comment(0)
W
10

Based on the link Emmelaich gave, there is also a great rundown of diff strategies on Neil Fraser's website (one of the authors of the library).

He covers basic strategies and towards the end of the article progresses to Myers' algorithm and some graph theory.

Whitish answered 8/10, 2009 at 8:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.