compare text and get differences

I want to compare two strings (version one and version two) and get the differences in a format that I can convert to HTML on my own, the way you can view how a post was edited here on Stack Overflow, or the way SVN tracks differences between revisions.

It must be a fully managed code library.

Something like this JavaScript, but I need to do it on the server side.

Peach answered 18/7, 2011 at 11:10 Comment(2)
This question should help: #138831 - Fanni
Exact duplicate of "any decent text diff merge engine for .Net". Voting to close (no offence meant, Petoj; that's the policy for duplicates). - Scalp

I have a class library that does this. I'll post a link below, but I'll also describe how it does its job so that you can evaluate whether it fits your content.

Note that everything I say below treats each character as an element of a collection, so you can implement the algorithm for any type of content: characters of a string, lines of text, or collections of ORM objects.

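The whole algorithm is recursive and revolves around the longest common substring (LCS). Note that this means a contiguous run, not the longest common subsequence that the abbreviation often stands for.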

First the algorithm tries to find the LCS between the two. This will be the longest section that is unchanged/identical between the two versions. The algorithm then considers these two parts to be "aligned".

For instance, here's how two example strings would be aligned:

      This long text has some text in the middle that will be found by LCS
This extra long text has some text in the middle that should be found by LCS
          ^-------- longest common substring --------^
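
To make the alignment step concrete, here is a minimal sketch of a longest-common-substring search using a standard dynamic-programming table. It is written in Python purely as a language-neutral illustration (it is not the code from my library, and the function name `longest_common_substring` is made up for this example). It works on any indexable sequence, so strings and lists of lines behave the same way.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest run of equal,
    contiguous elements shared by sequences a and b.

    Works for any indexable sequences whose elements support ==,
    e.g. two strings or two lists of lines. Length 0 means no overlap.
    """
    best_len = best_a = best_b = 0
    # prev[j + 1] is the length of the common run ending at a[i - 1] and b[j]
    prev = [0] * (len(b) + 1)
    for i, x in enumerate(a):
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b):
            if x == y:
                # Extend the run that ended one element earlier in both inputs
                curr[j + 1] = prev[j] + 1
                if curr[j + 1] > best_len:
                    best_len = curr[j + 1]
                    best_a = i - best_len + 1
                    best_b = j - best_len + 1
        prev = curr
    return best_a, best_b, best_len


old = "This long text has some text in the middle that will be found by LCS"
new = "This extra long text has some text in the middle that should be found by LCS"
start_old, start_new, length = longest_common_substring(old, new)
print(repr(old[start_old:start_old + length]))
# prints ' long text has some text in the middle that '
```

The match found here includes the space in front of "long", which is what the marker line above spans. This table-based search costs time proportional to the product of the two lengths, so it is only meant to show the idea.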

Then it recursively applies itself to the portion before the aligned section and to the portion after it.

The final "result" could look like this (I'm using the underscore to indicate portions "not there" in one of the strings):

This ______long text has some text in the middle that ______will be found by LCS
This extra long text has some text in the middle that should____ be found by LCS

Then, as part of the recursive approach, each level of the recursion returns a collection of "operations". Depending on whether there is an LCS or a portion missing from either side, the operations are:

  • If LCS, then it is a "copy" operation
  • If missing from first, then it is an "insert" operation
  • If missing from second, then it is a "delete" operation

So the above text would be:

  1. Copy 5 characters (This_)
  2. Insert extra_ (the underscore stands for a space, since code blocks here apparently strip it)
  3. Copy 43 characters (long text has some text in the middle that_)
  4. Insert should
  5. Delete 4 characters (will)
  6. Copy 16 characters (_be found by LCS)
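
The recursion and the three operations can be sketched roughly as follows. Again this is an illustrative Python sketch, not my library's code: the names `diff` and `longest_common_substring` and the `(operation, content)` tuple format are assumptions made for the example.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest shared contiguous run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i, x in enumerate(a):
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b):
            if x == y:
                curr[j + 1] = prev[j] + 1
                if curr[j + 1] > best[2]:
                    n = curr[j + 1]
                    best = (i - n + 1, j - n + 1, n)
        prev = curr
    return best


def diff(a, b):
    """Return a list of (operation, content) tuples that turn a into b."""
    if not a and not b:
        return []
    start_a, start_b, length = longest_common_substring(a, b)
    if length == 0:
        # Nothing in common: whatever is left is inserted and/or deleted
        ops = []
        if b:
            ops.append(("insert", b))
        if a:
            ops.append(("delete", a))
        return ops
    # Align on the common section, then recurse on both sides of it
    return (
        diff(a[:start_a], b[:start_b])
        + [("copy", a[start_a:start_a + length])]
        + diff(a[start_a + length:], b[start_b + length:])
    )


old = "This long text has some text in the middle that will be found by LCS"
new = "This extra long text has some text in the middle that should be found by LCS"
for op, text in diff(old, new):
    print(f"{op:6} {text!r}")
```

This bare-bones version is greedier than the hand-written list above: it also aligns the single "l" shared by "will" and "should", so that part comes out as several tiny operations. A practical implementation would probably want some minimum match length before treating a section as aligned.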

The core of the algorithm is quite simple, and with the above text, you should be able to implement it yourself, if you want to.
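
Since the question asks for something that can be converted to HTML, here is a small sketch of that last step. It assumes the operations arrive as `(operation, text)` tuples, which is a format invented for this example, and it wraps inserted and deleted text in the standard `<ins>` and `<del>` elements.

```python
from html import escape

# Map each operation to the HTML element that wraps it; copies stay unwrapped
TAGS = {"insert": "ins", "delete": "del"}


def render_html(ops):
    """Turn a list of (operation, text) tuples into an HTML fragment."""
    parts = []
    for op, text in ops:
        safe = escape(text)  # never emit user text unescaped
        tag = TAGS.get(op)
        parts.append(f"<{tag}>{safe}</{tag}>" if tag else safe)
    return "".join(parts)


# The six operations from the list above, with the spaces restored
ops = [
    ("copy", "This "),
    ("insert", "extra "),
    ("copy", "long text has some text in the middle that "),
    ("insert", "should"),
    ("delete", "will"),
    ("copy", " be found by LCS"),
]
print(render_html(ops))
# prints: This <ins>extra </ins>long text has some text in the middle that <ins>should</ins><del>will</del> be found by LCS
```

How the result looks (colours, strike-through and so on) is then just a matter of styling the two elements.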

My class library has some extra features, in particular for handling content that is similar to the changed text, so that you get not just delete and insert operations but also modify operations. This mostly matters when you're comparing a list of something, like lines from text files.
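
To give a rough idea of what a "modify" operation could mean for lines of text, here is one possible sketch. This is not how my library decides similarity. It is an assumption for illustration only: it pairs an adjacent delete and insert when the two lines are similar enough according to `SequenceMatcher` from Python's standard `difflib` module (unrelated to my DiffLib despite the name). The function name `pair_up_modifications` and the `threshold` parameter are made up.

```python
from difflib import SequenceMatcher


def pair_up_modifications(ops, threshold=0.6):
    """Merge adjacent delete/insert pairs of similar lines into 'modify' ops.

    ops is a list of (operation, line) tuples, one line per operation.
    A merged pair becomes ("modify", (old_line, new_line)).
    """
    result = []
    i = 0
    while i < len(ops):
        if i + 1 < len(ops):
            (op1, line1), (op2, line2) = ops[i], ops[i + 1]
            if {op1, op2} == {"insert", "delete"}:
                old, new = (line1, line2) if op1 == "delete" else (line2, line1)
                # ratio() is 1.0 for identical lines, 0.0 for nothing in common
                if SequenceMatcher(None, old, new).ratio() >= threshold:
                    result.append(("modify", (old, new)))
                    i += 2
                    continue
        result.append(ops[i])
        i += 1
    return result


line_ops = [
    ("copy", "total = a + b"),
    ("delete", "print(total)"),
    ("insert", "print(total + 1)"),
    ("insert", "# done"),
]
for op in pair_up_modifications(line_ops):
    print(op)
```

Here the two `print` lines are close enough to be reported as one modification, while the unrelated new comment line stays a plain insert.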

The class library can be found here: DiffLib on GitHub. You will also find it on NuGet for easy installation in Visual Studio 2010. It is written in C# for .NET 3.5 and up, so it works with .NET 3.5 and 4.0, and since it is a binary release (all source code is on GitHub, though), you can use it from VB.NET as well.

Muscolo answered 18/7, 2011 at 11:25 Comment(1)
Note that my class library does not implement patch generation or merge logic. If you need that, you'll have to look elsewhere. You didn't mention needing any such code, but I thought I should mention it so that you don't waste your time barking up the wrong tree (if it is the wrong tree, that is). - Muscolo

Google has something similar, and it is available in C#, but I have not looked at it in any depth. The demo looks pretty cool, though.

http://code.google.com/p/google-diff-match-patch/

Frisbee answered 18/7, 2011 at 11:30 Comment(1)
To be honest, I would expect this solution to perform and behave better than my implementation, at least for text, and it also has patch logic. - Muscolo