Can I use a plaintext diff algorithm for tracking XML changes?

Asked 12/3, 2010 at 2:16 Answered 12/3, 2010 at 9:15

I'm working in Flex/AS3 on (for simplicity) an XML editor. I need to provide undo/redo functionality.

Of course, one solution is to store the entire source text with each edit. However, to conserve memory, I'd like to store the diffs instead (these diffs will also be used to transmit updates to the server for auto-saving).

My question is - can I use a plaintext diff algorithm for tracking these XML changes?

My research on the internet indicates that I cannot do so. However, I'm obviously missing something. Plaintext diff provides functionality that is purportedly:

diff(text, text') -> diffs
patch(text, diffs) -> text'

XML is simply text, so why can't I just use diff() and patch() to transform the text reliably?

For example: Let's say that I'm a poet. When I write poetry, I use lots of funky punctuation... You know, like <, /, and >. (You might see where I'm going with this...) If I'm writing my poetry in an application that uses diffs to provide undo/redo functionality, does my poetry become garbled when I undo/redo my edits? It's just text! Why does it make a difference to the algorithm?

I obviously don't get something here...Thanks for explaining! :)

UPDATE:

Some discussion I've encountered regarding diffing XML with a plaintext algorithm:

Also, I understand that a Command pattern is likely a better way to implement Undo/Redo. I've simplified my use case for the sake of simplicity, and I do still think that XML diffing is the best approach.

Mithridatism answered 12/3, 2010 at 2:16 Comment(6)

Please point to what makes you think you cannot use this approach. The only reason I see on brief consideration would be if you were attempting partial or out of order undo/redo. I can think of more concise approaches but that's another question. – Pelite 12/3, 2010 at 2:31

@rinogo: both of the links you posted are about comparing HTML. If you're comparing well-formed XML, then it's a different story, as the diff tool can make assumptions. – Mimas 12/3, 2010 at 3:8

John, I think you're onto something. Come to think of it, everything that I've read seems to suggest that problems crop up when the HTML/XML/whatever is malformed. Thankfully, my XML is well-formed. However, that still begs the question: Why exactly does malformed XML cause problems with diff/patch? Diff/patch knows nothing about XML/HTML/any format specification... ? – Mithridatism 12/3, 2010 at 3:11

P.S. If someone can provide a specific example if XML (or even malformed HTML) will fail somehow in a diff/patch scenario, I'd be ecstatic! The code.google.com page provides a half-example, but it's too vague to make much sense to me. Thanks! – Mithridatism 12/3, 2010 at 3:18

@rinogo: most programs I've ever written to process XML would throw an exception on malformed XML. Does that count as "failing in a diff/patch scenario"? – Mimas 12/3, 2010 at 3:29

John, if the diff/patch results in incorrect malformed XML, then yes, this counts as a failure. However, my understanding of diffing/patching is that the text transformation is always performed exactly as specified. (In other words, it does not spontaneously introduce invalid XML) – Mithridatism 12/3, 2010 at 8:26

I'm the author of the plain-text diff/match/patch library from Google.

The key question is whether your patches are exact. In an ideal world:

  diff(old_text, new_text) -> edits
  patch(edits, old_text) -> new_text

Notice that the base text (old_text) is the same in both operations. In this ideal case, then a simple plain-text diff and patch will work perfectly, regardless of the type of the content. If this case applies to you, then you're done.

The issue lies with fuzzy patching. Here's the corresponding example:

  diff(old_text, new_text) -> edits
  patch(edits, old_forked_text) -> new_forked_text

Notice that the base text is not the same in both operations. They should be similar, but the patch operation now has to use "judgement" about what it should do. Some patches may fit perfectly as specified in the edit, others may need to be tweaked for position, others may need to be tweaked for altered context, others may not fit at all and should be dropped. If your patching algorithm is not aware of the structure of XML when making its decisions, you may very well end up with malfromed XML. Here's a sample:

  old_text = Jabberwock<SPAN>Hello<SPAN>World</SPAN></SPAN>
  new_text = Jabberwock<DIV>Hello<SPAN>World</SPAN></DIV>
  diff(old_text, new_text) -> edits
  edits = ["SPAN" -> "DIV" @ character 11,
           "SPAN" -> "DIV" @ character 41]
  old_forked_text = <SPAN>Hello<SPAN>World</SPAN></SPAN>
  patch(edits, old_forked_text) -> new_forked_text
  new_forked_text = <SPAN>Hello<DIV>World</SPAN></DIV>

Let's look at this one carefully. The original diff returned two edits, change the outermost SPAN to a DIV. Simple change. Unfortunately the text this edit is being applied to has changed from the original. The word "Jabberwock" has been removed. Now the first SPAN->DIV change matches up with the second SPAN tag, not the first one. Since the patch algorithm is not aware of the rules of XML, it results in illegally nested tags.

There are some hacks which allow you to guarantee valid XML when using a plain-text patch, but they result in some loss of flexibility (the original question already has a link to the wiki page I wrote about this). The ultimate solution for patching XML is of course to use an XML-aware diff and patch algorithm. These are significantly more complicated and expensive, but they exist. Google the names Tancred Lindholm and Sebastian Rönnau for the great work that they have done in the XML field (particularly with regards to DocEng).

Let me know if there is anything else I can add.

-- Neil Fraser

Dagnydago answered 12/3, 2010 at 9:15 Comment(3)

Brilliant, Neil, just Brilliant! Thank you, my friend, for registering here just to answer this question. Have a greet weekend! -RIch – Mithridatism 12/3, 2010 at 17:1

Would diff/match/patch possibly work with some XML-specific subset of something like the ESIS output from nsgmls? jclark.com/sp/sgmlsout.htm – Poyssick 6/1, 2014 at 20:18

This is an old, old question, but thank you for your long answer back then. You refer to a link in the original question - I believe this is the corresponding GitHub link, right? github.com/google/diff-match-patch/wiki/… Do you have a suggestion for a (fuzzy) patching technique for applying, and re-applying, your own set of updates to a program that is being regularly updated by others. I.e., working with a "moving target", and where the basic unit of change is one or more complete lines of text, never less. And no final merge is ever expected. – Lindeman 26/3, 2018 at 21:22

I use Beyond Compare all the time to compare XML documents. It understands XML, to a certain degree.

You may need to pre-process the two documents in order for textual comparison to do the best job possible. For instance, in some XML documents, the order of some elements may not matter. It will certainly matter to your diff tool! You may need to pre-process the XML using an XML Transform that sorts these elements into a common order in both files, before comparing the two sorted files.

You're also going to want to use the same indentation for both documents. I find it useful to start each element on a new line, and to use the same amount of indentation, with spaces, for each level. If your document gets very deep, you would want to use only one or two spaces per level, so that the compare fits on the screen. You may even want to use one attribute per line (and to sort the attributes into a common order).

Mimas answered 12/3, 2010 at 2:31 Comment(0)

If you are the sole "owner" of the data between your undo/redo points then of course you can use plaintext diff for them. As you point out, it amounts to a set of transformations.

Depending on the operations you provide, however, plaintext diff may not be remotely near optimal for recording undo/redo and you may need to specialise certain cases. Imagine just recording a ReplaceAll command which might be only a few bytes overhead plus the search and replace string. That could generate massive plaintext diffs.

In the broader context, if you allow external editing of these documents, and you're thinking more about how to store deltas on the server, you're mimicking git or other version control systems. You have to use some kind of diff algorithm because just recording your commands is obviously not the only source of transformation. At this point you're starting to mix undo/redo with version control and you may want to think hard about confusing those concepts for your users.

I would keep undo/redo as within an editing session and ban external editing whilst the file is open. That allows you to optimise your command recording for broad cases as I said above.

Beyond that, either use conventional version control (consider wrapping git) or implement your own way of coping with files being changed outside your editor.

Sidereal answered 12/3, 2010 at 2:43 Comment(0)

-1

I think you can use text diff for xml especially in your case where human being will write the xml line by line. I don't know what information you got saying you cannot do that but I guess that statement was based on the fact that space characters (space, tab, newline ...) are somewhat different that they are in a plain text file, which could result in two different text files are identical from a XML perspective. But again, for an editor targeting human being, I don't see why you cannot.

Shrunken answered 12/3, 2010 at 2:27 Comment(0)

Recommended topics

Hot tags