Does operational transformation work on structured documents such as HTML if simply treated as plain text?

The FAQ of Google Wave Protocol says that [HTML] "does not have desirable properties" and that "HTML makes OT (Operational Transforms) difficult if not impossible" [1]. Why is this so? What problems arise if HTML is treated simply as plain text and then OT applied?

  1. http://www.waveprotocol.org/faq#TOC-What-s-the-XML-schema-for-waves-Why
Broncobuster answered 14/4, 2012 at 0:12 Comment(0)

I'm assuming here that you understand the basics of OT. The principal problem with doing OT on HTML as plain text is merging HTML tags. As a simple example, say we had a document as follows:

Hello world

Alice then decides that world should be in bold:

Hello <b>world</b>

This can be represented with a double insert operation in OT, schematically:

Edit A: Keep 6 : Insert "<b>" : Keep 5 : Insert "</b>"

If Bob decided that 'world' should be italic before he saw Alice's edit, he would add the operation

Edit B: Keep 6 : Insert "<i>" : Keep 5 : Insert "</i>"

If the server received Bob's edit after Alice's, it would need to transform B against A to become B'.

The Keep statements are unchanged through transformation, but Insert "<i>" transformed over Insert "<b>" can become either Keep 3 : Insert "<i>" or Insert "<i>" : Keep 3 (and likewise with Keep 4 for the four-character "</b>"). Usually the server will be configured to place the later edit after the first edit.

Edit B': Keep 6 : Keep 3 : Insert "<i>" : Keep 5 : Keep 4 : Insert "</i>"

Here the problem becomes obvious. Applying A then B' to the original string gives invalid HTML, because the tags are misnested:

Hello <b><i>world</b></i>
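To make the walk-through concrete, here is a minimal sketch in Python. It assumes insert-only operations and a tie-break where the earlier edit's insert lands first; the names `apply_op`, `inserts` and `transform` are made up for this illustration and are not taken from any OT library.

```python
# Ops are lists of ("keep", n) / ("insert", text); a trailing keep is implicit.
A = [("keep", 6), ("insert", "<b>"), ("keep", 5), ("insert", "</b>")]
B = [("keep", 6), ("insert", "<i>"), ("keep", 5), ("insert", "</i>")]


def apply_op(doc, op):
    """Apply a keep/insert operation to a string."""
    out, pos = [], 0
    for kind, arg in op:
        if kind == "keep":
            out.append(doc[pos:pos + arg])
            pos += arg
        else:  # insert
            out.append(arg)
    out.append(doc[pos:])  # implicit keep of whatever is left
    return "".join(out)


def inserts(op):
    """Flatten an op into (position, text) pairs on the base document."""
    pos, found = 0, []
    for kind, arg in op:
        if kind == "keep":
            pos += arg
        else:
            found.append((pos, arg))
    return found


def transform(b, a):
    """Rewrite b so it applies after a (assumed rule: on ties, a goes first)."""
    a_ins = inserts(a)
    out, cursor = [], 0
    for pos, text in inserts(b):
        # Skip over everything a inserted at or before this position.
        target = pos + sum(len(t) for p, t in a_ins if p <= pos)
        if target > cursor:
            out.append(("keep", target - cursor))
        out.append(("insert", text))
        cursor = target
    return out


doc = "Hello world"
b_prime = transform(B, A)
result = apply_op(apply_op(doc, A), b_prime)
print(b_prime)  # [('keep', 9), ('insert', '<i>'), ('keep', 9), ('insert', '</i>')]
print(result)   # Hello <b><i>world</b></i>
```

The two Keep 9 steps are just Keep 6 + Keep 3 and Keep 5 + Keep 4 merged. The transform is perfectly correct as a text transform; it simply has no idea that the characters it is shuffling are tags.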

Theoretically this could be solved by varying pre and post inserts, but this would get hairy for more complicated examples, potentially involving a full document scan for every transformation.

As the other answer noted, this mess can be avoided by using out-of-band annotations plus plain text. Another approach, which I have only seen so far in academic papers, is to treat the XML structure as a tree with OT operations for node addition, deletion and so on, e.g.:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.74
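For the annotation route, a rough sketch of the idea in Python: the text holds no tags at all, and formatting lives in separate ranges that are shifted when text is inserted. The `Annotation` type and the boundary rule (an insert exactly at the start pushes the range right, an insert exactly at the end does not extend it) are assumptions made for this example, not the Wave schema.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int  # inclusive offset into the plain text
    end: int    # exclusive offset
    style: str


def shift_for_insert(ann, pos, length):
    """Adjust one annotation for `length` characters inserted at `pos`."""
    start = ann.start + length if pos <= ann.start else ann.start
    end = ann.end + length if pos < ann.end else ann.end
    return Annotation(start, end, ann.style)


text = "Hello world"
# Alice's bold and Bob's italic are two independent ranges over "world".
formats = [Annotation(6, 11, "b"), Annotation(6, 11, "i")]

# A third, concurrent edit inserts "big " at offset 6.
new_text = text[:6] + "big " + text[6:]
shifted = [shift_for_insert(a, 6, 4) for a in formats]

for a in shifted:
    print(a.style, new_text[a.start:a.end])  # b world / i world
```

Because the two formats never touch the character stream, their relative order does not matter and there is nothing to misnest.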

Bronez answered 16/9, 2012 at 8:28 Comment(1)
I guess the fundamental issue here is that with simultaneous inserts, the end result can always be semantically incorrect, but in the case of XML/HTML, the end result can also be syntactically incorrect. The use of annotations does nothing to alleviate the semantic inconsistency, but it ensures that the transformation will produce valid XML/HTML and as such can always be nicely rendered. Thanks. (Gur)

I don't have a complete answer but I'm interested in seeing more work done on making existing open source operational transformation libraries work with rich text, so I'll contribute what I know.

The important difference between HTML and the Wave schema seems to be the way text formatting is marked up: a hierarchy of nested tags for HTML vs. out-of-band annotations (in the footer of the document) with ranges for Wave XML. Out-of-band annotations are probably a more natural way to mark up text formatting, since they allow overlapping (non-nested) formats. They allow something like this (in pseudo-markup), which would not be valid XML using the nested representation:

(b) This is bold (i) while this range is both bold and italic (/b) and this last bit is just italic (/i)
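As a sketch of how overlapping ranges can still be rendered as well-nested HTML, one option is to cut the text at every range boundary and wrap each piece separately. This is only one possible rendering strategy, and the `render` helper and the `(start, end, tag)` tuples are hypothetical names for this example.

```python
import html


def render(text, annotations):
    """Render (start, end, tag) ranges as well-nested HTML.

    Assumes every range lies within the bounds of `text`.
    """
    # Cut the text at every annotation boundary.
    cuts = sorted({0, len(text)} | {p for s, e, _ in annotations for p in (s, e)})
    parts = []
    for left, right in zip(cuts, cuts[1:]):
        # Tags whose range fully covers this segment, in a fixed order.
        tags = sorted(t for s, e, t in annotations if s <= left and right <= e)
        opening = "".join(f"<{t}>" for t in tags)
        closing = "".join(f"</{t}>" for t in reversed(tags))
        parts.append(opening + html.escape(text[left:right]) + closing)
    return "".join(parts)


# Bold covers the first two thirds, italic the last two thirds.
print(render("AAABBBCCC", [(0, 6, "b"), (3, 9, "i")]))
# <b>AAA</b><b><i>BBB</i></b><i>CCC</i>
```

Each segment opens and closes its own tags, so the output is always nested correctly even though the underlying ranges overlap.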

Related, here is the relevant issue in the ShareJS project. Perhaps they can implement rich text support by adopting part of the Wave XML schema.

Haden answered 14/4, 2012 at 0:12 Comment(0)

There are approaches in OT that support SGML (a superset of XML), but there are no implementations. So it is not impossible! Though I agree that OT is not the best approach for XML, because OT was designed for linear data structures. HTML/XML is much more complex: it has attributes, and it is built like a tree. The fact that it is a tree is solvable, but the attributes, which are realized as an ordered associative array, are not, simply because associative arrays are not supported by OT (at the moment). The approach above actually recommends treating the attributes as a string, e.g. "id='myid' value='mystuff'". But you can easily break the whole syntax of your attribute string when one user deletes all attributes and another inserts a " character directly after "mystuff". This could result in a div tag that looks like <div ">, which is not valid syntax.
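A small Python sketch of that attribute-string failure, assuming simple position-based insert and delete operations. The helper name and the rule that an insert falling inside a deleted range collapses to the start of that range are assumptions for illustration, not a specific library's behaviour.

```python
def transform_insert_against_delete(ins_pos, del_start, del_len):
    """Where an insert at ins_pos lands once a concurrent delete has applied."""
    if ins_pos <= del_start:
        return ins_pos
    if ins_pos >= del_start + del_len:
        return ins_pos - del_len
    return del_start  # the insert point was deleted: collapse to the gap


attrs = "id='myid' value='mystuff'"

# User 1 deletes the whole attribute string.
del_start, del_len = 0, len(attrs)
# User 2 inserts a double quote directly after "mystuff".
ins_pos = attrs.index("mystuff") + len("mystuff")

after_delete = attrs[:del_start] + attrs[del_start + del_len:]
new_pos = transform_insert_against_delete(ins_pos, del_start, del_len)
merged = after_delete[:new_pos] + '"' + after_delete[new_pos:]

print(f"<div {merged}>")  # <div ">
```

Both edits were valid on their own; the stray quote only becomes a syntax error after the merge, which plain-text OT has no way to notice.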

Maybe this interests you:

CEFX is a project that aimed to support XML. It is dead to my knowledge, but it uses an OT approach. For some reason it is not possible to edit strings, only XML elements.

Google's Drive SDK supports graph-like data structures. It is, however, proprietary and nobody knows how it works.

I am developing a framework that supports arbitrary data structures. Currently, text, JSON, XML, and HTML are supported. It takes a different approach; check it out: Yatta!

BTW: What the Wave protocol and Eric Drechsel describe is known as annotations in OT. It is commonly used to support rich text.

Yellowbird answered 25/8, 2014 at 1:6 Comment(0)
