(Finite State Machine) - Implementing an XML Schema validator in JavaScript

I have been working on a project for about a month to develop an XML Schema (XSD) validator in JavaScript. I have gotten really close but keep running into problems.

The only part that works well is normalizing schema structures into an FSA that I store in the DOM. I have tried several methods of validating my XML structures against the FSA and have come up short each time.

The validator is used to drive a client-side WYSIWYG XML editor, so it has to meet the following requirements:

  • Must be efficient (< 15 ms to validate an element's child node pattern, even with complex models).
  • Must expose a Post-Schema-Validation Infoset (PSVI) that can be queried to determine which elements can be inserted into or removed from the document at various points while keeping the document valid.
  • Must be able to validate an XML child node structure and, if it is invalid, report what content was EXPECTED or what content is UNEXPECTED.
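
To make the last two requirements concrete, here is a minimal sketch of a table-driven walk over a deterministic automaton. Everything in it is an assumption made for illustration: the `dfa` object shape, the function names, and the toy content model (e1 followed by one or more of e2 or e3) are hypothetical and not taken from the project. It reports the expected or unexpected content on failure, and answers the editor query "what could be inserted here?" by trying each candidate.

```javascript
// Hypothetical DFA for the content model: e1, then one or more of (e2 | e3).
// In practice the table could be precomputed on the server.
const dfa = {
  start: 0,
  accepting: new Set([2]),
  transitions: {
    0: { e1: 1 },
    1: { e2: 2, e3: 2 },
    2: { e2: 2, e3: 2 },
  },
};

// Follow one transition; returns the next state, or null if none exists.
function step(dfa, state, name) {
  const row = dfa.transitions[state];
  return row && Object.prototype.hasOwnProperty.call(row, name) ? row[name] : null;
}

// Element names that are allowed next in the given state.
function expectedAt(dfa, state) {
  return Object.keys(dfa.transitions[state] || {});
}

// Validate a list of child element names and explain any failure.
function validateChildren(dfa, childNames) {
  let state = dfa.start;
  for (let i = 0; i < childNames.length; i++) {
    const next = step(dfa, state, childNames[i]);
    if (next === null) {
      return { valid: false, index: i, unexpected: childNames[i], expected: expectedAt(dfa, state) };
    }
    state = next;
  }
  if (!dfa.accepting.has(state)) {
    // Input ended too early: something more was required.
    return { valid: false, index: childNames.length, unexpected: null, expected: expectedAt(dfa, state) };
  }
  return { valid: true };
}

// Names that can be inserted at `index` while keeping the whole list valid.
function insertableAt(dfa, childNames, index) {
  let state = dfa.start;
  for (let i = 0; i < index; i++) {
    state = step(dfa, state, childNames[i]);
    if (state === null) return []; // the prefix itself is already invalid
  }
  return expectedAt(dfa, state).filter((name) => {
    const candidate = [...childNames.slice(0, index), name, ...childNames.slice(index)];
    return validateChildren(dfa, candidate).valid;
  });
}

console.log(validateChildren(dfa, ['e1']));     // invalid: e2 or e3 expected
console.log(insertableAt(dfa, ['e1', 'e2'], 1)); // names insertable after e1
```

Each validation is a single pass over the children with one table lookup per child, so the cost grows linearly with the number of children rather than with the nesting depth of the schema.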

More info: consider the following example.
First, I convert schema structures to a general FSA representation, normalizing out things like xs:group and xs:import with respect to namespaces. For instance, consider:

<xs:group name="group1">
    <xs:choice minOccurs="2">
         <xs:element name="e2" maxOccurs="3"/>
         <xs:element name="e3"/>
    </xs:choice>
</xs:group>
<xs:complexType>
    <xs:sequence>
        <xs:element name="e1"/>
        <xs:group ref="group1"/>
    </xs:sequence>
</xs:complexType>

This would be converted into a generalized structure like the following:

<seq>
    <e name="e1"/>
    <choice minOccurs="2">
         <e name="e2" maxOccurs="3"/>
         <e name="e3"/>
    </choice>
</seq>

I do all of this server-side with XQuery and XSLT.

My first attempt at building a validator used recursive functions in JavaScript. Along the way, whenever I found content that could exist, I added it to a global PSVI to signal that it could be added at a specified point in the hierarchy.

My second attempt was iterative and much faster, but both attempts suffered from the same problem: they could correctly validate simple content models, but failed as soon as the models became more complex and deeply nested.

I suspect that I am approaching this problem from completely the wrong direction. From what I have read, most FSAs are processed by pushing states onto a stack, but I am not sure how to do this in my situation.

I need advice on the following questions:

  1. Is a state machine the right solution here? Will it accomplish the goals stated at the top?
  2. If I use a state machine, what is the best method to convert the schema structure to a DFA? Thompson's algorithm? Do I need to optimize the DFA for this to work?
  3. What is the best (or most efficient) way to implement all of this in JavaScript? (Note that optimizations and pre-processing can be done on the server.)
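
As a rough illustration of question 2, the following sketch builds an automaton with a Thompson-style construction and then simulates it by tracking the set of reachable states, which amounts to building the DFA lazily. It is a sketch under stated assumptions, not a definitive implementation: the node shapes (`elem`, `seq`, `choice`, `opt`, `star`) and all function names are hypothetical, and it assumes numeric minOccurs/maxOccurs have already been rewritten into `opt` and `star` nodes.

```javascript
// Thompson-style construction: every node becomes a fragment with one start
// state and one end state, glued together with epsilon transitions.
function buildNfa(tree) {
  const states = []; // each state: { eps: [targets], name: label or null, target: number }
  const newState = () => states.push({ eps: [], name: null, target: -1 }) - 1;

  function build(node) {
    const start = newState();
    const end = newState();
    if (node.type === 'elem') {
      states[start].name = node.name; // single labelled transition start -> end
      states[start].target = end;
    } else if (node.type === 'seq') {
      let prev = start;
      for (const child of node.children) {
        const f = build(child);
        states[prev].eps.push(f.start);
        prev = f.end;
      }
      states[prev].eps.push(end);
    } else if (node.type === 'choice') {
      for (const child of node.children) {
        const f = build(child);
        states[start].eps.push(f.start);
        states[f.end].eps.push(end);
      }
    } else { // 'opt' or 'star'
      const f = build(node.child);
      states[start].eps.push(f.start, end); // zero occurrences are allowed
      states[f.end].eps.push(end);
      if (node.type === 'star') states[f.end].eps.push(f.start); // loop back
    }
    return { start, end };
  }

  const { start, end } = build(tree);
  return { states, start, accept: end };
}

// All states reachable from `seed` through epsilon transitions only.
function closure(nfa, seed) {
  const seen = new Set(seed);
  const stack = [...seed];
  while (stack.length) {
    for (const t of nfa.states[stack.pop()].eps) {
      if (!seen.has(t)) { seen.add(t); stack.push(t); }
    }
  }
  return seen;
}

// Element names that can be consumed from the current state set.
function expectedNames(nfa, current) {
  const names = new Set();
  for (const s of current) {
    if (nfa.states[s].name !== null) names.add(nfa.states[s].name);
  }
  return [...names].sort();
}

function validate(nfa, childNames) {
  let current = closure(nfa, [nfa.start]);
  for (let i = 0; i < childNames.length; i++) {
    const next = [];
    for (const s of current) {
      if (nfa.states[s].name === childNames[i]) next.push(nfa.states[s].target);
    }
    if (next.length === 0) {
      return { valid: false, index: i, unexpected: childNames[i], expected: expectedNames(nfa, current) };
    }
    current = closure(nfa, next);
  }
  const expected = expectedNames(nfa, current);
  if (!current.has(nfa.accept)) {
    return { valid: false, index: childNames.length, unexpected: null, expected };
  }
  return { valid: true, expected }; // `expected` = what could still be appended
}

// Hypothetical model: e1 followed by zero or more of (e2 | e3).
const model = { type: 'seq', children: [
  { type: 'elem', name: 'e1' },
  { type: 'star', child: { type: 'choice', children: [
    { type: 'elem', name: 'e2' }, { type: 'elem', name: 'e3' } ] } },
] };
const nfa = buildNfa(model);
console.log(validate(nfa, ['e1', 'e2', 'e3']));
console.log(validate(nfa, ['e2']));
```

Because the state sets depend only on the schema, they could also be enumerated ahead of time on the server (the subset construction) and shipped to the client as a plain transition table, leaving only table lookups to do in the browser.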

Thanks,

Casey

Additional Edits:

I have been looking at the tutorial here: http://www.codeproject.com/KB/recipes/OwnRegExpressionsParser.aspx. It is about building a parser for regular expressions, which seems very similar to what I need, and it brings up some interesting thoughts.

I am thinking that XML Schema breaks down into only a few operators:

sequence -> Concatenation
choice -> Union
minOccurs/maxOccurs -> Probably needs more than Kleene closure; I am not sure of the best way to represent this operator.
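
One common way to handle the third operator is to unroll the bounds into the other operators: the particle is repeated minOccurs times, followed by either a Kleene star (for an unbounded maximum) or nested optionals for the remaining occurrences. The sketch below assumes a hypothetical node shape (`type`, `name` or `children`, optional `min` and `max`, with `Infinity` standing in for "unbounded"); none of these names come from the project. The nested form `(x (x)?)?` is used instead of the flat `x? x?` because the flat form is not deterministic.

```javascript
// Copy of a node without its occurrence bounds, with children expanded.
function core(node) {
  if (node.type === 'elem') return { type: 'elem', name: node.name };
  return { type: node.type, children: node.children.map(expandOccurs) };
}

// Rewrite min/max bounds using only seq, choice, opt and star.
function expandOccurs(node) {
  const min = node.min === undefined ? 1 : node.min; // XSD default is 1
  const max = node.max === undefined ? 1 : node.max; // XSD default is 1
  const base = core(node);
  const parts = [];
  for (let i = 0; i < min; i++) parts.push(base); // required occurrences
  if (max === Infinity) {
    parts.push({ type: 'star', child: base });
  } else {
    // (max - min) optional occurrences, nested: (x (x (x)?)?)?
    let tail = null;
    for (let i = 0; i < max - min; i++) {
      const inner = tail ? { type: 'seq', children: [base, tail] } : base;
      tail = { type: 'opt', child: inner };
    }
    if (tail) parts.push(tail);
  }
  return parts.length === 1 ? parts[0] : { type: 'seq', children: parts };
}

// Render the expanded tree in a regex-like notation for inspection.
function toPattern(n) {
  switch (n.type) {
    case 'elem': return n.name;
    case 'seq': return '(' + n.children.map(toPattern).join(' ') + ')';
    case 'choice': return '(' + n.children.map(toPattern).join(' | ') + ')';
    case 'opt': return toPattern(n.child) + '?';
    case 'star': return toPattern(n.child) + '*';
    default: throw new Error('unknown node type: ' + n.type);
  }
}

console.log(toPattern(expandOccurs({ type: 'elem', name: 'e2', min: 1, max: 3 })));
// prints "(e2 (e2 e2?)?)"
console.log(toPattern(expandOccurs({ type: 'elem', name: 'e3', min: 2, max: Infinity })));
// prints "(e3 e3 e3*)"
```

The trade-off is size: the expanded tree grows with the numeric bounds, so very large maxOccurs values may be better served by a counter attached to the state rather than by unrolling.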

Stichous answered 5/8, 2010 at 20:6 Comment(4)
Welcome to SO. This is much too broad a question for SO. I suggest you approach this from the bottom up. Tell us what you mean by "come short each time" and show what code you have written that you think is the problem. Do this by editing your post, not by creating a new question. - Chuppah
Thanks, I tried to add more information. It's tough to convey it all via this medium, though. I could post the code, but it's several hundred lines long, and I am apt to believe that it's my approach that is the problem. I need help formulating a strategy to attack this problem from a solid angle. Thanks! - Stichous
Doing this using a stack or recursively is really equivalent, though I can believe that the stack may perhaps be faster. The most important question is really why you think you need a non-deterministic state machine. Non-deterministic schemas are illegal; you may want to read up on the "unique particle attribution" rule. - Godfree
That was a mistake on my part; I am aware that schemas must be deterministic. I have edited my post. I think the big problem here is that I am not an expert on state machines. I am having trouble seeing the solution. I need some advice on how to start putting these pieces together. Hopefully my clarification helps. - Stichous

When I was going through the same learning process I found that I needed to spend some time studying books on compiler-writing (for example Aho & Ullman). The construction of a finite state machine to implement a grammar is standard textbook stuff; it's not easy or intuitive, but it is thoroughly described in the literature - except perhaps for numeric minOccurs/maxOccurs, which don't occur in typical BNF language grammars, but are well covered by Thompson and Tobin.

Lorgnon answered 9/8, 2010 at 19:0 Comment(1)
Micheal, funny you mention that; I am in the midst of sorting through Aho & Ullman's book at the moment. I have to admit, though, it's not my forte. That being said, would you consider the goals I have outlined at the top to be reasonable given the algorithms described by Thompson & Tobin? I just want to make sure I am not heading down the wrong path here. I need to be able to know more about the validation process than just pass/fail. Thanks, Casey - Stichous
