Steps and involvement of implementing a parser (in .Net - and in this case XPath 2.0)

H

3

6

In the lack of any good free XPath 2.0 implementations for .Net build upon Linq to XML I have thought about implementing my own (also for the experience). But just to be clear (and not building something that exists) these are the XPath 2.0 implementations I have found:

Saxon .Net
Query Machine - I had problems with this - exceptions with the examples
XQSharp - may be good, but is commercial (single developer ~300 $)

Now, I want some thoughts on how difficult it is to implementing some language such as XPath 2.0 expressions. I have found this link which have a EBNF for XPath 2.0 expression: http://www.w3.org/TR/2007/REC-xpath20-20070123/#id-grammar and I'm thinking of making it in F# with the fslex/fsyacc combination.

My background (subjective): I have played with these tools before, but only for some simple expressions and a very simple programming language. Furthermore, I have read most of the Dragon book and Appel´s Modern compiler implementation in ML - but unfortunately, I have not put the theory in practice while reading. I've studied computer science in a year now where I have completed courses with theory about ex finite automaton, CFL and algorithms but I have been a developer for years before university (a few years with professional jobs - back-end of websites mainly).

Now, the steps of parsing and what I tend to cover:

Lex - Parsing - Reductions: FsLex/FsYacc. I will properly not cover ALL of Xpath 2.0 at first but at least all of what XPath 1.0 can do + a little more.
Sematic analysis - I'm not sure about how much there is to this
Optimization - I do not tend to cover this (at least not at first)
Actual traversing etc.
...?

Now, the concrete questions in addition to the above:

How difficult is it to make a parser of this size? based on my background, would I could to it?
Is there any crucial steps I have missed in regards to XPath 2.0 in particular?
Is there any technology I have missed; Do I have to cover more than just XPath 2.0 and XDocument etc. to be able to make the parser?

To be clear: I want to make a XPath 2.0 expression parser and traverse XDocument etc. with this parsed expression. Which I guess combined is a query engine.

Update: I found this: http://www.w3.org/2007/01/applets/xpathApplet.html which contains code to parsing and traversing. I think it would be a nice start or reference :-)

Your answers will be appreciated.

Highspeed answered 24/8, 2010 at 9:42 Comment(16)

I dont really understand your question. XPath is a query language. It needs no parser, it needs an existing well-formed XML document with schema. The XML schema is what determines the structure of the XML, so in effect, that would be your 'yacc' for XML. That said, .NET all supports this. I see no need to re-invent the wheel here. – Kinnon 24/8, 2010 at 9:45

@Kinnon I may not have been clear in my use of terms. I want to parse //pf:*[@name='some']/@* so it is a XPath 2.0 expression parser I want to make. – Highspeed 24/8, 2010 at 10:4

@lasseespeholt: But why? Is The XPath 2 query engine (which I believe are compiled queries) not working? Or do want to use your little 'dsl' qeuries? – Kinnon 24/8, 2010 at 10:7

@Kinnon What XPath 2.0 query engine in .Net? I want to make an alternative to Saxon .Net for XPath 2.0. I do not understand exactly what you mean, please enlightened me ;-) – Highspeed 24/8, 2010 at 10:18

@lasseespeholt: msdn.microsoft.com/en-us/library/system.xml.xpath.aspx – Kinnon 24/8, 2010 at 10:37

@Kinnon XPathExpression.Compile only supports XPath 1.0. And I think they only supports the XPath 2.0 data model and not the parsing and traversing itself. – Highspeed 24/8, 2010 at 10:44

@lasseespeholt: Where do you see that? From what I can see, it supports XPath 2.0 and XQuery 1.0. – Kinnon 24/8, 2010 at 10:50

@leippe msdn.microsoft.com/en-us/library/ms163317.aspx look at return types under Remarks. It only returns nodeset, bool, string or number. – Highspeed 24/8, 2010 at 11:1

@lasseespeholt: I see. If I may ask, what else do you need? – Kinnon 24/8, 2010 at 11:6

@Kinnon Basically, all the functions here is relevant: w3schools.com/xpath/xpath_functions.asp (for building a testsuite) – Highspeed 24/8, 2010 at 11:20

@lasseespeholt: You are saying a lot of those cant be called from an XPath expression? I am just interested :) – Kinnon 24/8, 2010 at 11:26

@Kinnon I believe so, yes. At least not from a XPath 1.0 expression. – Highspeed 24/8, 2010 at 11:38

@lasseespeholt: The XQSharp page says: "XQSharp is free for non-commercial use" – Fallfish 24/8, 2010 at 12:53

@lasseespeholt: Good question (+1). See my answer for a shared experience. – Fallfish 24/8, 2010 at 13:7

@leppie: At present there is no Microsoft XPath 2.0 engine. SQL Server supports a limited (as per earlier working draft) subset of the XQuery, but this is not a full W3C XPath 2.0 Recommendation - compliant implementation. – Fallfish 24/8, 2010 at 13:14

@Kinnon Yes, I have noted the 'free' version. It might be okay for my project but nevertheless I will keep the question open because it interests me. And a version where I have the source AND not have to update my .dll´s every half year (you have to with the trial) would be nice for me. And also the challenge would be nice - unless it is way to complicated. – Highspeed 24/8, 2010 at 13:19

F

4

I implemented an XPath 2.0 parser entirely in XSLT 2.0 three years ago.

I used my LR Parsing Framework in FXSL and this was not so difficult. The grammar is quite big -- 209 rules, if I remember well. I used a modificationn of YACC (done by me) which I call Yaccx to generate the parsing tables as XML. These are the input for the general LR Parser, written in XSLT.

For such kind of project you need to allocate at least 6 months full time, maybe 1 year. The difficulty is in implementing the enormous function library (F & O).

Also, XPath is not a standalone language -- it must be hosted by another language. Due to this reason I didn't use this parser for anything meaningful, as I didn't have the access, influence and possibility to alter an existing hosting language.

So, be prepared for all these difficulties.

Fallfish answered 24/8, 2010 at 13:6 Comment(2)

+1 The work you have done sounds very interesting. May I ask why you used your own yacc and parseing framework and not just other implementations? I don't have 6 months full time :/ I guess I have a few hours each day but I'm studying currently. Also, the last point seems very rational, my initial usage of this was to make a online xpath tester but if it can't be used to anything besides that and others ain't requesting it it might be waste of time. – Highspeed 24/8, 2010 at 13:27

@lasseespeholt: This is not my own YACC. This is Berkley YACC, only slightly modified to output the parsing tables in XML format. Normally it outputs the parsing tables as C arrays. As for an XPath 2.0 Visualizer, I developed such four years ago and am considering publishing it. – Fallfish 24/8, 2010 at 13:42

T

5

I am one of the developers of XQSharp, so I have experience in this area. XQSharp actually began its life as an XPath implementation before we expanded it to support XQuery.

Our initial implementation took us about 6 months, although this was not the only thing we were working on at the time.

After this time we had an implementation that was feature complete. There were many areas in which this was not fully conformant, where the standard .NET methods did not behave quite the same as the specification required. Some examples of this are with converting values to and from strings, regular expressions, a lot of unicode stuff, problems with the .NET representations of XML (eg handling of xml:base) and so on.

There were several areas that needed to be done to implement this:

Parsing: The parser itself was straightforward, and mostly generated from the EBNF in the spec. I would estimate that this initially represented a few weeks work.

Data Model: How the data is represented. In order to have a full XPath implementation there are a lot of new data types (like xs:gDay) that need to be implemented. In our case we have all our items derive from a base type and all our expressions would return enumerators of these. You also need to be able to identify whether the type of an item matches a particular XPath type. We supported static typing and schema-awareness from the outset, without these features this section probably becomes trivial, but you are still looking at several weeks worth of work.

Expressions/Abstract Syntax Tree This is the model of the expression itself. We used the XQuery Formal Semantics document to produce a mapping from the various XPath constructs (for example axes and predicates) to a simpler core grammer (which ends up with huge amounts of let, for if and typeswitch expressions!). In our initial implementation all these expressions had evaluate methods and so were the final representation of the expression. In our case the expressions all had type check methods too, but this can be skipped initially (The main purpose of these is for optimization). Creating all these expressions again took several weeks.

Functions As a previous commenter pointed out the function library for XPath is rather large. The entire XPath library took us several months to implement.

Static Analysis A small amount of static analysis is required. Variable references and function calls must be bound to the correct variables and functions. Most XPath implementations are stack based, and so a stack allocation phase is required to assign pointers (or indexes) to all the variables. This static analysis took a week or two. The Dragon Book should set you up very nicely to solve most of these problems.

You are probably looking at another month's worth of work for all the extra bits of work that do not fall directly into these categories.

After all this work we were left with a mostly functional implementation of XPath; but it was far to slow for real world use (maybe 100x slower than XPath 1 in .NET). So after this comes the fun work - Optimization.

Bringing the engine up to 100% conformance and adding optimizations probably took another 12-18 man months (although we probably went a little overboard with optimization!), but by that point we had already made the transition to being an XQuery implementation.

My advice would be to start by tackling a subset of XPath (maybe forward axes only and a very limited function library) and you might be able to knock together an implementation in a month or two, but a serious XPath2 implementation will be a big investment in time.

Make sure that you use XPathNavigator for your node representation, as it has methods like SelectChildren which can take advantages of indexes in the underlying representations (for example XPathDocument).

Tidbit answered 27/8, 2010 at 17:5 Comment(3)

+1 I really appreciate you took the time to write about it :) It seems like a long journey. I did think it was a smaller project than that but I often do. Right now, I'll return to the studies and use XQuery for non-commercial stuff (at least for now). Thanks... – Highspeed 27/8, 2010 at 22:57

If I may add a little suggestion to XQuery, then I really think you guys should make your LINQ equivalent methods like XPathEvaluate, XPathSelect etc. behave just like the .Net XPath 1.0 version. – Highspeed 27/8, 2010 at 22:59

@lasseespeholt: I don't think we realised how big a journey this was when we started either! What differences are you referring to with the behaviour of the extension methods? If you could post this to our forum (xqsharp.com/forum) this would be greatly appreciated. – Tidbit 31/8, 2010 at 9:39

F

4

I implemented an XPath 2.0 parser entirely in XSLT 2.0 three years ago.

I used my LR Parsing Framework in FXSL and this was not so difficult. The grammar is quite big -- 209 rules, if I remember well. I used a modificationn of YACC (done by me) which I call Yaccx to generate the parsing tables as XML. These are the input for the general LR Parser, written in XSLT.

For such kind of project you need to allocate at least 6 months full time, maybe 1 year. The difficulty is in implementing the enormous function library (F & O).

Also, XPath is not a standalone language -- it must be hosted by another language. Due to this reason I didn't use this parser for anything meaningful, as I didn't have the access, influence and possibility to alter an existing hosting language.

So, be prepared for all these difficulties.

Fallfish answered 24/8, 2010 at 13:6 Comment(2)

+1 The work you have done sounds very interesting. May I ask why you used your own yacc and parseing framework and not just other implementations? I don't have 6 months full time :/ I guess I have a few hours each day but I'm studying currently. Also, the last point seems very rational, my initial usage of this was to make a online xpath tester but if it can't be used to anything besides that and others ain't requesting it it might be waste of time. – Highspeed 24/8, 2010 at 13:27

@lasseespeholt: This is not my own YACC. This is Berkley YACC, only slightly modified to output the parsing tables in XML format. Normally it outputs the parsing tables as C arrays. As for an XPath 2.0 Visualizer, I developed such four years ago and am considering publishing it. – Fallfish 24/8, 2010 at 13:42

P

3

To address your third concrete question, the Dragon Book makes no mention of Parsing Expression Grammars (PEGs)/Packrat parsers/parser combinator libraries, which are quite the rage now, especially when it comes to functional languages. See FParsec, for example.

Polytypic answered 24/8, 2010 at 9:46 Comment(1)

+1 I have never encountered PEGs (in class CFL and reg.) so I really appreciate your answer and will look into the tool :) – Highspeed 24/8, 2010 at 10:15

Recommended topics

Hot tags