Use CSS selectors to collect HTML elements from a streaming parser (e.g. SAX stream)

Asked 11/1, 2011 at 11:11 Answered 13/6, 2011 at 16:54

How to parse CSS (CSS3) selector and use it (in jQuery-like way) to collect HTML elements not from DOM (from tree structure), but from stream (e.g. SAX), i.e. using sequential access event based parser?

By the way, are there any CSS selectors (or their combination) that need access to DOM (Wikipedia SAX page says that XPath selectors "need to be able to access any node at any time in the parsed XML tree")?

I am most interested in implementing selector combinators, e.g. 'A B' descendant selector.

I prefer solutions describing algorithm, or in Perl (for HTML::Zoom).

Intinction answered 11/1, 2011 at 11:11 Comment(3)

Might be tricky since SAX work top-bottom while CSS selectors are evaluated bottom-top (right to left). But not impossible. – Alluring 14/1, 2011 at 11:42

In which langage do you want to implement it? Do you want to implement a restriction of "css" selector callback? In the past I develop a C++ incremental XML parser (not SAX) for huge xml-file (~500Mb), adding a such feature on top of SAX parser will be not so difficult (at least in c or C++). – Keniakenilworth 18/1, 2011 at 7:45

@VGE: I would prefer Perl, or description of algorithm, but I can read C++ well enough to get to underlying algorithm. – Toitoiboid 21/1, 2011 at 0:3

I would do it with regular expressions.

First, convert the selector into a regular expression that matches a simple top-to-bottom list of opening tags representing a given parser stack state. To explain, here are some simple selectors and their corresponding regexen:

A becomes /<A[^>]*>$/
A#someid becomes /<A[^>]*id="someid"[^>]*>$/
A.someclass becomes /<A[^>]*class="[^"]*(?<= |")someclass(?= |")[^"]*"[^>]*>$/
A > B becomes /<A[^>]*><B[^>]*>$/
A B becomes /<A[^>]*>(?:<[^>]*>)*<B[^>]*>$/

And so on. Note that the regular expressions all end with $, but do not start with ^; this corresponds with the way CSS selectors do not have to match from the root of the document. Also note that there is some lookbehind and lookahead stuff in the class matching code, which is necessary so that you don't accidentally match against "someclass-super-duper" when you want the quite distinct class "someclass".

If you need more examples, please let me know.

Once you've constructed the selector regex, you're ready to begin parsing. As you parse, maintain a stack of tags which currently apply; update this stack whenever you descend or ascend. To check for selector matching, convert that stack to a list of tags which can match the regular expression. For example, consider this document:

<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>

Your stack state string would go through the following values in order as you enter each element:

<x>
<x><a>
<x><y id="boo">
<x><y id="boo"><z class="bar">

The matching process is simple: whenever the parser descends into a new element, update the state string and check if it matches the selector regex. If the regex matches, then the selector matches that element!

Issues to watch out for:

Double quotes inside attributes. To get around this, apply html entity encoding to attribute values when creating the regex, and to attribute values when creating the stack state string.
Attribute order. When building both the regex and the state string, use some canonical order for the attributes (alphabetical is easiest). Otherwise, you might find that your regex for the selector a#someid.someclass which expects <a id="someid" class="someclass"> unfortunately fails when your parser goes into <a class="someclass" id="someid">.
Case sensitivity. According to the HTML spec, the class and id attributes match case sensitively (notice the 'CS' marker on the corresponding sections). So, you must use case-sensitive regex matching. However, in HTML, element names are not case sensitive, although they are in XML. If you want HTML-like case-insensitive element name matching, then canonicalize element names to either upper case or lower case in both the selector regex and the state stack string.
Additional magic is necessary to deal with the selector patterns that involve presence or absence of element siblings, namely A:first-child and A + B. You might accomplish these by adding a special attribute to the tag containing the name of the tag immediately prior, or "" if this tag is the first child. There's also the general sibling selector, A ~ B; I'm not quite sure how to deal with that one.

EDIT: If you dislike regular expression hackery, you can still use this approach to solve the problem, only using your own state machine instead of the regex engine. Specifically, a CSS selector can be implemented as a nondeterministic finite state machine, which is an intimidating-sounding term, but just means the following in practical terms:

There might be more than one possible transition from any given state
The machine tries one of them, and if that doesn't work out, then it backtracks and tries the other
The easiest way to implement this is to keep a stack for the machine, which you push onto whenever you follow a path and pop from whenever you need to backtrack. It comes down to the same sort of thing you'd use to do a depth-first search.

The secret behind nearly all of the awesomeness of regular expressions is in its use of this style of state machine.

Intercurrent answered 20/1, 2011 at 20:14 Comment(1)

Looking back at this I notice a mistake I made; the selector regex examples I wrote will match against tags which merely begin with the desired tag name, so i.e. a selector for 'A' will incorrectly match against a tag named 'Alakazam'. To avoid this you'd have to adjust the [^>]* parts to be something more like (?:| [^>]*), so that it won't start ignoring unimportant attributes until after a space indicates the tag name is finished. – Intercurrent 29/7, 2011 at 2:10

What does a browser do to create the DOM out of a stream? I guess there lies the answer to your question because it must be storing the discovered elements in a form which facilitates a CSS selector query. If you can afford reading the source code for an open source browser parser, then I think you can reuse it.

I would not do so, honestly. Rather I would reuse an existing SAX based parser (maybe you rewrite another with perl), and that go through the entire string. When handlers fire, use them to construct an in memory database for elements. Create a virtual "table" for every element with it's #number [for references] , tagName, parent #number, next #number and its opening tag char offset in the source mother string. As well, create a table for every attribute ever found, and fill it with a record for each tag with that attribute value.

Now it's all about the process of creating a database, tables and indices.

Penalty answered 22/1, 2011 at 4:20 Comment(6)

I'm skeptical. Isn't this essentially just re-inventing the wheel of DOM? – Intercurrent 22/1, 2011 at 16:13

That was why is suggested " reading the source code for an open source browser parser" – Penalty 22/1, 2011 at 19:33

That said, making use of the former implementation made by browsers manufacturers in past. – Penalty 22/1, 2011 at 19:43

This answer is not very useful for someone who actually would like to implement such thing. – Toitoiboid 23/1, 2011 at 22:7

I do like the idea of reinventing wheel - you can make it rounder and if not you'll at least gain experience why it is the way it is... BUT - this is completely out of boundaries! Do not re-implement DOM/HTML parser until you have a strong knowledge of all background technologies – Pit 24/1, 2011 at 8:45

It's not the same as reinventing DOM because the DOM objects don't need to be created. That is generally the advantage of a SAX approach, the downside is you have to drink form the fire hose (parse stream) rather than look at a nice built DOM tree. – Colville 13/6, 2011 at 16:43

Check out nokogiri. From their page:

Nokogiri is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.".

It's in Ruby, but you said you wanted an algorithm and Ruby is great for reading. Or just call it from whatever you are working in.

Colville answered 13/6, 2011 at 16:54 Comment(0)

Recommended topics

Hot tags