I would do it with regular expressions.
First, convert the selector into a regular expression that matches a simple top-to-bottom list of opening tags representing a given parser stack state. To explain, here are some simple selectors and their corresponding regexen:
A
becomes /<A[^>]*>$/
A#someid
becomes /<A[^>]*id="someid"[^>]*>$/
A.someclass
becomes /<A[^>]*class="[^"]*(?<= |")someclass(?= |")[^"]*"[^>]*>$/
A > B
becomes /<A[^>]*><B[^>]*>$/
A B
becomes /<A[^>]*>(?:<[^>]*>)*<B[^>]*>$/
And so on. Note that the regular expressions all end with $, but do not start with ^; this corresponds with the way CSS selectors do not have to match from the root of the document. Also note that there is some lookbehind and lookahead stuff in the class matching code, which is necessary so that you don't accidentally match against "someclass-super-duper" when you want the quite distinct class "someclass".
If you need more examples, please let me know.
Once you've constructed the selector regex, you're ready to begin parsing. As you parse, maintain a stack of tags which currently apply; update this stack whenever you descend or ascend. To check for selector matching, convert that stack to a list of tags which can match the regular expression. For example, consider this document:
<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>
Your stack state string would go through the following values in order as you enter each element:
<x>
<x><a>
<x><y id="boo">
<x><y id="boo"><z class="bar">
The matching process is simple: whenever the parser descends into a new element, update the state string and check if it matches the selector regex. If the regex matches, then the selector matches that element!
Issues to watch out for:
Double quotes inside attributes. To get around this, apply html entity encoding to attribute values when creating the regex, and to attribute values when creating the stack state string.
Attribute order. When building both the regex and the state string, use some canonical order for the attributes (alphabetical is easiest). Otherwise, you might find that your regex for the selector a#someid.someclass
which expects <a id="someid" class="someclass">
unfortunately fails when your parser goes into <a class="someclass" id="someid">
.
Case sensitivity. According to the HTML spec, the class and id attributes match case sensitively (notice the 'CS' marker on the corresponding sections). So, you must use case-sensitive regex matching. However, in HTML, element names are not case sensitive, although they are in XML. If you want HTML-like case-insensitive element name matching, then canonicalize element names to either upper case or lower case in both the selector regex and the state stack string.
Additional magic is necessary to deal with the selector patterns that involve presence or absence of element siblings, namely A:first-child
and A + B
. You might accomplish these by adding a special attribute to the tag containing the name of the tag immediately prior, or "" if this tag is the first child. There's also the general sibling selector, A ~ B
; I'm not quite sure how to deal with that one.
EDIT: If you dislike regular expression hackery, you can still use this approach to solve the problem, only using your own state machine instead of the regex engine. Specifically, a CSS selector can be implemented as a nondeterministic finite state machine, which is an intimidating-sounding term, but just means the following in practical terms:
- There might be more than one possible transition from any given state
- The machine tries one of them, and if that doesn't work out, then it backtracks and tries the other
- The easiest way to implement this is to keep a stack for the machine, which you push onto whenever you follow a path and pop from whenever you need to backtrack. It comes down to the same sort of thing you'd use to do a depth-first search.
The secret behind nearly all of the awesomeness of regular expressions is in its use of this style of state machine.