HTML/XML is divided into markup and content.
Regex is only useful for doing a lexical parse of the tags.
I guess you could deduce the content from what lies between the tags.
That makes it a good choice for a SAX-style parser:
tags and content could be delivered to a user-defined
function, where the nesting/closure of elements
can be kept track of.
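A minimal Python sketch of that SAX-style idea, under stated assumptions: the names `tokenize` and `NestingTracker` are hypothetical, and the tag pattern is a simplified one that ignores comments, CDATA, and script bodies. A regex drives the scan, and tags and text are handed to user-defined callbacks that track nesting with a stack.

```python
import re

# Simplified tag pattern (an assumption for this sketch): optional slash,
# tag name, then quoted or unquoted attribute text up to the closing ">".
TAG = re.compile(r"""<(/?)([\w:]+)((?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+)>""")


def tokenize(html, handler):
    """Deliver tags and the text between them to a user-defined handler."""
    pos = 0
    for m in TAG.finditer(html):
        if m.start() > pos:
            handler.on_text(html[pos:m.start()])
        closing, name, rest = m.groups()
        if closing:
            handler.on_close(name)
        else:
            handler.on_open(name)
            if rest.rstrip().endswith("/"):
                handler.on_close(name)  # self-closing tag such as <br/>
        pos = m.end()
    if pos < len(html):
        handler.on_text(html[pos:])


class NestingTracker:
    """Hypothetical handler: records events with the current nesting depth."""

    def __init__(self):
        self.stack = []
        self.events = []

    def on_open(self, name):
        self.stack.append(name)
        self.events.append(("open", name, len(self.stack)))

    def on_close(self, name):
        # Tolerate ill-formed input: unwind to the nearest matching open tag.
        if name in self.stack:
            while self.stack.pop() != name:
                pass
        self.events.append(("close", name, len(self.stack)))

    def on_text(self, text):
        if text.strip():
            self.events.append(("text", text.strip(), len(self.stack)))


tracker = NestingTracker()
tokenize('<div><p class="x">Hi <br/>there</p></div>', tracker)
for event in tracker.events:
    print(event)
```

The regex only finds the tag boundaries; all knowledge of nesting lives in the handler, which is the division of labor described above.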
As far as just parsing the tags goes, it can be done with
regex and used to strip tags from a document.
Over years of testing, I've found the secret to the
way browsers parse tags, both well-formed and ill-formed.
The core of a normal element's tag is parsed with this regex:
(?:
     " [\S\s]*? "
  |  ' [\S\s]*? '
  |  [^>]?
)+
You'll notice this [^>]?
as one of the alternations.
This will match unbalanced quotes from ill-formed tags.
It is also the single biggest root of all evil in regular expressions.
The way it's used will trigger a bump-along to satisfy its greedy, must-match
quantified container.
If used passively, there is never a problem.
But if you force something to match by interspersing it with
a wanted attribute/value pair, and don't provide adequate protection
from backtracking, it's an out-of-control nightmare.
This is the general form for just plain old tags.
Notice the [\w:]
representing the tag name?
In reality, the legal characters representing the tag name
are an incredible list of Unicode characters.
<
(?:
     [\w:]+
     \s+
     (?:
          " [\S\s]*? "
       |  ' [\S\s]*? '
       |  [^>]?
     )+
     \s* /?
)
>
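To see what the general form does with quotes, here is a small Python sketch that compiles it in verbose mode and runs it on one well-formed and one ill-formed tag. The sample strings and the name `GENERAL_TAG` are made up for illustration.

```python
import re

# The general form from above, compiled with re.VERBOSE so the layout
# (whitespace and comments) is ignored by the regex engine.
GENERAL_TAG = re.compile(r"""
    <
    (?:
        [\w:]+
        \s+
        (?:
            " [\S\s]*? "      # double-quoted value, may contain >
          | ' [\S\s]*? '      # single-quoted value, may contain >
          | [^>]?             # anything else, including a stray quote
        )+
        \s* /?
    )
    >
""", re.VERBOSE)

well_formed = "<a href=\"x>y\" title='ok'>link</a>"
ill_formed = '<a href="x>link</a>'  # the opening quote is never closed

well_match = GENERAL_TAG.search(well_formed).group(0)
ill_match = GENERAL_TAG.search(ill_formed).group(0)

print(well_match)  # the > inside the quoted value does not end the tag
print(ill_match)   # the stray quote is eaten by [^>]? and the tag ends at the first >
```

In the ill-formed case the quoted alternatives fail because there is no closing quote, so `[^>]?` takes the stray quote as an ordinary character.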
Moving on, we also see that you just can't search for a specific tag
without parsing ALL tags.
I mean you could, but it would have to use a combination of
verbs like (*SKIP)(*FAIL), and even then all tags still have to be parsed.
The reason is that tag syntax may be hidden inside other tags, etc.
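Python's `re` module has no backtracking verbs, so the sketch below swaps in a different but related idiom: match every tag, and capture only the one you want. The pattern is a reduced stand-in (comments, script/style blocks, and generic tags only), and `SKIP_OR_KEEP` and `find_img_tags` are hypothetical names.

```python
import re

# Attribute core: quoted values, or any single non-> character.
CORE = r"""(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+"""

SKIP_OR_KEEP = re.compile("|".join([
    r"<!--[\S\s]*?-->",                                  # comment: consumed, not kept
    r"<(script|style)\b" + CORE + r">[\S\s]*?</\1\s*>",  # invisible content: consumed
    r"(?P<img><img\b" + CORE + r">)",                    # the wanted tag: captured
    r"</?[\w:]+" + CORE + r">",                          # any other tag: consumed
]))


def find_img_tags(html):
    """Return only the <img> tags that are real tags, not text hidden in others."""
    return [m.group("img") for m in SKIP_OR_KEEP.finditer(html) if m.group("img")]


html = (
    '<!-- <img src="hidden.png"> -->'
    "<p title=\"<img src='fake.png'>\">x</p>"
    '<img src="real.png" alt="a>b">'
)
found = find_img_tags(html)
print(found)
```

The comment and the `<p>` tag are each consumed as a whole, so the `<img` text hidden inside them is never offered to the capturing alternative.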
So, to passively parse all tags, a regex like the one below is needed.
This particular one matches invisible content as well.
As HTML, XML, or any other markup language develops new constructs, just add them as
alternations.
Web page note - I've never seen a web page (or XHTML/XML) that this
had trouble with. If you find one, let me know.
Performance note - It's quick. This is the fastest tag parser I've seen
(there may be faster, who knows).
I have several specific versions. It is also excellent as a scraper
(if you're the hands-on type).
Complete raw regex
<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Formatted look
<
(?:
     (?:
          (?:
               # Invisible content; end tag req'd
               (                             # (1 start)
                    script
                 |  style
                 |  object
                 |  embed
                 |  applet
                 |  noframes
                 |  noscript
                 |  noembed
               )                             # (1 end)
               (?:
                    \s+
                    (?>
                         " [\S\s]*? "
                      |  ' [\S\s]*? '
                      |  (?:
                              (?! /> )
                              [^>]
                         )?
                    )+
               )?
               \s* >
          )
          [\S\s]*? </ \1 \s*
          (?= > )
     )
  |  (?: /? [\w:]+ \s* /? )
  |  (?:
          [\w:]+
          \s+
          (?:
               " [\S\s]*? "
            |  ' [\S\s]*? '
            |  [^>]?
          )+
          \s* /?
     )
  |  \? [\S\s]*? \?
  |  (?:
          !
          (?:
               (?: DOCTYPE [\S\s]*? )
            |  (?: \[CDATA\[ [\S\s]*? \]\] )
            |  (?: -- [\S\s]*? -- )
            |  (?: ATTLIST [\S\s]*? )
            |  (?: ENTITY [\S\s]*? )
            |  (?: ELEMENT [\S\s]*? )
          )
     )
)
>
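For the hands-on type, here is a rough Python sketch of using the complete regex to strip tags. One adaptation is an assumption made for portability: the atomic group `(?>...)` is written as a plain `(?:...)`, because Python's `re` only accepts atomic groups from version 3.11 onward. That removes the backtracking protection the atomic group provides, so treat it as a demonstration rather than a hardened version. `ALL_TAGS` and `strip_tags` are hypothetical names.

```python
import re

# The complete raw regex, split across lines for readability only.
# Adaptation (assumption): (?>...) replaced by (?:...) for older Pythons.
ALL_TAGS = re.compile(
    r"""<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)"""
    r"""(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))"""
    r"""|(?:/?[\w:]+\s*/?)"""
    r"""|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)"""
    r"""|\?[\S\s]*?\?"""
    r"""|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)"""
    r"""|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>"""
)


def strip_tags(html):
    """Remove every tag, including invisible content such as script bodies."""
    return ALL_TAGS.sub("", html)


sample = (
    '<!DOCTYPE html><p class="a>b">Hi</p>'
    '<script>var x = "<b>";</script><!-- c -->Bye'
)
stripped = strip_tags(sample)
print(stripped)
```

Note that the script element is removed together with its body, which is what "invisible content" means in the first alternation.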