I'm looking for an efficient approach to extracting a fragment of HTML from a web page and performing some specific operations on that HTML fragment.
The operations required are:
- Remove all elements that have the class "hidden"
- Remove all script tags
- Remove all style tags
- Remove all event attributes (on*="*")
- Remove all style attributes
I've been using HTML Parser (org.htmlparser) for this task and it meets all of the requirements, but I don't feel that I have an elegant solution. Currently, I parse the web page with a CssSelectorNodeFilter to get the fragment, and then re-parse that fragment with a NodeVisitor to carry out the cleaning operations.
Could anybody suggest how they would tackle this problem? I would prefer to parse the document only once and perform all of the operations during that single parse.
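To make the goal concrete, here is a rough sketch of the kind of single-pass cleaning I have in mind. It is not my HTML Parser code: it uses the JDK's built-in DOM parser instead, so it assumes the fragment is well-formed XHTML with a single root element (real-world HTML would need a more lenient parser). The class name `FragmentCleaner` and its methods are made up for illustration, and the root element itself is never removed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: parse once, then clean the tree in one recursive walk.
class FragmentCleaner {

    /** Parses a well-formed XHTML fragment, cleans it in place, and serializes it. */
    static String clean(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            cleanElement(doc.getDocumentElement());

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not clean fragment", e);
        }
    }

    /** Strips unwanted attributes from el, then drops or recurses into each child. */
    private static void cleanElement(Element el) {
        // Collect names first; removing while iterating a live NamedNodeMap is unsafe.
        NamedNodeMap attrs = el.getAttributes();
        List<String> toRemove = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            String name = attrs.item(i).getNodeName();
            String lower = name.toLowerCase(Locale.ROOT);
            // Style attributes and event attributes (anything starting with "on").
            if (lower.equals("style") || lower.startsWith("on")) {
                toRemove.add(name);
            }
        }
        for (String name : toRemove) {
            el.removeAttribute(name);
        }

        Node child = el.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child instanceof Element) {
                Element c = (Element) child;
                if (shouldDrop(c)) {
                    el.removeChild(c);
                } else {
                    cleanElement(c);
                }
            }
            child = next;
        }
    }

    /** True for script and style elements and for anything carrying the class "hidden". */
    private static boolean shouldDrop(Element el) {
        String tag = el.getTagName().toLowerCase(Locale.ROOT);
        if (tag.equals("script") || tag.equals("style")) {
            return true;
        }
        String cls = el.getAttribute("class"); // empty string when absent
        return Arrays.asList(cls.trim().split("\\s+")).contains("hidden");
    }
}
```

Calling `FragmentCleaner.clean(fragment)` applies all five rules in one walk over the tree. What I can't see is how to get the same effect with HTML Parser while also selecting the fragment in that same parse.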
Thanks in advance!