Extract and Clean HTML Fragment using HTML Parser (org.htmlparser)

I'm looking for an efficient approach to extracting a fragment of HTML from a web page and performing some specific operations on that HTML fragment.

The operations required are:

  1. Remove all tags that have a class of "hidden"
  2. Remove all script tags
  3. Remove all style tags
  4. Remove all event attributes (on*="*")
  5. Remove all style attributes

I've been using HTML Parser (org.htmlparser) for this task and have been able to meet all of the requirements; however, I don't feel that I have an elegant solution. Currently, I parse the web page with a CssSelectorNodeFilter (to get the fragment) and then re-parse that fragment with a NodeVisitor to carry out the cleaning operations.

Could anybody suggest how they would tackle this problem? I would prefer to only parse the document once and perform all operations during that one parse.

Thanks in advance!

Wiggly asked 2/12, 2011 at 14:30

Check out jsoup - it should handle all of your necessary tasks in an elegant way.

[Edit]

Here's a full working example per your required operations:

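import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Load and parse the document (for an HTML string, see Jsoup.parseBodyFragment(String)).
File f = new File("myfile.html");
Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");

// Remove all script and style elements and those of class "hidden".
doc.select("script, style, .hidden").remove();

// Remove all style and event-handler attributes from all elements.
for (Element el : doc.select("*")) {
  // Iterate over a copy of the attributes (asList()) so that removing one
  // does not modify the collection while it is being iterated.
  for (Attribute attr : el.attributes().asList()) {
    String attrKey = attr.getKey();
    if (attrKey.equals("style") || attrKey.startsWith("on")) {
      el.removeAttr(attrKey);
    }
  }
}
// See also - doc.select("*").removeAttr("style");

// Serialize the cleaned body back to an HTML string.
String cleanedHtml = doc.body().html();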

You'll want to make sure the attribute-name checks are not case-sensitive, but this should cover the majority of what you need.
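
As a minimal sketch of the case-insensitivity point, the attribute check could be pulled into a small helper that normalizes the name before comparing. The class name AttributeFilter and the method shouldRemove are hypothetical names chosen for illustration; they are not part of jsoup.

```java
import java.util.Locale;

public class AttributeFilter {
    // Hypothetical helper: returns true for "style" and for any attribute
    // whose name starts with "on" (event handlers), ignoring case.
    public static boolean shouldRemove(String attrKey) {
        // Locale.ROOT keeps the lowercasing independent of the default locale.
        String key = attrKey.toLowerCase(Locale.ROOT);
        return key.equals("style") || key.startsWith("on");
    }

    public static void main(String[] args) {
        System.out.println(shouldRemove("onClick")); // true
        System.out.println(shouldRemove("STYLE"));   // true
        System.out.println(shouldRemove("href"));    // false
    }
}
```

In the loop above, the condition would then become `if (AttributeFilter.shouldRemove(attr.getKey()))`. Note that the "on" prefix test also matches any non-event attribute whose name happens to start with "on".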

Atherton answered 2/12, 2011 at 15:16

Comments (2):
I will take a look at jsoup. If it provides a better framework for solving my problem, then I shall submit an answer advocating its use for my requirements. Thanks for the tip. - Wiggly
How do I get the resulting string after removing the attributes? - Balf