How to parse content located in specific HTML tags using a Nutch plugin?

I am using Nutch to crawl websites and I want to parse specific sections of the HTML pages it crawls. For example:

  <h><title> title to search </title></h>
   <div id="abc">
        content to search
   </div>
   <div class="efg">
        other content to search
   </div>

I want to extract the content of the div element with id="abc", the div element with class="efg", and so on.

I know that I have to create a plugin for customized parsing, because the HTML parser plugin provided by Nutch removes all HTML tags, CSS and JavaScript content and leaves only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html, but it covers parsing by HTML tag, whereas I want to select HTML tags by an attribute with a specific value. I found Jericho mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.

I need some guidance on how to devise a strategy for parsing HTML pages based on tags whose attributes have specific values.

  <config>
    <fields>
      <field name="custom_content" />
    </fields>
    <documents>
      <document url=".+" engine="css">
        <extract-to field="custom_content">
          <text>
            <expr value="#abc" />
          </text>
          <text>
            <expr value=".efg" />
          </text>
        </extract-to>
      </document>
    </documents>
  </config>
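
If you would rather write the filtering logic yourself, the selection by id and class can be sketched as a plain DOM walk. This is only a minimal sketch using the JDK's `org.w3c.dom` classes, not Nutch's or Jericho's own API: the class name `AttributeTextExtractor` and its methods are hypothetical, and the sample page is parsed as well-formed XHTML purely so the example runs on its own. In a Nutch 1.x `HtmlParseFilter`, the `DocumentFragment` passed to `filter` is also a DOM `Node`, so I would expect a walk like `collect` to work on it, but I have not tested that here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of elements selected by id or class.
public class AttributeTextExtractor {

    // Returns the trimmed text of every element whose id equals wantedId
    // or whose class attribute contains wantedClass as one of its tokens.
    public static List<String> collect(Node root, String wantedId, String wantedClass) {
        List<String> out = new ArrayList<>();
        walk(root, wantedId, wantedClass, out);
        return out;
    }

    private static void walk(Node node, String wantedId, String wantedClass, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // getAttribute returns an empty string when the attribute is absent.
            boolean idMatch = wantedId.equals(el.getAttribute("id"));
            String[] classes = el.getAttribute("class").trim().split("\\s+");
            boolean classMatch = Arrays.asList(classes).contains(wantedClass);
            if (idMatch || classMatch) {
                out.add(el.getTextContent().trim());
                return; // the text of nested elements is already included
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, wantedId, wantedClass, out);
        }
    }

    // Convenience for the demo only: parses a well-formed XHTML string with the JDK parser.
    public static List<String> extractFromXhtml(String xhtml, String wantedId, String wantedClass) {
        try {
            Node doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return collect(doc, wantedId, wantedClass);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title> title to search </title></head><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "<div id=\"xyz\"> ignored </div>"
                + "</body></html>";
        for (String text : extractFromXhtml(page, "abc", "efg")) {
            System.out.println(text);
        }
    }
}
```

The walk stops descending once an element matches, so a matching div nested inside another matching div is not reported twice. Real-world HTML is rarely well-formed XML, which is why the crawler's own HTML parser (or a library such as Jericho) should supply the DOM in practice; only `collect` is meant to carry over.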
