How to parse content located in specific HTML tags using a Nutch plugin?
I am using Nutch to crawl websites and I want to parse specific sections of html pages crawled by Nutch. For example,

  <h><title> title to search </title></h>
   <div id="abc">
        content to search
   </div>
   <div class="efg">
        other content to search
   </div>

I want to extract the div elements with id="abc" and class="efg", and so on.

I know that I have to create a plugin for customized parsing, because the HTML parser plugin provided by Nutch removes all HTML tags, CSS, and JavaScript and leaves only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html, but it covers parsing by HTML tag, whereas I want to select HTML tags whose attributes have specific values. I found Jericho mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.

I need some guidance about how to devise a strategy for parsing html pages on the basis of tags with attribute having specific value.

Probationer answered 31/7, 2013 at 14:2 Comment(0)
You can use this plugin to extract data from your pages based on CSS selectors:

https://github.com/BayanGroup/nutch-custom-search

In your example, you can configure it in this way:

<config>
    <fields>
        <field name="custom_content" />
    </fields>
    <documents>
        <document url=".+" engine="css">
            <extract-to field="custom_content">
                <text>
                    <expr value="#abc" />
                </text>
                <text>
                    <expr value=".efg" />
                </text>
            </extract-to>
        </document>
    </documents>
</config>
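
If the plugin is not an option for your Nutch version, the same selection can be done by hand on the parsed DOM. The following is only a minimal sketch using the JDK's W3C DOM classes, not code from the plugin above: the class `AttributeTextExtractor` and its methods are hypothetical names, and it parses a well-formed snippet with the JDK XML parser purely for demonstration. As far as I know, a Nutch 1.x `HtmlParseFilter` receives the page as an `org.w3c.dom.DocumentFragment`, so a similar walk could be applied there, but treat that wiring as an assumption to verify against your version.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of elements selected by attribute value.
public class AttributeTextExtractor {

    /** Returns the trimmed text of every element whose attribute `attr` contains the token `value`. */
    public static List<String> extract(Node root, String attr, String value) {
        List<String> out = new ArrayList<>();
        collect(root, attr, value, out);
        return out;
    }

    private static void collect(Node node, String attr, String value, List<String> out) {
        if (node instanceof Element) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            // Splitting on whitespace lets class="efg other" match "efg".
            List<String> tokens = Arrays.asList(el.getAttribute(attr).trim().split("\\s+"));
            if (tokens.contains(value)) {
                out.add(el.getTextContent().trim());
                return; // the text of nested elements is already included
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, attr, value, out);
        }
    }

    /** Demo-only parser: requires well-formed markup, unlike real-world HTML. */
    public static Node parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Node root = parse("<html><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "</body></html>");
        System.out.println(extract(root, "id", "abc"));
        System.out.println(extract(root, "class", "efg"));
    }
}
```

Inside a real parse filter you would then store the collected strings in the parse metadata and map them to an index field with an indexing filter, which is roughly what the blog post linked in the question describes.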
Amalgam answered 18/12, 2013 at 12:8 Comment(2)
When I tried the above example in 'extractors.xml', Nutch would not index into Solr. It works if I remove any one <text> element. Does the plugin not accept multiple <text> elements? - Schermerhorn
This plugin does not work with the newest Nutch versions, i.e. 2.x. - Libove
