How to re-crawl with Nutch

I am using Nutch 2.1 integrated with MySQL. I crawled two sites, and Nutch successfully crawled them and stored the data in MySQL. I am using Solr 4.0.0 for searching.

Now my problem is that when I try to re-crawl a site like trailer.apple.com (or any other site), it always crawls the last crawled URLs. I have even removed the last crawled URLs from the seeds.txt file and entered new URLs, but Nutch is not crawling the new URLs.

Can anybody tell me what I am actually doing wrong?

Also, please suggest a Nutch plugin that can help with crawling video and movie sites.

Any help would be really appreciated.

Enjoin answered 14/12, 2012 at 6:21 Comment(0)

I have the same problem: Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.

The first time I started Nutch, I did the following:

  • Added the domain "www.domain01.com" to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt (without the quotes)

  • In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt, added a new line:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain01.com/sport/

  • In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt, added a new line:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain01.com/sport/

... and everything was fine.

Next I made the following changes:

  • Removed www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and added two new domains: www.domain02.com and www.domain03.com

  • Removed www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and added two new lines:

    # accept anything else
       ^http://([a-z0-9]*.)*www.domain02.com/sport/
       ^http://([a-z0-9]*.)*www.domain03.com/sport/

  • Removed www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and added two new lines:

    # accept anything else
       ^http://([a-z0-9]*.)*www.domain02.com/sport/
       ^http://([a-z0-9]*.)*www.domain03.com/sport/

Next I executed the following commands:

updatedb
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3

And Nutch still crawls www.domain01.com. I don't know why.

I use Nutch 2.1 on Debian 6.0.5 (x64). Linux runs in a virtual machine on Windows 7 (x64).
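
For reference, the all-in-one crawl command can also be run as the individual Nutch 2.x jobs, which makes it easier to see at which step the old URLs reappear. This is only a rough sketch; the -topN value is an example and the exact flags can vary between 2.x releases:

    bin/nutch inject urls          # add the seed URLs to the web table
    bin/nutch generate -topN 50    # select URLs that are due for fetching
    bin/nutch fetch -all           # fetch the generated batch
    bin/nutch parse -all           # parse the fetched pages
    bin/nutch updatedb             # write new outlinks and scores back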

Ufa answered 4/2, 2013 at 14:57 Comment(1)
I solved the problem. In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and in /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt I just removed the spaces before the domain patterns, so that each ^http://... rule starts at the beginning of its line. Now Nutch crawls the new URLs. Ufa
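
In other words, each rule in regex-urlfilter.txt must start in the first column with a + (include) or a - (exclude) prefix; lines starting with anything else, including whitespace, are ignored. A corrected fragment might look like this (with the dots escaped too, since an unescaped dot matches any character):

    # accept anything else
    +^http://([a-z0-9]*\.)*www\.domain02\.com/sport/
    +^http://([a-z0-9]*\.)*www\.domain03\.com/sport/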

This post is a bit outdated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

Perhaps the last crawled pages are the ones that change the most. Nutch uses an adaptive algorithm to schedule re-crawls, so a very static page should not be re-crawled very often. You can override how often you want to re-crawl in nutch-site.xml. Also, the seed.txt file is supposed to be a seed list: once you inject the URLs, Nutch does not use it anymore (unless you manually re-inject it).
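
For example, to make pages eligible for re-fetching sooner, you can shorten the default fetch interval in nutch-site.xml. The value below (one day, in seconds) is only an illustration:

    <property>
      <name>db.fetch.interval.default</name>
      <value>86400</value>
      <description>Default interval between re-fetches of a page, in seconds.</description>
    </property>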

Another configuration that may help is your regex-urlfilter.txt, if you want to restrict the crawl to a specific place or exclude certain domains/pages, etc.
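
For example, to crawl one host from the question while skipping part of it (the /press/ path here is just a placeholder), remember that the rules are evaluated top to bottom and the first match wins:

    # skip one section of the host
    -^http://trailer\.apple\.com/press/
    # accept everything else on this host
    +^http://trailer\.apple\.com/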

Cheers.

Escutcheon answered 24/12, 2012 at 2:36 Comment(0)

Just add the property below to your nutch-site.xml; it works for me. Check it:

    <property>
      <name>file.crawl.parent</name>
      <value>false</value>
    </property>

Then change regex-urlfilter.txt:

    # skip file: ftp: and mailto: urls
    #-^(file|ftp|mailto):

    # accept anything else
    +.

After that, remove the old index directory, either manually or with a command like:

    rm -r $NUTCH_HOME/indexdir

Then run your crawl command again.
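
For a setup like the one in the question (Nutch 2.1 storing into MySQL and indexing with Solr), the full clean-and-recrawl sequence can be sketched as below. The database name nutch, the table name webpage, and the Solr URL are assumptions based on a default Gora/Solr install:

    # 1. Drop the previously crawled data from the MySQL-backed store
    #    (assumes database "nutch" with the default Gora table "webpage")
    mysql -u root -p nutch -e "TRUNCATE TABLE webpage;"

    # 2. Wipe the old documents from the Solr index (default local URL assumed)
    curl "http://localhost:8983/solr/update?commit=true" \
         -H "Content-Type: text/xml" \
         --data-binary "<delete><query>*:*</query></delete>"

    # 3. Re-inject the new seeds and crawl again
    bin/nutch inject urls
    bin/nutch crawl urls -depth 3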

Hiett answered 17/10, 2013 at 8:29 Comment(0)
