Did you try GettingNutchRunningWithWindows from the Nutch Wiki?
Some of my students experimented a lot and here is the result of their work:
Tested with nutch 1.7 - http://www.apache.org/dyn/closer.cgi/nutch/1.7/apache-nutch-1.7-bin.zip
You'll also need cygwin.
1) Extract nutch to a path without spaces. For example:
2) Copy the JDK to some place without spaces. I tried making a symlink inside cygwin instead, but it did not go well. For example:
xcopy /S "C:\Program Files\Java\jdk1.7.0_21" c:\jdk1.7.0_21
3) In cygwin, set up the paths to java:
3.1) export JAVA_HOME=/cygdrive/c/jdk1.7.0_21
3.2) export PATH=$JAVA_HOME/bin:$PATH
3.3) Check that everything is correct by calling which java. It should return /cygdrive/c/jdk1.7.0_21/bin/java
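Steps 3.1 to 3.3 can be wrapped in a small guard so the variables are only exported when the JDK path is usable. This is a minimal sketch, not something shipped with nutch or cygwin: the function names check_java_home and use_jdk are hypothetical, and it assumes cygwin resolves bin/java to java.exe when testing for an executable.

```shell
# check_java_home DIR: hypothetical helper, not part of nutch or cygwin.
# Succeeds only if DIR contains no spaces and has an executable bin/java.
check_java_home() {
  case "$1" in
    *" "*)
      echo "path contains spaces: $1" >&2
      return 1
      ;;
  esac
  if [ ! -x "$1/bin/java" ]; then
    echo "no executable java under $1/bin" >&2
    return 1
  fi
  echo "ok: $1/bin/java"
}

# use_jdk DIR: export JAVA_HOME and PATH (steps 3.1 and 3.2),
# but only when the check above passes.
use_jdk() {
  check_java_home "$1" || return 1
  export JAVA_HOME="$1"
  export PATH="$JAVA_HOME/bin:$PATH"
}
```

With the JDK copied as in step 2, calling use_jdk /cygdrive/c/jdk1.7.0_21 followed by which java should then point at the copied JDK.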
SO FAR - this fixes the first problem, the incorrect java paths. Now to the second problem - hadoop patching.
4) Patch hadoop
In short:
- put patch-hadoop_7682-1.0.x-win.jar
in d:\dev\ir\nutch-1.7\lib
- edit d:\dev\ir\nutch-1.7\conf\nutch-site.xml
by adding the following:
<description>Enables patch for issue HADOOP-7682 on Windows</description>
5) Hadoop temp dir - I am not sure if this is necessary (try without it first), because I added it before applying the patch, but in my d:\dev\ir\nutch-1.7\conf\nutch-site.xml
I have
6) Hadoop version - I am not sure if this is necessary (try without it first). I downgraded hadoop to hadoop-core- before I found the patch, and it still stays this way on my setup.
If you find this necessary it is here: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/
6.1) Move hadoop-core-1.2.1.jar
from d:\dev\ir\nutch-1.7\lib
to some location for backup
6.2) Download hadoop-core-
to d:\dev\ir\nutch-1.7\lib
7) Some crawling optimizations. If you need to crawl lots of sites, don't start crawling with a huge list of urls and a big depth and topN.
If you do, you'll see that nutch fetches links from the same site one at a time, waiting 5 seconds between fetches.
The reason is that something like depth 30 and topN 200 will most likely fill the first fetch queue only with links from the same site. Nutch won't fetch them in parallel, because by default it is configured not to use several threads against the same site. So you are doomed to wait. A lot.
7.1) To resolve this, first run several crawls with small depth and topN - e.g.
bin/nutch crawl urls -dir crawl -depth 3 -topN 4
This will fill the generated fetch queue with urls from more than one site
7.2) Then you can try a big overnight crawl with
bin/nutch crawl urls -dir crawl -depth 20 -topN 150
7.3) To allow for some multi-threading, add the following to your nutch-site.xml.
It will allow several threads to fetch from the same host at once.
NOTE! Look up what these properties mean before using them.
<description>applicable ONLY if fetcher.threads.per.host is greater than 1 (i.e. the host blocking is turned off).</description>
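The warm-up-then-big-crawl sequence from 7.1 and 7.2 can be scripted. The following is only a sketch: the function name staged_crawl and the NUTCH variable are hypothetical, the default of three warm-up rounds is an arbitrary choice, and the crawl arguments are the ones quoted in 7.1 and 7.2.

```shell
# Path to the nutch launcher; override NUTCH if you run from another directory.
NUTCH="${NUTCH:-bin/nutch}"

# staged_crawl [ROUNDS]: hypothetical helper.
# Runs ROUNDS small crawls (7.1) to spread the fetch queue over several
# sites, then one big crawl (7.2). Stops early if a warm-up crawl fails.
staged_crawl() {
  rounds="${1:-3}"
  i=1
  while [ "$i" -le "$rounds" ]; do
    "$NUTCH" crawl urls -dir crawl -depth 3 -topN 4 || return 1
    i=$((i + 1))
  done
  "$NUTCH" crawl urls -dir crawl -depth 20 -topN 150
}
```

Run it from the nutch directory inside cygwin, e.g. staged_crawl 4 for four warm-up rounds before the long crawl.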
Note: When you crawl lots of sites, make sure your D:\Dev\id\apache-nutch-1.7\conf\regex-urlfilter.txt
includes only the sites in which you are interested. Otherwise you'll end up with "The Internet" on your disk.
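As an illustration of such a restricted filter, the snippet below writes a sample file you can compare against your own regex-urlfilter.txt. It is a sketch under assumptions: example.com is a placeholder host and the file name regex-urlfilter.sample.txt is made up. Nutch reads the rules top to bottom, and the first matching pattern decides whether a url is kept (+) or dropped (-).

```shell
# Write a sample filter next to the real one instead of overwriting it.
# The quoted 'EOF' keeps the backslashes in the regex intact.
cat > regex-urlfilter.sample.txt <<'EOF'
# accept only the site(s) you are interested in (example.com is a placeholder)
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
EOF
```

Add one + line per site you want, and keep the final -. line last so everything else is rejected.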