nutch vs solr indexing

I have recently started working on Nutch and I am trying to understand how it works. As far as I know, Nutch is basically used to crawl the web, and Solr/Lucene is used to index and search. But when I read the documentation on Nutch, it says that Nutch also does inverted indexing. Does it use Lucene internally to do the indexing, or does it have some other library for indexing? If it uses Solr/Lucene for indexing, then why is it necessary to configure Solr with Nutch, as the Nutch tutorial says?

Is the indexing done by default? I mean, when I run this command to start crawling, is indexing happening here?

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Or does indexing happen only in this case? (According to the tutorial: if you have a Solr core already set up and wish to index to it, you are required to add the -solr parameter to your crawl command, e.g.)

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Adkisson asked 1/6, 2012 at 5:18

Having a look here might be useful. When you run the first command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

you're crawling, which means that Nutch will create its own internal data, composed of:

  • the crawldb
  • the linkdb
  • a set of segments

You can see them in the following directories, which are created while you run the crawl command:

  • crawl/crawldb
  • crawl/linkdb
  • crawl/segments

You can think of that data as a kind of database where Nutch stores crawled data. It doesn't have anything to do with an inverted index.
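If you want to peek inside that data, Nutch ships with reader tools. For example, the following command (assuming the crawl directory layout above) prints crawldb statistics, such as how many URLs have been fetched so far:

bin/nutch readdb crawl/crawldb -stats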

After the crawl process you can index your data on a Solr instance. You can crawl and then index by running a single command, which is the second command from your question:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Otherwise you can run a separate command after the crawl command, specifically for indexing to Solr, but you have to provide the paths of your crawldb, linkdb and segments:

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
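Once solrindex has finished, one way to sanity-check the result is to query Solr directly. The URL below assumes a single default core at localhost:8983; adjust it if your instance uses a named core:

curl "http://localhost:8983/solr/select?q=*:*&wt=json&rows=5"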
Neuman answered 1/6, 2012 at 9:38
Thanks for the response. Just rephrasing it once: Nutch just crawls and stores the data in its db, and it does not do indexing by itself; Solr is needed to index. Am I right? I had another doubt, regarding adding new fields to the index. I wrote a sample plugin to add a new field by following the tutorials on the Apache Nutch website. Will this plugin be automatically picked up when I start crawling, or does the plugin need to be started separately? I followed all the steps and just started the crawl. I did not see any errors, but I did not see any new field inserted. I checked in the crawldb and also in the segments.Adkisson
@Adkisson You're right with your rephrasing, correct! Regarding the plugin I don't know, I've never worked with Nutch plugins, but maybe a new question with some more details (and code) would help.Neuman
Thanks again, I will do that. One final related question: when Solr is used to index data crawled by Nutch, are the indexes saved in the db of Nutch or in the db of Solr? If Solr, do you know which directory they will be saved under?Adkisson
@Adkisson All crawled data is normally stored in those Nutch data directories. If you then index in Solr, you index some of that information (crawled pages, metadata and content) within Solr too.Neuman
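To make that last point concrete: Solr keeps its own Lucene index on disk, under the data directory of its core, separate from Nutch's crawl directories. With the example setup shipped in the Solr distributions of that era, the index files would typically live somewhere like the path below; the exact location depends on how Solr was installed and on the core configuration:

ls example/solr/data/index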

You may be getting confused by legacy Nutch versions and the associated online documentation. Originally Nutch created its own index and had its own web search interface; using Solr was an option that required extra configuration and fiddling. Starting with 1.3, the indexing and search-server parts were stripped out, and it is now assumed that Nutch will be using Solr.

Unbiased answered 31/10, 2012 at 4:10
