Apache Nutch steps explanation

I have followed the article https://wiki.apache.org/nutch/NutchTutorial and set up Apache Nutch + Solr. But I want to check whether I have understood correctly how the Nutch steps work.

1). Inject: In this step, Nutch reads the URL list from the given seed.txt, checks the URLs against the patterns in regex-urlfilter.txt, and updates the crawldb with the accepted URLs.
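
For reference, the inject command from the tutorial looks like this (assuming seed.txt sits in a urls/ directory, as the tutorial sets up):

bin/nutch inject crawl/crawldb urls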

2). Generate: bin/nutch generate crawl/crawldb crawl/segments. Nutch takes URLs from the crawldb and creates a fetch list of URLs that are ready to be fetched. It takes inputs such as -topN and a time gap, and then creates a directory named with the current time under segments.
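
For example, to limit the fetch list to the 1000 top-scoring URLs (the number is only an illustration):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000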

I believe that in the first two steps there is no interaction with the internet; everything happens locally.

Q: Where is the fetch list kept?

3). Fetch: bin/nutch fetch crawl/segments/

Fetch works through the fetch list, downloads the contents (and URLs) from the given URLs, and stores them somewhere.
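
As I understand the tutorial, fetch is run against one specific segment, usually the most recently generated one, e.g.:

s1=`ls -d crawl/segments/2* | tail -1`   # newest segment directory
bin/nutch fetch $s1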

Q: Does fetch read the whole page at the given URL (text + other URLs)? Q: Where does Nutch keep the fetched data?

4). Parse: bin/nutch parse crawl/segments/

It parses the entries.

Q: What is meant by parse here? Q: Where can I find the result of this step?

5). bin/nutch updatedb crawl/crawldb crawl/segments/

When this is complete, Nutch updates the database with the results of the fetch.

Q: Does it update the crawldb with the parsed data only, or with something else as well?
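
I assume the effect of this step can be checked by comparing the crawldb status counts before and after it runs, e.g.:

bin/nutch readdb crawl/crawldb -stats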

Please clear my doubts.

Leesaleese answered 12/4, 2015 at 12:21 Comment(0)

Your assumptions for the first and second steps are correct. However, you need to understand how the whole workflow takes place. When Nutch fetches URLs, it downloads data such as web pages or images as binary content and stores it in the segments as crawl data, using a class named Content.

Later, in the parsing step, the stored Content objects are parsed into another data format, ParseData, which includes the text of the data plus its outlinks, if available. The ParseData is written back to the segments to be processed in the next job. After this step comes the crawldb update job, where the links from the previous step are put back into the crawldb to update the page scores and link details.

At the indexing step, the information from the parsed data in the segments is structured into fields. Nutch uses a class named NutchDocument to store this structured data, and the Nutch documents are put back into the segments to be processed in the next step. Lastly, Nutch sends the Nutch documents to an indexing backend such as Solr or Elasticsearch. This is the last step; at this stage you can remove the segments if you do not want to send them to the indexing backend again. In other words, this is the flow of the data:

seed list -> injected URLs -> crawl items (simply the URLs) -> Content -> parsed data -> Nutch documents.
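
For the indexing step itself, with Solr the commands look roughly like this (exact options vary between Nutch versions; the Solr URL is a placeholder for your own instance):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments    # build/refresh the linkdb used for anchor texts
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*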

I hope that answers some of your questions.

Bannon answered 12/4, 2015 at 14:58 Comment(3)
Thanks for the answer, it really helped my understanding. Can you point me to the locations of all the outputs: crawl item, Content, ParseData and NutchDocument? - Leesaleese
These classes are in the Nutch core source. Nutch uses Hadoop RecordReaders and Writers to read and write these data formats, so you would need to read the Nutch source code to understand how to read them directly. However, you can easily use the bin/nutch commands to explore your stored data from each step. - Bannon
Thanks @ameertawfik. I want to understand Hadoop better. Can you point me to the right book, blog, or tutorial? - Leesaleese

Your understanding of the first two steps, inject and generate, is right.

I am answering your questions step by step. A fetch list is the list of URLs to be fetched in the current iteration of the crawl. You can limit the size of a fetch list using the generate.max.count property. The generated fetch list is stored in the crawl_generate directory inside its corresponding segment. You won't, however, be able to read it as is, because it is stored in a binary format.
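
You can, however, dump it into a readable form with the readseg command. A sketch, with the segment path as a placeholder (the -no* flags suppress everything except the crawl_generate data):

bin/nutch readseg -dump crawl/segments/<your-segment> generate_dump -nocontent -nofetch -noparse -noparsedata -noparsetext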

After the generate step comes the fetch step, where the URLs in the fetch list are fetched from the web. The fetch results are stored in the crawl_fetch directory of the segment (the raw downloaded content goes into the segment's content directory).

Once the URLs are fetched, the content is parsed to extract the outlinks, text, metadata, etc. The output of the parsing step is in the crawl_parse, parse_data and parse_text directories of the segment.

Once parsing is complete, we update the crawldb with the newly found links from the recently fetched URLs. The crawldb only contains URLs and information about them, such as fetch status, score, modification time, etc. You can think of it as a database that stores information about URLs.
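
The crawldb is also stored in a binary format, but it can be inspected with readdb, for example (the dump directory name is just an example):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb_dump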

When a new iteration of the crawl starts, the newly added URLs from the crawldb are put into the next fetch list, and the process continues.
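
Putting it together, one crawl iteration is just generate, fetch, parse and updatedb chained on the newest segment. A rough sketch of that loop (essentially what the bundled bin/crawl script automates; the iteration count and -topN value are arbitrary):

for i in 1 2 3
do
  # select the URLs that are due and create a new segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # pick the segment that was just created (newest timestamp)
  segment=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $segment                     # download the pages
  bin/nutch parse $segment                     # extract text and outlinks
  bin/nutch updatedb crawl/crawldb $segment    # merge results back into the crawldb
done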

Shawn answered 24/6, 2018 at 10:36 Comment(0)
