what is going on inside of Nutch 2?
I am eager to know (and need to know) about Nutch and the algorithms it uses to fetch, classify, and so on (crawling in general), because it relates to my project.
I read this material, but it is a little hard to understand.
Is there anyone who can explain it to me in a complete and easy-to-understand way?
Thanks in advance.

Pelt answered 27/7, 2012 at 22:22 Comment(0)
Short Answer

In short, they have developed a webcrawler designed to crawl the web very efficiently from a multi-computer environment (but which can also be run on a single computer).

You can start crawling the web without actually needing to know how they implemented it.

The page you reference describes how it is implemented.

Technology behind it

They make use of Hadoop, an open-source Java project designed along the same lines as MapReduce. MapReduce is the technology Google uses to crawl and organize the web.

I've attended a few lectures on MapReduce/Hadoop, and unfortunately, I don't know if anyone at this time can explain it in a complete and easy-to-understand way (they're kind of opposites).

Take a look at the Wikipedia page for MapReduce.

The basic idea is to send a job to the Master Node; the Master breaks the work up into pieces and sends it (maps it) to the various Worker Nodes (other computers or threads), which perform their assigned sub-tasks and then send the sub-results back to the Master.

Once the Master Node gets all of the sub-results (or some of them), it starts to combine them (reduce them) into the final answer.

All of these tasks are done at the same time, and each computer is given the right amount of work to keep it occupied the whole time.
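
To make that flow concrete, here is a toy, single-process sketch of the map/reduce pattern in plain Java. This is not Hadoop or Nutch code; the class name and the split into two chunks are just for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Toy single-process illustration of the idea described above: split the work,
// let each "worker" compute a sub-result, then combine (reduce) the sub-results.
public class MapReduceIdea {
    public static void main(String[] args) {
        int[] numbers = IntStream.rangeClosed(1, 100).toArray();

        // "Map": the master splits the input into chunks; each worker sums its chunk.
        // parallelStream() stands in for sending the chunks to different machines.
        List<int[]> chunks = List.of(
            Arrays.copyOfRange(numbers, 0, 50),
            Arrays.copyOfRange(numbers, 50, 100));
        List<Integer> subResults = chunks.parallelStream()
            .map(chunk -> IntStream.of(chunk).sum())
            .collect(Collectors.toList());

        // "Reduce": the master combines the sub-results into the final answer.
        int total = subResults.stream().mapToInt(Integer::intValue).sum();
        System.out.println(total); // 5050
    }
}
```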

How to Crawl

Consists of 4 jobs:

  1. Generate
  2. Fetch
  3. Parse
  4. Update Database

Generate

Start with a list of webpages containing the pages you want to start crawling from: The "Webtable".

The Master node sends all of the pages in that list to its slaves (but if two pages have the same domain they are sent to the same slave).

The Slave takes its assigned webpage(s) and:

  1. Has this page already been generated? If so, skip it.
  2. Normalize the URL, since "http://www.google.com/" and "http://www.google.com/../" are actually the same webpage.
  3. Return an initial score along with the webpage to the Master.

(The Master partitions the webpages when it sends them to its slaves so that they all finish at the same time.)

The Master now chooses the topN (maybe the user just wanted to start with 10 initial pages), and marks them as chosen in the webtable.
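
As a rough sketch of that per-slave generate logic (hypothetical code, not the actual Nutch classes; the normalize and initialScore helpers are simplified stand-ins for Nutch's pluggable URL normalizers and scoring filters):

```java
import java.net.URI;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the generate step on one slave: skip pages that were
// already generated, normalize URLs, and return an initial score to the Master.
public class GenerateSketch {
    private static final Set<String> alreadyGenerated = new HashSet<>();

    // Collapse trivially different forms of the same URL (e.g. "/a/../" -> "/").
    static String normalize(String url) {
        return URI.create(url).normalize().toString();
    }

    // Stand-in for the initial score; real crawlers plug in their own scoring here.
    static float initialScore(String url) {
        return 1.0f;
    }

    public static void main(String[] args) {
        List<String> assigned = List.of(
            "http://www.google.com/",
            "http://www.google.com/a/../"); // normalizes to the same page as above

        Map<String, Float> results = new LinkedHashMap<>();
        for (String url : assigned) {
            String norm = normalize(url);
            if (!alreadyGenerated.add(norm)) continue;  // 1. already generated? skip it
            results.put(norm, initialScore(norm));      // 3. score + page go back to the Master
        }
        System.out.println(results); // both URLs collapse to a single entry
    }
}
```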

Fetch

The Master looks at each URL in the webtable and maps the ones which were marked onto slaves to process them.

The slaves fetch each URL from the Internet as fast as the connection will let them; they keep a queue for each domain, so no single site gets flooded with requests.

They return the URL along with the HTML text of the webpage to the Master.
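
A hedged sketch of those per-domain queues (hypothetical code, not the Nutch fetcher itself, which also handles politeness delays, robots.txt, retries, and parallel fetch threads):

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-domain queues on one fetching slave;
// "fetching" here is just a println.
public class FetchSketch {
    public static void main(String[] args) {
        List<String> assigned = List.of(
            "http://example.com/a", "http://example.org/x", "http://example.com/b");

        // Group the assigned URLs into one FIFO queue per host.
        Map<String, Deque<String>> queues = new HashMap<>();
        for (String url : assigned) {
            String host = URI.create(url).getHost();
            queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }

        // Drain each queue in order, so requests to one site stay spaced out.
        for (Map.Entry<String, Deque<String>> entry : queues.entrySet()) {
            for (String url : entry.getValue()) {
                System.out.println("fetched " + url + " (queue: " + entry.getKey() + ")");
            }
        }
    }
}
```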

Parse

The Master looks at each webpage in the webtable; if it is marked as fetched, it sends it to a slave to parse.

The slave first checks whether the page was already parsed by a different slave; if so, it skips it.

Otherwise, it parses the webpage and saves the result to the webtable.
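
As a toy sketch of what "parse" means here (Nutch actually uses pluggable parsers such as parse-html and Tika; this version just pulls out link targets and anchor text with a regex):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy parse step: extract outlinks (href + anchor text) from fetched HTML.
// Real parsers build a DOM and also extract the page text, title, metadata, etc.
public class ParseSketch {
    private static final Pattern LINK =
        Pattern.compile("<a\\s+href=\"([^\"]+)\"[^>]*>([^<]*)</a>", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        String html = "<html><body>Read the <a href=\"http://example.com/docs\">documentation</a>"
                    + " or the <a href=\"http://example.org/faq\">FAQ</a>.</body></html>";

        Map<String, String> outlinks = new LinkedHashMap<>(); // url -> anchor text
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            outlinks.put(m.group(1), m.group(2));
        }
        // These outlinks (and the anchor text near them) are what the update step scores.
        System.out.println(outlinks);
    }
}
```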

Update Database

The Master looks at each webpage in the webtable and sends the parsed rows to its slaves.

The slaves receive these parsed URLs, calculate a score for each based on the links out of those pages (and the text near those links), and send the URLs and scores back to the Master (the results arrive at the Master sorted by score because of the partitioning).

The master calculates and updates the webpage scores based on the number of links to those pages from other ones.

The master stores this all to the database.
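
The document does not pin down the exact scoring formula, and Nutch's scoring is pluggable, so the following is only a simplified stand-in: a sketch that scores pages by counting how many known pages link to them.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for the update step's scoring: give each page a score
// based on how many known pages link to it. (Nutch's real scoring is pluggable.)
public class UpdateDbSketch {
    public static void main(String[] args) {
        // page -> outlinks discovered during the parse step
        Map<String, List<String>> outlinks = Map.of(
            "http://a.example/", List.of("http://b.example/", "http://c.example/"),
            "http://b.example/", List.of("http://c.example/"),
            "http://c.example/", List.of());

        // Invert the link graph: count inlinks per page.
        Map<String, Integer> inlinkCounts = new TreeMap<>();
        outlinks.values().forEach(targets ->
            targets.forEach(t -> inlinkCounts.merge(t, 1, Integer::sum)));

        // Pages with higher scores get picked earlier in the next generate round.
        System.out.println(inlinkCounts); // {http://b.example/=1, http://c.example/=2}
    }
}
```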

Repeat

When the pages were parsed, the links out of those webpages were added to the webtable. You can now repeat this process on just the pages you haven't looked at yet to keep expanding your set of visited pages. After enough iterations of the four steps above, you will eventually reach most of the Internet.
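
Putting it all together, a crawl is just these four jobs run in rounds. A hedged sketch of the driver loop (the method names are placeholders, not the actual Nutch API; in Nutch 2.x each step is its own MapReduce job):

```java
// Hypothetical driver loop; the methods are placeholders for the four
// MapReduce jobs described above.
public class CrawlLoopSketch {
    static void generate(int topN) { /* pick the topN best unfetched URLs   */ }
    static void fetch()            { /* download the selected pages         */ }
    static void parse()            { /* extract text and outlinks           */ }
    static void updateDb()         { /* add new URLs and recompute scores   */ }

    public static void main(String[] args) {
        int rounds = 3;   // the "depth" of the crawl
        int topN = 10;    // how many pages to fetch per round
        for (int i = 0; i < rounds; i++) {
            generate(topN);
            fetch();
            parse();
            updateDb();
        }
    }
}
```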

Conclusion

MapReduce is a cool system.

A lot of effort has been applied to make it as efficient as possible.

It can handle computers breaking down in the middle of a job and reassign their work to other slaves. It can handle some slaves being faster than others.

The Master may decide to do a slave's task on its own machine instead of sending it out to a slave if that will be more efficient. The communication network is incredibly advanced.

MapReduce lets you write simple code:

Define a Mapper, an optional Partitioner, and a Reducer.

Then let MapReduce figure out how best to run that with all the computing resources it has access to, whether that is a single computer with a slow internet connection or a kilo-cluster (maybe even mega-clusters).
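
To give a feel for what that "simple code" looks like, here is a minimal generic Hadoop skeleton (a word-count example, not one of Nutch's own jobs) showing the three pieces you write; splitting, shuffling, and retrying are the framework's job.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Generic Hadoop word-count skeleton: the Mapper emits (word, 1) pairs, the
// optional Partitioner decides which reducer owns each key, and the Reducer
// sums the values.
public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    // Optional: route keys to reducers (Nutch sends same-domain pages to the
    // same slave in a similar spirit).
    public static class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // final count for this word
        }
    }
}
```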

Mouldy answered 28/7, 2012 at 6:24 Comment(2)
Thanks, really helpful. But one more thing: can you explain more about the "score" mentioned in the "Update Database" section? Does it use the tf*idf algorithm? (I don't think so.) What is the algorithm behind it? – Pelt
@Pelt I'm not exactly sure what they use to determine the score; the document you linked doesn't describe it, so maybe it is covered in other documentation. However, you can probably insert your own scoring code into the crawler. It initially scores the URL in the generate step, then scores it by the text in the anchor tags and the URLs the page links to, and lastly by the URLs which link to that page. You could feasibly score by anything. The scoring metric Google uses is PageRank. – Mouldy
