I am eager to know (and need to know) about Nutch and the algorithms it uses to fetch, classify, and so on (crawling in general), because it relates to my project.
I read this material, but it is a little hard to understand.
Is there anyone who can explain it to me in a complete and easy-to-understand way?
Thanks in advance.
Short Answer
In short, they have developed a web crawler designed to crawl the web very efficiently from a multi-computer environment (but it can also run on a single computer).
You can start crawling the web without actually needing to know how they implemented it.
The page you reference describes how it is implemented.
Technology behind it
They make use of Hadoop, an open-source Java project designed along the same lines as MapReduce, the programming model Google uses to index and organize the web.
I've attended a few lectures on MapReduce/Hadoop, and unfortunately, I don't know if anyone can explain it in a way that is both complete and easy to understand (those two goals tend to pull in opposite directions).
Take a look at the wikipedia page for MapReduce.
The basic idea is that you send a job to the Master Node; the Master breaks the work up into pieces and sends them (maps them) to the various Worker Nodes (other computers or threads), which perform their assigned sub-tasks and send the sub-results back to the Master.
Once the Master Node has all the sub-results (or some of them), it starts to combine them (reduce them) into the final answer.
All of these tasks are done at the same time, and each computer is given the right amount of work to keep it occupied the whole time.
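To make the map/reduce split concrete, here is a minimal word-count sketch against the Hadoop Java MapReduce API. Word count is the standard "hello world" for Hadoop; it is not Nutch code, just an illustration of the Mapper/Reducer roles described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The Master splits the input and sends chunks to workers running this Mapper.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);  // emit (word, 1) as a sub-result
            }
        }
    }

    // Sub-results that share a key are combined ("reduced") into the final count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```

Nutch's generate/fetch/parse/update jobs (described below) are structured as MapReduce jobs in this same style, just with webpages and scores instead of words and counts.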
How to Crawl
Crawling consists of four jobs:
- Generate
- Fetch
- Parse
- Update Database
Generate
Start with a list of the webpages you want to begin crawling from: the "webtable".
The Master node sends all of the pages in that list to its slaves (but if two pages have the same domain they are sent to the same slave).
The Slave takes its assigned webpage(s) and:
- Has this already been generated? If so, skip it.
- Normalize the URL, since "http://www.google.com/" and "http://www.google.com/../" are actually the same webpage.
- Return an initial score along with the webpage to the Master.
(The Master partitions the webpages when it sends them to its slaves so that they all finish at about the same time; a rough partitioner sketch follows below.)
The Master now chooses the topN (maybe the user just wanted to start with 10 initial pages), and marks them as chosen in the webtable.
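To illustrate the "same domain goes to the same slave" routing mentioned above, here is a rough sketch of a Hadoop Partitioner that hashes on the host part of a URL. Nutch ships its own partitioning and URL-normalization code; the class below is only a toy, with made-up names, to show the idea.

```java
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes every URL from the same host to the same reducer (slave).
public class HostPartitioner extends Partitioner<Text, Writable> {
    @Override
    public int getPartition(Text url, Writable value, int numPartitions) {
        String host;
        try {
            host = new URL(url.toString()).getHost();
        } catch (MalformedURLException e) {
            host = url.toString();  // fall back to hashing the raw string
        }
        // Same host -> same hash -> same partition -> same slave.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```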
Fetch
The Master looks at each URL in the webtable and maps the ones that were marked onto slaves to process.
The slaves fetch each URL from the Internet as fast as the connection lets them, keeping a queue for each domain so that no single host is overloaded.
They return the URL along with the HTML text of the webpage to the Master.
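Here is a much-simplified, single-threaded sketch of the per-domain queue idea. Nutch's real fetcher is multi-threaded and enforces politeness delays; the class below is made up purely for illustration and uses the standard java.net.http client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class SimpleFetcher {
    private final HttpClient client = HttpClient.newHttpClient();

    /** Fetches every URL, grouping them into one queue per host. */
    public Map<String, String> fetchAll(List<String> urls) throws Exception {
        // Group URLs into one queue per host.
        Map<String, Queue<String>> queues = new HashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }

        // Drain the queues round-robin, one URL per host per pass, so no
        // single host receives back-to-back requests.
        Map<String, String> htmlByUrl = new HashMap<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Queue<String> queue : queues.values()) {
                String url = queue.poll();
                if (url == null) continue;
                progress = true;
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                htmlByUrl.put(url, response.body());  // URL -> HTML text for the Master
            }
        }
        return htmlByUrl;
    }
}
```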
Parse
The Master looks at each webpage in the webtable; if it is marked as fetched, it sends it to its slaves to parse.
A slave first checks whether the page was already parsed by a different slave and, if so, skips it.
Otherwise, it parses the webpage and saves the result to the webtable.
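As a toy illustration of the parse step, the sketch below pulls outlinks out of fetched HTML with a regular expression. Nutch actually uses pluggable HTML parsers rather than a regex; this is only to show the kind of result that gets saved back to the webtable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleParser {
    // Naive pattern for href="..." attributes; good enough for a sketch only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the outlinks found in a page; these are what get added to the webtable. */
    public static List<String> extractOutlinks(String html) {
        List<String> outlinks = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            outlinks.add(m.group(1));
        }
        return outlinks;
    }
}
```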
Update Database
The Master looks at each webpage in the webtable and sends the parsed rows to its slaves.
The slaves receive these parsed URLs, calculate a score for each based on the number of links pointing away from those pages (and the text near those links), and send the URLs and scores back to the Master (the results arrive at the Master sorted by score, thanks to the partitioning).
The Master then calculates and updates each webpage's score based on the number of links pointing to it from other pages.
The Master stores all of this in the database.
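The scoring idea can be sketched like this: each page passes a share of its current score along its outlinks, and a page's new score is the sum of the shares it receives. Nutch's default scoring works along broadly similar lines, but the code below is only a toy, not its actual scoring plugin.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleScorer {

    /**
     * @param scores   current score of each page
     * @param outlinks outlinks of each page (produced by the parse step)
     * @return updated scores based on the inbound contributions each page receives
     */
    public static Map<String, Double> updateScores(Map<String, Double> scores,
                                                   Map<String, List<String>> outlinks) {
        Map<String, Double> updated = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
            // Each page splits its score evenly among its outlinks.
            double share = scores.getOrDefault(entry.getKey(), 0.0)
                    / Math.max(1, entry.getValue().size());
            for (String target : entry.getValue()) {
                updated.merge(target, share, Double::sum);  // sum inbound contributions
            }
        }
        return updated;
    }
}
```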
Repeat
When the pages were parsed, the links out of those webpages were added to the webtable. You can now repeat this process on just the pages you haven't looked at yet to keep expanding your set of visited pages. After enough iterations of the four steps above, you will eventually reach most of the Internet.
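Put together, the whole crawl is just this loop repeated. The method names below are placeholders for the four jobs described above, not real Nutch APIs.

```java
public class CrawlLoop {
    public static void crawl(int rounds, int topN) throws Exception {
        for (int round = 0; round < rounds; round++) {
            runGenerate(topN);  // pick the next topN pages not yet crawled
            runFetch();         // download their HTML
            runParse();         // extract text and outlinks into the webtable
            runUpdateDb();      // update scores; new outlinks become candidates
        }
    }

    // Placeholders: in Nutch each of these is a separate MapReduce job.
    static void runGenerate(int topN) throws Exception { /* submit generate job */ }
    static void runFetch() throws Exception { /* submit fetch job */ }
    static void runParse() throws Exception { /* submit parse job */ }
    static void runUpdateDb() throws Exception { /* submit updatedb job */ }
}
```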
Conclusion
MapReduce is a cool system.
A lot of effort has gone into making it as efficient as possible.
It can handle computers breaking down in the middle of a job and reassign their work to other slaves, and it can handle some slaves being faster than others.
The Master may decide to run a slave's task on its own machine instead of sending it out if that would be more efficient. The communication layer is quite sophisticated.
MapReduce lets you write simple code:
Define a Mapper, an optional Partitioner, and a Reducer.
Then let MapReduce figure out how best to run it with all the computing resources it has access to, whether that is a single computer with a slow internet connection or a kilo-cluster (maybe even mega-clusters).
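For example, a minimal Hadoop driver that wires a Mapper, an optional Partitioner, and a Reducer into a job looks roughly like this (reusing the toy word-count and partitioner classes sketched earlier; the paths are just example arguments):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(Driver.class);

        job.setMapperClass(WordCount.TokenMapper.class);    // map step
        // job.setPartitionerClass(HostPartitioner.class);  // optional partitioner
        job.setReducerClass(WordCount.SumReducer.class);     // reduce step

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // MapReduce figures out how to spread the work across whatever
        // cluster (or single machine) it is running on.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```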