Web Crawler Algorithm: depth?

I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch for example: http://wiki.apache.org/nutch/NutchTutorial

depth indicates the link depth from the root page that should be crawled.

So, say I have the domain www.domain.com and want to crawl to a depth of, say, 3 -- what do I need to do? If a site could be represented as a binary tree, I don't think it would be a problem.

Interpreter answered 4/12, 2010 at 23:54 Comment(1)
You said a site could be represented as a binary tree, but I think it is better represented as a graph, since pages may link to each other more than once and links can cross each other. You may even have dead-end pages that don't link to any other page, or only to themselves. So we can consider a web site, or even the whole internet, as a graph I think.Platitudinize

Link depth means the number of "hops" a page is away from the root, where a "hop" means following a link on a page. The reason Nutch has this restriction is that links very "far away" from the main page are unlikely to hold much information (the main page will link to the most important information, so the farther you get, the more detailed the info), while there can be very many such pages, so they take up lots of storage space, computing time for ranking, and bandwidth.

Nutch thus uses an algorithm scheme known as depth-limited search to bound its running time and space usage. If it didn't use this heuristic, it would have to crawl an entire site to rank all the pages in it and find the top N.

To crawl to depth 3, implement this algorithm and give it a depth bound of three. The nice thing about depth-limited search is that it's a variant of depth-first search (DFS), so it's quite space-efficient:

function depth-limited-crawl(page p, int d)
    if d == 0
        return
    /* do something with p, store it or so */
    foreach (page l in links(p))
        depth-limited-crawl(l, d-1)
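
A rough, runnable Python version of the same idea -- a minimal sketch, not production code. The fetching and href extraction below are naive stand-ins, and a visited set is added so the crawl doesn't loop forever on the link graph:

import re
import urllib.request
from urllib.parse import urljoin

visited = set()  # the web is a graph, so remember pages we've already processed

def links(url):
    # Naive link extraction: fetch the page and pull out href attribute values.
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    except Exception:
        return []
    return [urljoin(url, href) for href in re.findall(r'href="([^"]+)"', html)]

def depth_limited_crawl(url, d):
    if d == 0 or url in visited:
        return
    visited.add(url)
    print(url)  # do something with the page here, store it or so
    for linked_url in links(url):
        depth_limited_crawl(linked_url, d - 1)

# depth_limited_crawl("http://www.domain.com/", 3)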

And no, a site cannot in general be represented as a binary tree; it's a directed graph. Even if you somehow removed the links that create cycles and duplicate paths, you would get a multiway tree, not a binary one. Either way, many sites are too large for your crawler to store in full.

Setter answered 5/12, 2010 at 0:2 Comment(2)
I'll try depth limited search. In your pseudo-code, how is the bound of 3 enforced?Interpreter
By the recursion: you call it with a second argument of 3. d = 0, the base case, returns immediately; in the recursive call, d is decremented.Setter

I guess the "depth" is the number of times the crawler "follows a link".

Say you start from the root page. You follow each of the links on this page: this is depth 1. For each of the target pages, you follow the links: this is depth 2, etc.

Note that there may be "cycles" while following links. The structure is not a tree, but a graph.

Savaii answered 4/12, 2010 at 23:58 Comment(0)
  1. Make a list that you use as a queue.
  2. Append www.domain.com with depth 0 to it.
  3. Pull the first element off it.
  4. The current depth is the element's depth + 1.
  5. Crawl that site.
  6. Append each link on the site to the queue, tagged with the current depth, if the current depth isn't greater than the maximum depth.
  7. If the list isn't empty, go back to 3 (a code sketch of this loop follows below).
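
A minimal Python sketch of this queue-based procedure (breadth-first). The links() callable is an assumption -- anything that returns the URLs found on a page will do, e.g. the naive extractor from the earlier answer:

from collections import deque

def crawl_bfs(root, max_depth, links):
    queue = deque([(root, 0)])            # step 2: seed the queue at depth 0
    seen = {root}                         # the link graph has cycles, so track visited URLs
    while queue:                          # step 7: repeat until the queue is empty
        url, depth = queue.popleft()      # step 3: pull the first element
        current_depth = depth + 1         # step 4
        print(url, depth)                 # step 5: crawl/process the page here
        if current_depth > max_depth:     # step 6: only enqueue links within the limit
            continue
        for linked in links(url):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, current_depth))

# crawl_bfs("http://www.domain.com/", 3, links)
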
Selective answered 4/12, 2010 at 23:59 Comment(1)
Do you mean a FIFO queue? This is depth-limited search, so it can be done much more efficiently with a stack.Setter

Link depth means how many links you have to follow before you reach a given link.

Example: example.com links to example.com/foo.html which links to google.com. Therefore google.com has a link depth of 2 relative to example.com as you can reach it following 2 links.

To crawl example.com to a depth of 3, you would follow links to a maximum depth of 3 and then stop following links. Without this restriction, you could easily go on forever.

Example: example.com links to example.com/foo.html and example.com/bar.html. You follow those two links, then the links they link to, and stop there.

Note: The root page has a depth of 0.

Dulcimer answered 4/12, 2010 at 23:56 Comment(0)

A web site's root is at depth 0. Documents you can reach by using links in the root are at depth 1. Documents you can in turn reach from links in documents at depth 1 would be at depth 2. And so on.

Depending on your crawler, this might apply only to documents in the same site/domain (the usual case) or also to documents hosted elsewhere.

Most web sites are not representable as a binary tree, as the root (and any other page) may link to more than two other pages.

Hesky answered 4/12, 2010 at 23:58 Comment(0)

Well, in the case of Nutch the depth argument is quite a misnomer; it just means the number of crawl loops (fetch rounds) the crawler goes through. So you will reach pages which are 3 links away from your seed URLs... on a given site that may indeed be depth 3... that is, if they end up being within the topN limits.
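
For reference, the one-shot crawl command in the linked tutorial of that era took both parameters, along the lines of: bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (the directory and seed names are just examples). Roughly speaking, each of the 3 rounds generates a fetch list of at most topN URLs, fetches them, and updates the crawl database with any newly discovered links, so "depth" really means "number of rounds".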

Ronni answered 5/12, 2010 at 19:50 Comment(0)

Depth is the number of slashes in the URL path.

Example: http://www.google.com/foo/bar/baz has a depth of 3.

import urlparse  # Python 2 module; on Python 3 use urllib.parse instead

def depth(link):
    return len(urlparse.urlparse(link).path.split("/")) - 1
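
For the URL in the example, this gives 3:

>>> depth("http://www.google.com/foo/bar/baz")
3
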
Quittance answered 8/4, 2014 at 5:54 Comment(1)
This is not true. Larsmans gives the correct answer: if google.com links to google.com/service/contact/phone, then that page is at depth 1, not 3 as you say.Boulogne
