Get outlinks from Nutch

I am using Nutch 1.3 to crawl a website. I want to get the list of URLs crawled, and the URLs found on each page (its outlinks).

I can get the list of crawled URLs with the readdb command:

bin/nutch readdb crawl/crawldb -dump file

Is there a way to find the URLs that appear on a given page by reading the crawldb or linkdb?

In org.apache.nutch.parse.html.HtmlParser I see an outlinks array; I am wondering if there is a quick way to access it from the command line.

Acceleration answered 15/9, 2011 at 2:13
To be precise, you mean finding the outlinks of a given page. I don't know that you can do that from the command line, but you should be able to do it by writing a map/reduce job... not that difficult, as I found out (see the sketch below). – Leontina
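
For reference, here is a minimal sketch of the core of such a job, written as a plain reader rather than a full map/reduce job. It pulls the outlinks straight out of a segment's parse_data with the Nutch 1.3 / Hadoop APIs; the segment timestamp and the part-00000 file name are assumptions about your local layout.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;

// Prints "sourceUrl -> outlinkUrl" for every outlink stored in one
// part of a segment's parse_data (a MapFile; records live in part-*/data).
public class DumpOutlinks {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical local path -- point this at your own segment.
    Path data = new Path("crawl/segments/20110919084424/parse_data/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    ParseData parseData = new ParseData();
    while (reader.next(url, parseData)) {
      for (Outlink outlink : parseData.getOutlinks()) {
        System.out.println(url + " -> " + outlink.getToUrl());
      }
    }
    reader.close();
  }
}
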
From the command line, you can see the outlinks by using readseg with the -dump or -get option. For example:

bin/nutch readseg -dump crawl/segments/20110919084424/ outputdir2 -nocontent -nofetch -nogenerate -noparse -noparsetext

less outputdir2/dump
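
The dump mixes several record types per URL; assuming the usual ParseData text format, where each outlink is printed on its own "outlink:" line, you can filter for just the outlinks:

grep "outlink:" outputdir2/dump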
Acceleration answered 20/9, 2011 at 16:40
You can easily do this with the readlinkdb command. It gives you the inlinks of each URL and, since every inlink records the fromUrl it came from, the link relationships between pages as well.

bin/nutch readlinkdb <linkdb> (-dump <out_dir> | -url <url>)

linkdb: the linkdb directory we wish to read and obtain information from.

-dump <out_dir>: dumps the whole linkdb as a text file into whatever out_dir we specify.

-url <url>: gives us information about a specific URL; the result is written to System.out.

e.g. 

bin/nutch readlinkdb crawl/linkdb -dump myoutput/out1
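
Or, for a single page (the URL here is just a placeholder):

bin/nutch readlinkdb crawl/linkdb -url http://example.com/somepage.html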

For more information, refer to http://wiki.apache.org/nutch/bin/nutch%20readlinkdb

Aidoneus answered 26/12, 2013 at 12:25
