Where is the crawled data stored when running the Nutch crawler?
I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis.

I followed the link https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search the text in the future) and ran the crawl using a few URLs as the seed.

Now I can't find the text/HTML data on my local machine. Where can I find the data, and what is the best way to read it in text format?

Versions

  • apache-nutch-1.9
  • solr-4.10.4
Saltcellar answered 30/3, 2015 at 9:43

After your crawl is over, you can use the bin/nutch dump command to dump all the fetched pages in plain HTML format.

The usage is as follows:

$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
   [-segment <segment>]
 -h,--help                show this help message
 -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                          all others. Defaults to all.
 -outputDir <outputDir>   output directory (which will be created) to host
                          the raw data
 -segment <segment>       the segment(s) to use

So, for example, you could do something like:

$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/

This will create a new directory at the -outputDir location and dump all the crawled pages in HTML format.

There are many more ways of dumping specific data out of Nutch; have a look at https://wiki.apache.org/nutch/CommandLineOptions

Mauchi answered 3/4, 2015 at 5:14
Thanks for the info. I did it in a different way: there is a file named 'data' inside the "segments/2*************/content/part-00000" folder, and it is a Hadoop sequence file. I wrote a Java program to convert it into text (see the sketch below). Your answer is pretty straightforward and very informative. – Saltcellar
How to get this for Nutch 2? – Dunite
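
For reference, a reader along the lines described in the first comment could look like the sketch below. This is a minimal illustration, not a definitive implementation: it reads a segment's content data file (a Hadoop SequenceFile keyed by URL, with org.apache.nutch.protocol.Content values) and prints each page. The class name and the command-line argument are hypothetical, and it assumes the Hadoop and Nutch jars shipped with Nutch 1.9 are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

// Illustrative reader for a Nutch 1.x segment content file; the class
// name is hypothetical, and the path is passed on the command line,
// e.g. crawl/segments/<timestamp>/content/part-00000/data
public class SegmentContentReader {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Segment content is stored as a Hadoop SequenceFile:
    // key = page URL (Text), value = raw fetched content (Content).
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();
      Content value = new Content();
      while (reader.next(key, value)) {
        System.out.println("URL:          " + key);
        System.out.println("Content-Type: " + value.getContentType());
        // Raw bytes of the fetched page (HTML, etc.)
        System.out.println(new String(value.getContent()));
      }
    } finally {
      reader.close();
    }
  }
}

Compile it against the jars that ship with the Nutch distribution and point it at a segment's content data file. If you want the parsed plain text rather than the raw HTML, a segment's parse_text directory can be read the same way, using org.apache.nutch.parse.ParseText as the value class.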
