How to get the HTML content from Nutch

Is there any way to get the HTML content of each web page in Nutch while crawling?

Perilymph answered 25/2, 2011 at 23:16 Comment(0)

Yes, you can actually export the content of the crawled segments. It is not straightforward, but it works well for me. First, create a Java project with the following code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

import java.io.File;
import java.io.FileOutputStream;

public class NutchSegmentOutputParser {

public static void main(String[] args) {

    if (args.length != 2) {
        System.out.println("usage: NutchSegmentOutputParser <segmentdir> <outputdir>");
        return;
    }

    try {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);


        String segment = args[0];

        File outDir = new File(args[1]);
        if (!outDir.exists()) {
            if (outDir.mkdir()) {
                System.out.println("Creating output dir " + outDir.getAbsolutePath());
            }
        }

        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);


        Text key = new Text();
        Content content = new Content();

        while (reader.next(key, content)) {
            String filename = key.toString().replaceFirst("http://", "").replaceAll("/", "___").trim();

            File f = new File(outDir.getCanonicalPath() + "/" + filename);
            FileOutputStream fos = new FileOutputStream(f);
            fos.write(content.getContent());
            fos.close();
            System.out.println(f.getAbsolutePath());
        }
        reader.close();
        fs.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

}

}

I recommend using Maven; add the following dependencies:

    <dependency>
        <groupId>org.apache.nutch</groupId>
        <artifactId>nutch</artifactId>
        <version>1.5.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>0.23.1</version>
    </dependency>

Then build a jar package (e.g. NutchSegmentOutputParser.jar).

You need Hadoop to be installed on your machine. Then run:

$/hadoop-dir/bin/hadoop --config \
NutchSegmentOutputParser.jar:~/.m2/repository/org/apache/nutch/nutch/1.5.1/nutch-1.5.1.jar \
NutchSegmentOutputParser nutch-crawled-dir/2012xxxxxxxxx/ outdir

where nutch-crawled-dir/2012xxxxxxxxx/ is the segment directory you want to extract content from (the code reads the part-00000/data file under its Content.DIR_NAME subdirectory) and outdir is the output directory. The output file names are generated from the URL: the leading "http://" is removed and each slash is replaced by "___".
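The file-name mapping can be tried in isolation. The following is a minimal sketch that only reuses the string operations from the loop above; the class name UrlToFileName, the method name toFileName, and the example URL are made up for illustration and are not part of Nutch.

```java
public class UrlToFileName {

    // Same chain as in the answer's loop: drop the first "http://",
    // then turn every "/" into "___" and trim surrounding whitespace.
    static String toFileName(String url) {
        return url.replaceFirst("http://", "").replaceAll("/", "___").trim();
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only to show the mapping.
        System.out.println(toFileName("http://example.com/docs/index.html"));
        // prints "example.com___docs___index.html"
    }
}
```

Note that only "http://" is stripped, so an https URL would keep its scheme in the file name, and a long URL may exceed the file system's name length limit.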

Hope it helps.

Supertanker answered 24/10, 2012 at 6:56 Comment(1)

Great answer. I've created the SequenceFileReader for Spring Batch, – Painting 6/5, 2015 at 10:52

Try this:

public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    LOG.info("parse.getText: " + parse.getText());
    return parseResult;
}

Then check the content in hadoop.log.

Estray answered 25/1, 2012 at 10:44 Comment(0)

It's super basic.

public ParseResult getParse(Content content) {
    LOG.info("getContent: " + new String(content.getContent()));
    // ... rest of the parser is unchanged ...
}

The Content object has a method getContent(), which returns a byte array. Just have Java create a new String() from that byte array, and you've got the raw HTML of whatever Nutch fetched.
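One caveat: new String(byte[]) decodes with the platform default charset. Here is a minimal sketch of decoding with an explicit charset instead. It assumes the page is UTF-8, which may not hold for every fetched page, and it uses a plain byte array as a stand-in for content.getContent() so that it runs without Nutch. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class DecodeContentBytes {

    // Decode raw page bytes as UTF-8 (an assumption; the real page
    // encoding may differ) rather than the platform default charset.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the byte[] that content.getContent() would return.
        byte[] raw = "<html><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(raw));
    }
}
```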

I'm using Nutch 1.9.

Here's the JavaDoc for org.apache.nutch.protocol.Content (the linked page is for version 1.2): https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/protocol/Content.html#getContent()

Dioscuri answered 24/3, 2015 at 17:36 Comment(0)

Yes, there is a way. Have a look at cache.jsp to see how it displays the cached data.

Selffertilization answered 8/3, 2011 at 17:19 Comment(0)
