Nutch on EMR problem reading from S3

Hi I am trying to run Apache Nutch 1.2 on Amazon's EMR.
To do this I specifiy an input directory from S3. I get the following error:

Fetcher: java.lang.IllegalArgumentException:
    This file system object (hdfs://ip-11-202-55-144.ec2.internal:9000)
    does not support access to the request path 
    's3n://crawlResults2/segments/20110823155002/crawl_fetch'
    You possibly called FileSystem.get(conf) when you should have called
    FileSystem.get(uri, conf) to obtain a file system supporting your path.

I understand the difference between FileSystem.get(uri, conf), and FileSystem.get(conf). If I were writing this myself I would FileSystem.get(uri, conf) however I am trying to use existing Nutch code.

I asked this question, and someone told me that I needed to modify hadoop-site.xml to include the following properties: fs.default.name, fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey. I updated these properties in core-site.xml (hadoop-site.xml does not exist), but that didn't make a difference. Does anyone have any other ideas? Thanks for the help.

Recommended topics

Hot tags