Nutch on EMR problem reading from S3
Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR.
To do this I specify an input directory from S3. I get the following error:

Fetcher: java.lang.IllegalArgumentException:
    This file system object (hdfs://ip-11-202-55-144.ec2.internal:9000)
    does not support access to the request path 
    's3n://crawlResults2/segments/20110823155002/crawl_fetch'
    You possibly called FileSystem.get(conf) when you should have called
    FileSystem.get(uri, conf) to obtain a file system supporting your path.

I understand the difference between FileSystem.get(uri, conf) and FileSystem.get(conf). If I were writing this myself I would call FileSystem.get(uri, conf); however, I am trying to use existing Nutch code.
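To illustrate the mismatch the exception is complaining about, here is a minimal sketch in plain Java (no Hadoop dependency, so the URIs are just parsed, not opened): FileSystem.get(conf) resolves the file system from the scheme of fs.default.name, while FileSystem.get(uri, conf) resolves it from the scheme of the path itself.

```java
import java.net.URI;

public class SchemeMismatchDemo {
    public static void main(String[] args) {
        // The path from the stack trace above.
        URI path = URI.create("s3n://crawlResults2/segments/20110823155002/crawl_fetch");
        // What fs.default.name points at on the EMR cluster.
        URI defaultFs = URI.create("hdfs://ip-11-202-55-144.ec2.internal:9000");

        // FileSystem.get(conf) selects the implementation for the default
        // scheme ("hdfs"); FileSystem.get(uri, conf) would select it for the
        // path's scheme ("s3n"). The mismatch below is what throws.
        System.out.println("default scheme: " + defaultFs.getScheme());
        System.out.println("path scheme:    " + path.getScheme());
        System.out.println("schemes match:  " + defaultFs.getScheme().equals(path.getScheme()));
    }
}
```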

I asked this question, and someone told me that I needed to modify hadoop-site.xml to include the following properties: fs.default.name, fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey. I updated these properties in core-site.xml (hadoop-site.xml does not exist), but that didn't make a difference. Does anyone have any other ideas? Thanks for the help.
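For reference, the properties I set in core-site.xml look roughly like this (bucket name and keys are placeholders). One thing I am unsure about: since my path uses the s3n:// scheme, perhaps the fs.s3n.* credential properties are the ones that apply, rather than fs.s3.*:

```xml
<!-- core-site.xml; values below are placeholders -->
<property>
  <name>fs.default.name</name>
  <value>s3n://crawlResults2</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```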

Elect answered 30/8, 2011 at 1:52 Comment(1)
Never used Nutch, but maybe check whether the resource you are trying to read is publicly accessible (it won't hurt to try, just for testing); also try replacing (again, just for testing) s3n:// with s3://. I would expect it to work with s3n and credentials specified, but more tests won't hurt. – Gardal

Try specifying the default file system in

hadoop-site.xml

<property>
  <name>fs.default.name</name>
  <value>s3://your-bucket-name</value>
</property>

Note that fs.default.name takes a file system URI, not a class name; the class org.apache.hadoop.fs.s3.S3FileSystem is the implementation that Hadoop maps the s3:// scheme to (via fs.s3.impl).

This tells Hadoop, and therefore Nutch, to use S3 by default.

The properties

fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey

only need to be set when your S3 objects require authentication (an S3 object can be accessible to all users, or only to authenticated users).

Eloquence answered 12/3, 2014 at 8:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.