Configuring external data source for Elastic MapReduce

Asked 29/8, 2012 at 12:0 Answered 24/6, 2013 at 5:46

We want to run Amazon Elastic MapReduce (EMR) on top of our current database (Cassandra on EC2). According to the Amazon EMR FAQ, this should be possible; see the entry "Q: Can I load my data from the internet or somewhere other than Amazon S3?"

However, when creating a new job flow, we can only configure an S3 bucket as the input data source.

Any ideas or samples on how to do this?

Thanks!

P.S.: I've seen the question "How to use external data with Elastic MapReduce", but the answers only say that it is possible; they do not explain how to do it or configure it.

Lorenzoloresz answered 29/8, 2012 at 12:0

How are you processing the data? EMR is just managed Hadoop. You still need to write the job that reads and processes the data.

If you are writing a Hadoop MapReduce job, then you are writing Java, and you can use the Cassandra APIs to access the data.

If you want to use something like Hive, you will need to write a Hive storage handler to query data backed by Cassandra.

Allusion answered 24/6, 2013 at 5:46

Try using scp to copy files to your EMR instance:

    my-desktop-box$ scp mylocaldatafile my-emr-node:/path/to/local/file

(or use ftp, wget, curl, or anything else you want)

Then log into your EMR instance with ssh and load the file into HDFS:

    my-desktop-box$ ssh my-emr-node
    my-emr-node$ hadoop fs -put /path/to/local/file /path/in/hdfs/file
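
The two steps above could be wrapped in a small script. This is only a sketch: it assumes `my-emr-node` is reachable over SSH and that the paths contain no spaces. The function names `run` and `stage_to_hdfs` and the `DRY_RUN` switch are made up for illustration and are not part of EMR or Hadoop.

```shell
# Sketch: copy a local file to an EMR node, then load it into HDFS.
# Host name and paths are placeholders, as in the commands above.

# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: stage_to_hdfs LOCAL_FILE EMR_NODE REMOTE_PATH HDFS_PATH
stage_to_hdfs() {
    # Step 1: copy the file onto the node's local disk.
    run scp "$1" "$2:$3" || return 1
    # Step 2: on the node, load the file from local disk into HDFS.
    run ssh "$2" hadoop fs -put "$3" "$4"
}
```

Setting `DRY_RUN=1` prints the commands without touching the cluster, which is a cheap way to check the arguments first:

    my-desktop-box$ DRY_RUN=1 stage_to_hdfs mylocaldatafile my-emr-node /path/to/local/file /path/in/hdfs/file
    scp mylocaldatafile my-emr-node:/path/to/local/file
    ssh my-emr-node hadoop fs -put /path/to/local/file /path/in/hdfs/file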

Handbreadth answered 27/3, 2013 at 5:53