elastic-map-reduce Questions
3
Solved
Everything works fine locally when I do as follows:
cat input | python mapper.py | sort | python reducer.py
However, when I run the streaming MapReduce job on AWS Elastic MapReduce, the job does...
Toney asked 26/3, 2012 at 23:15
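A minimal sketch of the local check described above, reproducing `cat input | python mapper.py | sort | python reducer.py` in one process. The original mapper.py and reducer.py are not shown, so the word-count logic and the function names here are assumptions chosen purely for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """In-process stand-in for: cat input | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))
```

A pipeline that passes locally can still fail on the cluster for reasons this check cannot see (missing interpreter line, file permissions, dependencies absent on the nodes), so it is only a first step.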
1
I am trying to find out how many MASTER, CORE, and TASK instances are optimal for my jobs. I couldn't find any tutorial that explains how to figure it out.
How do I know if I need more than 1 core i...
Uella asked 29/4, 2014 at 9:29
2
Solved
I've written a Hadoop program which requires a certain layout within HDFS, and afterwards I need to get the files out of HDFS. It works on my single-node Hadoop setup and I'm eager to get it...
Backhand asked 9/10, 2011 at 5:42
1
Solved
So I'm trying to query my HBase cluster on Amazon EC2 using a custom jar I launch as a MapReduce step. In my jar (inside the map function) I call HBase like so:
public void map( Text key, BytesWritab...
Semifinal asked 28/2, 2014 at 20:22
5
Solved
I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past and I do ...
Purvey asked 9/1, 2013 at 11:3
2
I have a huge DynamoDB table that I want to analyze to aggregate data that is stored in its attributes. The aggregated data should then be processed by a Java application.
While I understand the re...
Embroidery asked 8/4, 2012 at 23:5
2
I'm very new to Amazon services. I'm facing problems in creating job flows. Every time I create a job flow it fails or shuts down. Input, output or mapper function upload techniques are not clear...
Cuckoopint asked 22/1, 2013 at 11:57
2
Solved
I'm sending code to Amazon's EMR via the mrjob/boto modules. I've got some external Python dependencies (e.g. numpy, boto, etc.) and currently have to download the source of the Python packages, and ...
Occlusive asked 9/7, 2013 at 21:24
1
Solved
I am getting a "No space left on device" error when I run my Amazon EMR jobs using m1.large as the instance type for the Hadoop instances created by the job flow. The job generates app...
Reinhardt asked 24/10, 2013 at 9:7
1
Solved
I have a mapper and reducer that work fine when I run them in the piped version:
cat data.csv | ./mapper.py | sort -k1,1 | ./reducer.py
I used the Elastic MapReduce wizard, loaded inputs, outpu...
Stoffel asked 1/9, 2013 at 7:34
1
Solved
I've loaded tab-separated files into S3 with this folder structure under the bucket:
bucket --> se --> y=2013 --> m=07 --> d=14 --> h=00
Each subfolder has 1 file that represents one hour of my ...
Contributory asked 14/7, 2013 at 13:33
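The folder layout in this question (bucket, then se, then y=, m=, d=, h= levels) can be generated from a timestamp. The sketch below is only an illustration: the function name `hour_prefix` and the `s3://` URL form are assumptions, and the literal `se` segment is copied from the excerpt.

```python
from datetime import datetime


def hour_prefix(bucket, ts):
    """Build the per-hour prefix bucket/se/y=YYYY/m=MM/d=DD/h=HH/ for a timestamp."""
    return f"s3://{bucket}/se/y={ts:%Y}/m={ts:%m}/d={ts:%d}/h={ts:%H}/"


if __name__ == "__main__":
    # The bucket name is a placeholder, not a real bucket.
    print(hour_prefix("bucket", datetime(2013, 7, 14, 0)))
```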
2
Solved
My map function produces a
Key\tValue
Value = List(value1, value2, value3)
then my reduce function produces:
Key\tCSV-Line
Ex.
2323232-2322 fdsfs,sdfs,dfsfs,0,0,0,2,fsda,3,23,3,s,
2323555...
Strive asked 26/6, 2013 at 23:38
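To make the record shapes in this question concrete, here is a small sketch of a reducer-side helper that turns a key and its list of values into one `Key\tCSV-Line` record. The helper names are hypothetical and the question's actual reducer is not shown. Python's `csv` module is used so that values containing commas are quoted rather than silently splitting into extra columns.

```python
import csv
import io


def to_csv_line(key, values):
    """Format one reducer output record: key, a tab, then the values as one CSV row."""
    buf = io.StringIO()
    # lineterminator="" keeps the row on a single line without a trailing newline.
    csv.writer(buf, lineterminator="").writerow(values)
    return f"{key}\t{buf.getvalue()}"


def split_record(record):
    """Split a 'Key<TAB>Value' record on the first tab only."""
    key, _, value = record.partition("\t")
    return key, value
```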
2
We want to use Amazon Elastic MapReduce on top of our current DB (we are using Cassandra on EC2). Looking at the Amazon EMR FAQ, it should be possible:
Amazon EMR FAQ: Q: Can I load my data from th...
Lorenzoloresz asked 29/8, 2012 at 12:0
2
Solved
I want to copy just a single file to HDFS using s3distcp. I have tried using the srcPattern argument but it didn't help and it keeps on throwing java.lang.Runtime exception.
It is possible that the...
Vander asked 21/11, 2012 at 13:38
2
I was trying to programmatically load a DynamoDB table into HDFS (via Java, and not Hive). I couldn't find examples online on how to do it, so thought I'd download the jar containing org.apache.hado...
Villous asked 13/6, 2013 at 1:5
1
Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.
I'm planning to use Redshift, which recommends splitting data into multiple fil...
Lapierre asked 8/5, 2013 at 20:28
2
Solved
I have an EMR streaming job (Python) which normally works fine (e.g. 10 machines processing 200 inputs). However, when I run it against large data sets (12 machines processing a total of 6000 input...
Myogenic asked 15/8, 2012 at 13:59
1
I want help understanding the algorithm. I've pasted the algorithm explanation first and then my doubts.
Algorithm (for calculating the overlap between record pairs):
Given a user-defined paramet...
Tucson asked 10/3, 2013 at 6:5
3
I am using EMR to analyze nginx web logs. But I need to process the logs so that they fall into rows and columns, to make querying easy. Thus I made two tables - rawlog, processedl...
Beefsteak asked 8/6, 2012 at 10:21
1
Solved
I just found that using Amazon's Elastic Map Reduce, I can specify a step to have one of three ActionOnFailure choices:
TERMINATE_JOB_FLOW
CANCEL_AND_WAIT
CONTINUE
TERMINATE_JOB_FLOW is the def...
Gratia asked 7/3, 2013 at 21:19
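As a rough illustration of where the ActionOnFailure choice sits in a step definition, the sketch below builds a step as a plain dictionary and rejects any value other than the three listed in the question. The dictionary layout follows the shape of the `Steps` entries accepted by boto3's EMR `add_job_flow_steps` call as far as I understand it, which is an assumption to check against the SDK you use. `make_step` and the jar path in the usage note are hypothetical.

```python
# The three choices named in the question.
VALID_ACTIONS = {"TERMINATE_JOB_FLOW", "CANCEL_AND_WAIT", "CONTINUE"}


def make_step(name, jar, args, action_on_failure):
    """Return a step definition dict, validating the ActionOnFailure value first."""
    if action_on_failure not in VALID_ACTIONS:
        raise ValueError(f"unsupported ActionOnFailure: {action_on_failure!r}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }
```

For example, `make_step("parse-logs", "s3://my-bucket/job.jar", ["in", "out"], "CONTINUE")` describes a step whose failure should not stop later steps. The action is a required argument here so that the choice is always explicit in the calling code.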
2
Solved
The reduce phase of the job fails with:
of failed Reduce Tasks exceeded allowed limit.
The reason why each task fails is:
Task attempt_201301251556_1637_r_000005_0 failed to report status for 60...
Adina asked 7/3, 2013 at 20:42
3
From Amazon's EMR FAQ:
Q: Can I load my data from the internet or somewhere other than Amazon S3?
Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS ser...
Promote asked 6/6, 2012 at 16:41
1
Solved
SOLVED: See Update #2 below for the 'solution' to this issue.
~~~~~~~
In s3, I have some log*.gz files stored in a nested directory structure like:
s3://($BUCKET)/y=2012/m=11/d=09/H=10/
I'm at...
Pina asked 10/11, 2012 at 3:53
2
Solved
I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using a custom jar.
We can use the following way ...
Litotes asked 14/2, 2012 at 20:45
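One common way to pass such settings to a Hadoop streaming job is as generic `-D key=value` options, which have to come before the streaming-specific options. The sketch below only assembles the argument list. The function name, the paths, and the property values are hypothetical, and whether this applies to the custom-jar setup in the question is not clear from the excerpt.

```python
def streaming_args(conf, mapper, reducer, input_path, output_path):
    """Build a streaming argument list with -D generic options placed first."""
    args = []
    for key, value in conf.items():
        args += ["-D", f"{key}={value}"]
    # Streaming-specific options follow the generic ones.
    args += [
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]
    return args


if __name__ == "__main__":
    # Illustrative values only; suitable settings depend on the job.
    conf = {"mapred.task.timeout": 600000, "mapred.min.split.size": 134217728}
    print(streaming_args(conf, "mapper.py", "reducer.py", "s3://my-bucket/in", "s3://my-bucket/out"))
```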
2
Solved
I have to process some data which is persisted in Amazon DynamoDB using Hadoop MapReduce.
I was searching the internet for a Hadoop InputFormat for DynamoDB and couldn't find one. I'm not famili...
Marlea asked 22/10, 2012 at 21:22