elastic-map-reduce Questions

10

Solved

It has been suggested in the Amazon docs (http://aws.amazon.com/dynamodb/), among other places, that you can back up your DynamoDB tables using Elastic Map Reduce. I have a general understanding of how thi...
Leonaleonanie asked 29/11, 2012 at 16:49

3

1) I have been told that git comes stock installed on EMR. Is this true? I believe not, as I can confirm that "git" is not found in my elastic-mapreduce ssh terminal. See: https://raw.github.com/g...
Podiatry asked 25/7, 2012 at 15:59

1

I'm running a large (more than 100 nodes) series of mapreduce jobs on Amazon Elastic MapReduce. In the reduce phase, already-completed map tasks keep failing with Map output lost, rescheduling: g...

2

I'm experimenting with the Gradient Boosted Trees learning algorithm from the ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. M...

4

I am trying to copy files from S3 to HDFS using a workflow in EMR. When I run the command below, the jobflow starts successfully but gives me an error when it tries to copy the file to HDFS. Do I n...
Coacervate asked 31/1, 2013 at 17:0

2

Solved

I'm running a job on Apache Spark on Amazon Elastic Map Reduce (EMR). Currently I'm running on emr-4.1.0 which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN correctly ha...
Horrorstruck asked 26/11, 2015 at 14:16

3

I have integrated ELK with Pyspark. saved RDD as ELK data on local file system rdd.saveAsTextFile("/tmp/ELKdata") logData = sc.textFile('/tmp/ELKdata/*') errors = logData.filter(lambda line: "r...
Lunetta asked 19/1, 2016 at 6:19

3

Solved

I am using Amazon Elastic Map Reduce 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1. Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want...

0

I would like to upgrade my AWS data pipeline definition to EMR 4.x or 5.x, so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP, etc. The c...

2

Solved

I am struggling to find a way to use S3DistCp in my AWS EMR cluster. Some old examples that show how to add s3distcp as an EMR step use the elastic-mapreduce command, which is no longer used. Some ...
Glassworker asked 8/9, 2016 at 11:38

7

Solved

I have a website running on AWS EC2. I need to create a nightly job that generates a sitemap file and uploads the files to the various browsers. I'm looking for a utility on AWS that allows this fu...

1

Total Instances: I have created an EMR with 11 nodes total (1 master instance, 10 core instances). job submission: spark-submit myApplication.py graph of containers: Next, I've got these gra...
Subcartilaginous asked 22/1, 2017 at 1:8

3

Solved

I would like to read a file from S3 in my EMR Hadoop job. I am using the Custom JAR option. I have tried two solutions: org.apache.hadoop.fs.S3FileSystem: throws a NullPointerException. com.amaz...
Gyatt asked 12/6, 2014 at 12:43

5

Solved

How can I drop all partitions currently loaded in a Hive table? I can drop a single partition with alter table <table> drop partition(a=, b=...); I can load all partitions with the recover ...
Cheery asked 19/3, 2013 at 5:52

4

Solved

There must be a way to change the ports 50070 and 50030 so that the following URLs display the cluster statuses on the ports I pick: NameNode - http://localhost:50070/ JobTracker - http://localhost:...
Sadick asked 16/11, 2012 at 19:1

1

Solved

In EMR, is there a way to get a specific value of the configuration given the configuration key using the yarn command? For example I would like to do something like this yarn get-config yarn.sch...
Millican asked 7/1, 2016 at 22:31

3

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum c...
Lovesome asked 30/11, 2015 at 16:51

3

Solved

I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this: CREATE TABLE csvimport(id BIGINT, time STRING, log STRING) ROW FORMAT DELIMIT...
Tremulant asked 28/2, 2012 at 20:48

2

How to mute DEBUG messages on AWS Elastic MapReduce Master node? hbase(main):003:0> list TABLE mydb 1 row(s) in 0.0510 seconds hbase(main):004:0> 00:25:17.104 [main-SendThread(ip-172-31-1...

7

Solved

I'm running an EMR Activity inside a Data Pipeline analyzing log files and I get the following error when my Pipeline fails: Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsEx...
Enphytotic asked 28/5, 2013 at 16:47

2

I am getting a weird exception when I try to access Cassandra from hadoop, by using ColumnFamilyInputFormat class. In my hadoop process, this is how I connect to cassandra, after including cassand...
Keeper asked 26/11, 2012 at 14:33

3

Recently I've been working with Amazon Web Services (AWS) and I've noticed there is not much documentation on the subject, so I added my solution. I was writing an application using Amazon Elastic...
Electrostriction asked 25/5, 2012 at 16:47

3

Solved

I'm running an EMR Spark job on some LZO-compressed log-files stored in S3. There are several logfiles stored in the same folder, e.g.: ... s3://mylogfiles/2014-08-11-00111.lzo s3://mylogfiles/201...
Gebhardt asked 11/8, 2014 at 16:37

2

Solved

Main question: How do I combine different randomForests in python and scikit-learn? I am currently using the randomForest package in R to generate randomforest objects using elastic map reduce. Th...
Boomerang asked 18/9, 2014 at 13:39

1

Solved

I've been running into some issues recently while trying to use Spark on an AWS EMR cluster. I am creating the cluster using something like : ./elastic-mapreduce --create --alive \ --name "ll_Sp...
Caryloncaryn asked 21/8, 2014 at 7:43
