amazon-emr Questions

2

Solved

Is there a way to send EMR logs to CloudWatch instead of S3? We would like to have all our services' logs in one location. It seems like the only thing you can do is set up alarms for monitoring, but th...

3

Solved

I'm trying to use the graphframes package in pyspark in Jupyter Notebook (using Sagemaker and sparkmagic) on AWS EMR. I've tried adding a configuration option when creating the EMR cluster in the A...
Glassworks asked 4/6, 2019 at 14:47

2

Solved

Recently, AWS announced Amazon EMR Serverless (Preview) https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/ - new very...
Azaleah asked 12/12, 2021 at 8:10

2

I have one EMR cluster that runs 24/7. I can't turn it off and launch a new one. What I would like to do is perform something like a bootstrap action on the already running cluster, pre...
Oralee asked 26/10, 2014 at 17:18

3

I am trying to run SQL queries using the spark.sql() or sqlContext.sql() method (here spark is the variable for SparkSession object available to us when we start EMR Notebook) on a public dataset u...
Benefic asked 4/9, 2019 at 0:56

3

I'm executing a Flink job with these tools. I think both can do exactly the same with the proper configuration. Does Kinesis Data Analytics do something that EMR cannot do, or vice versa? Amazon Ki...

5

Solved

I'm trying to create a cluster from inside one of my EC2 instances. I type the following command to start my cluster: aws emr create-cluster --release-label emr-5.20.0 --instance-groups instance-g...
Leavis asked 15/3, 2019 at 20:35

2

Solved

I created an EMR cluster on AWS with Spark and Livy. I submitted a custom JAR with some additional libraries (e.g. datasources for custom formats) as a custom JAR step. However, the stuff from the ...
Vociferation asked 19/6, 2019 at 11:18

2

Solved

I need to perform an initial upload of roughly 130 million items (5+ GB total) into a single DynamoDB table. After I faced problems with uploading them using the API from my application, I decided ...
Woolpack asked 21/5, 2012 at 9:58

0

I need to render HTML from a cell in a Jupyter Notebook on an EMR cluster. Things that have not worked so far: using IPython display: from IPython.core.display import display, HTML example = '<htm...
Calistacalisthenics asked 25/11, 2021 at 15:55

3

According to this question ("--files option in pyspark not working"), the sc.addFiles option should work for accessing files in both the driver and executors. But I cannot get it to work on the execut...
Efficient asked 27/1, 2021 at 15:42

2

I have a PySpark job that runs without any issues when run locally, but when it runs on the AWS cluster, it gets stuck when it reaches the code below. The job just processes 100 r...

1

The pyspark3, pyspark, and spark kernels in the JupyterHub Docker container on Amazon EMR do not seem to allow autocompletion of function names or docstring lookup (Shift+Tab). Has anyone else noticed this behaviou...
Trimmer asked 9/9, 2018 at 14:10

4

I have a VPC in an AWS account with 5 subnets associated with it. The subnets are of 2 types: public and private. How can I identify which subnet is public and which is private? Each subnet ...
Itemize asked 16/2, 2018 at 16:17

2

Solved

I'm trying to use all resources on my EMR cluster. The cluster itself is 4 m4.4xlarge machines (1 driver and 3 workers) with 16 vCores, 64 GiB memory, and 128 GiB EBS storage. When launching the cluster ...
Astonish asked 22/9, 2021 at 14:46

2

After launching a cluster with the bootstrap code below and getting the stdout below, when I try to import pandas in pyspark, I get the following error due to a conflict with a different numpy version ...
Malvina asked 16/7, 2021 at 9:31

2

TL;DR: I want to run the command sudo yes | sudo pip3 uninstall numpy twice in EMR bootstrap actions, but it runs only once. I will first say that my goal is to run a PySpark-enabled EMR managed not...
Transposition asked 10/8, 2021 at 9:14

0

How can I make stdout logs appear in the EMR Steps tab? The logs are in the S3 bucket, but only the stdout won't show.
Ebon asked 27/8, 2021 at 19:2

1

I'm new to using Spark with PySpark on JupyterHub. I understand that before creating an EMR cluster I can set a bootstrap action to set up the environment on each cluster, such as Python packages and libraries. But if I alre...
Scorecard asked 22/5, 2020 at 12:9

2

Solved

When I create an AWS EMR Notebook, I get the error below. The service role is EMR_Notebook_DefaultRole. Error: Service role does not have permission to access the LocationUri {}. What would be the root caus...
Vauntcourier asked 28/1, 2021 at 16:48

4

Solved

I am trying to run a bash script as a step after EMR completes bootstrapping. The following is my Terraform code: step { action_on_failure = "CONTINUE" name = "Setup Hadoop configuration" hadoop_jar...
Chloe asked 11/8, 2018 at 21:42

1

Solved

I have containerized ML job code written in Python into a Docker container and am able to run it as a Docker service using Amazon ECS. I would like to run it in a distributed way using Spark (PySpark) and deplo...
Graybeard asked 25/5, 2017 at 12:5

6

I am running an EMR cluster and trying to use a Zeppelin notebook for data analysis. Versions: Release label: emr-5.2.1; Hadoop distribution: Amazon 2.7.3; Hive 2.1.0; Spark 2.0.2; Zeppelin 0.6.2. I ...

4

Precisely following the step-by-step instructions on this page, I am trying to export the contents of one of my DynamoDB tables to an S3 bucket. I create a pipeline exactly as instructed but it fails to...

3

Solved

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one: spark = SparkSession \ .builder \ ...
Cigarette asked 7/2, 2017 at 13:51
