amazon-emr Questions
7
Solved
If I had to perform ETL on a huge dataset (say 1 TB) stored in S3 as CSV files, both AWS Glue ETL jobs and AWS EMR steps can be used. So how is AWS Glue different from AWS EMR? And which is the bett...
Circassia asked 7/6, 2020 at 20:19
4
On an AWS EMR cluster, I'm trying to write a query result to Parquet using PySpark but get the following error:
Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields ar...
Wreckful asked 10/1, 2020 at 1:13
4
Using Hive 2.3.7 on AWS EMR (5.33.1) I have created a database which shows correctly when calling show databases;. I then create a table which seems to work correctly (no exceptions). When I call d...
Inwrap asked 2/12, 2021 at 10:35
4
I'm facing a problem running jobs on Amazon EMR when I try to write data to S3.
This is the stacktrace:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local dire...
Nd asked 13/10, 2020 at 20:14
5
Solved
I have a large (about 85 GB compressed) gzipped file from s3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances each w...
Beaty asked 8/11, 2016 at 17:26
6
I'm trying to install pyarrow on the master instance of my EMR cluster, but I always receive this error.
[hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow
Collecting p...
Willman asked 5/9, 2018 at 9:12
4
Solved
I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates becaus...
Cuckoopint asked 2/12, 2016 at 7:52
2
Solved
I am trying to set up an environment to support exploratory data analytics on a cluster. Based on an initial survey of what's out there, my target is to use Scala/Spark with Amazon EMR to provision the...
Cloudless asked 16/6, 2016 at 1:1
3
I want my Spark application to read a table from DynamoDB, do stuff, then write the result in DynamoDB.
Read the table into a DataFrame
Right now, I can read the table from DynamoDB into Spark a...
Bertold asked 8/12, 2017 at 21:48
3
Solved
I use an EMR Notebook connected to an EMR cluster. The kernel is Spark and the language is Scala. I need some jars that are located in an S3 bucket.
How can I add jars?
In case of 'spark-shell' it's easy:
spar...
Northward asked 13/8, 2019 at 8:28
5
Solved
I am using an EMR Activity in AWS Data Pipeline. This EMR Activity runs a Hive script in an EMR cluster. It takes DynamoDB as input and stores data in S3.
This is the EMR step used in EMR Act...
Albinus asked 18/3, 2017 at 15:27
3
The following is the log dump of one of the containers. I got an exception stating that a folder can't be created due to a permissions issue. I have tried troubleshooting several times, but the problem persists.
1...
Billat asked 19/12, 2016 at 11:41
4
Solved
I'm creating a cluster in AWS EMR, and when Spark runs my application I get the error below:
Exception in thread "main" java.lang.UnsupportedClassVersionError:
com/example/demodriver/MyC...
Crowe asked 27/1, 2022 at 22:37
6
I've created an EMR cluster with the Glue Data Catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via
spark.catalog.setCurrentDatabase("test...
Dewain asked 19/9, 2017 at 3:29
4
I run into problems when calling Spark's MinHashLSH's approxSimilarityJoin on a dataframe of (name_id, name) combinations.
A summary of the problem I am trying to solve:
I have a dataframe of around 3...
Ethelethelbert asked 28/5, 2020 at 13:11
4
I have Airflow jobs that run fine on the EMR cluster. What I need is this: say I have 4 Airflow jobs that require an EMR cluster for about 20 minutes to complete their tasks. Why not...
Beaverboard asked 18/3, 2019 at 18:15
2
Solved
I am getting this error when writing a Parquet file; it started happening recently:
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your re...
Dunford asked 7/9, 2017 at 18:59
3
When an EMR machine tries to run a step that includes boto3 initialisation, it sometimes gets the following error:
ValueError: Invalid endpoint: https://s3..amazonaws.com
When I'm trying to set up ...
Pompous asked 15/9, 2019 at 10:5
2
I have a Spark application that I'm able to run locally. The dependencies I have are:
dependencies {
implementation "org.scala-lang:scala-library:${scalaVersion}"
implementation "...
Unused asked 13/7, 2021 at 13:0
7
This is very close to this question, but I have added a few details specific to my question:
Matplotlib Plotting using AWS-EMR jupyter notebook
I would like to find a way to use matplotlib inside...
Optional asked 22/5, 2019 at 21:0
4
OSM data is available in PBF format. There are specialised libraries (such as https://github.com/plasmap/geow) for parsing this data.
I want to store this data on S3 and parse the data into an RDD...
Penmanship asked 23/11, 2016 at 0:11
4
I am running a Spark job on an EMR cluster. The issue I am facing is that all the EMR jobs triggered are executing in steps (in a queue).
Is there any way to make them run in parallel?
If not, is there any a...
Tabethatabib asked 30/3, 2017 at 14:54
2
I have a Glue table with a column tlc whose datatype is Bigint.
I am trying to do the following using PySpark:
Read the Glue table and write it in a Dataframe
Join with another table
Write the re...
Fortuitous asked 24/3, 2020 at 3:34
1
Solved
I have created an AWS EMR cluster and notebook using default settings.
When I open the notebook, the kernel won't launch. I get the message "Workspace is not attached to cluster".
The cl...
Fez asked 4/4, 2022 at 1:21
4
Solved
Does anyone know of a Scala SDK for Amazon Web Services? I am particularly interested in the EMR jobs.
Millman asked 6/6, 2013 at 21:35