amazon-emr - McMap

7

Solved

What is the difference between AWS Glue ETL Job and AWS EMR?

If I had to perform ETL on a huge dataset (say 1 TB) stored in S3 as CSV files, both an AWS Glue ETL job and AWS EMR steps could be used. How, then, is AWS Glue different from AWS EMR? And which is the bett...

amazon-web-services amazon-s3 etl amazon-emr aws-glue

Circassia asked 7/6, 2020 at 20:19

4

"Parquet record is malformed" while column count is not 0

On an AWS EMR cluster, I'm trying to write a query result to Parquet using PySpark but I get the following error: Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields ar...

hive pyspark amazon-emr parquet

Wreckful asked 10/1, 2020 at 1:13

4

Hive "Show Tables" Fails with MetaException

Using Hive 2.3.7 on AWS EMR (5.33.1) I have created a database which shows correctly when calling show databases;. I then create a table which seems to work correctly (no exceptions). When I call d...

hive amazon-emr

Inwrap asked 2/12, 2021 at 10:35

4

Problem Could not find any valid local directory for s3ablock-0001-

I'm facing a problem running jobs on Amazon EMR when I try to write data to S3. This is the stack trace: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local dire...

apache-spark hadoop pyspark amazon-emr

Nd asked 13/10, 2020 at 20:14

5

Solved

Dealing with a large gzipped file in Spark

I have a large (about 85 GB compressed) gzipped file from s3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances each w...

apache-spark gzip amazon-emr

Beaty asked 8/11, 2016 at 17:26

6

Python pip install pyarrow error, unable to execute 'cmake'

I'm trying to install pyarrow on the master instance of my EMR cluster, but I always receive this error: [hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow Collecting p...

python-3.x cmake pip amazon-emr pyarrow

Willman asked 5/9, 2018 at 9:12

4

Solved

How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates becaus...

apache-spark apache-spark-sql parquet amazon-emr

Cuckoopint asked 2/12, 2016 at 7:52

2

Solved

How to configure high performance BLAS/LAPACK for Breeze on Amazon EMR, EC2

I am trying to set up an environment to support exploratory data analytics on a cluster. Based on an initial survey of what's out there, my target is to use Scala/Spark with Amazon EMR to provision the...

apache-spark amazon-ec2 amazon-emr scala-breeze jblas

Cloudless asked 16/6, 2016 at 1:1

3

Spark 2.2.0 - How to write/read DataFrame to DynamoDB

I want my Spark application to read a table from DynamoDB, do stuff, then write the result to DynamoDB. Read the table into a DataFrame: Right now, I can read the table from DynamoDB into Spark a...

scala apache-spark amazon-dynamodb amazon-emr

Bertold asked 8/12, 2017 at 21:48

3

Solved

Adding external jars in EMR Notebooks

I use an EMR Notebook connected to an EMR cluster. The kernel is Spark and the language is Scala. I need some jars that are located in an S3 bucket. How can I add the jars? In the case of 'spark-shell' it's easy: spar...

scala apache-spark jupyter-notebook amazon-emr

Northward asked 13/8, 2019 at 8:28

5

Solved

Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

I am using an EMR Activity in AWS Data Pipeline. This EMR Activity runs a Hive script in an EMR cluster. It takes DynamoDB as input and stores data in S3. This is the EMR step used in EMR Act...

amazon-web-services hadoop amazon-s3 amazon-emr

Albinus asked 18/3, 2017 at 15:27

3

Cannot create temp dir with proper permission: /mnt1/s3

The following is the log dump of one of the containers. I got an exception stating that a folder can't be created due to some permissions. I have tried troubleshooting several times, but the problem still exists. 1...

amazon-web-services apache-spark amazon-s3 amazon-emr

Billat asked 19/12, 2016 at 11:41

4

Solved

How to use Java runtime 11 in an AWS EMR cluster

I'm creating a cluster in AWS EMR, and when Spark runs my application I get the error below: Exception in thread "main" java.lang.UnsupportedClassVersionError: com/example/demodriver/MyC...

java apache-spark amazon-emr java-11

Crowe asked 27/1, 2022 at 22:37

6

Spark Catalog w/ AWS Glue: database not found

I've created an EMR cluster with the Glue Data Catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via spark.catalog.setCurrentDatabase("test...

apache-spark amazon-emr aws-glue

Dewain asked 19/9, 2017 at 3:29

4

All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster

I run into problems when calling approxSimilarityJoin on Spark's MinHashLSH with a dataframe of (name_id, name) combinations. A summary of the problem I am trying to solve: I have a dataframe of around 3...

pyspark apache-spark-sql garbage-collection amazon-emr minhash

Ethelethelbert asked 28/5, 2020 at 13:11

4

EMR cluster creation using an Airflow DAG run; once the task is done, EMR will be terminated

I have Airflow jobs that run fine on the EMR cluster. What I need is this: let's say I have 4 Airflow jobs that require an EMR cluster for, say, 20 minutes to complete the task. Why not...

apache-spark hadoop airflow amazon-emr

Beaverboard asked 18/3, 2019 at 18:15

2

Solved

S3 SlowDown error in Spark on EMR

I am getting this error when writing a Parquet file; it started happening recently: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your re...

scala apache-spark amazon-s3 amazon-emr apache-spark-dataset

Dunford asked 7/9, 2017 at 18:59

3

ValueError: Invalid endpoint: https://s3..amazonaws.com

When an EMR machine tries to run a step that includes boto3 initialisation, it sometimes gets the following error: ValueError: Invalid endpoint: https://s3..amazonaws.com When I'm trying to set up ...

python amazon-web-services amazon-s3 boto3 amazon-emr

Pompous asked 15/9, 2019 at 10:5

2

What dependencies should be included when deploying a Spark application to EMR 6.x?

I have a Spark application that I'm able to run locally. The dependencies I have are: dependencies { implementation "org.scala-lang:scala-library:${scalaVersion}" implementation "...

amazon-web-services apache-spark gradle amazon-emr

Unused asked 13/7, 2021 at 13:0

7

How do I make matplotlib work in AWS EMR Jupyter notebook?

This is very close to this question, but I have added a few details specific to my question: Matplotlib Plotting using AWS-EMR jupyter notebook I would like to find a way to use matplotlib inside...

python matplotlib pyspark jupyter-notebook amazon-emr

Optional asked 22/5, 2019 at 21:0

4

Processing (OSM) PBF files in Spark

OSM data is available in PBF format. There are specialised libraries (such as https://github.com/plasmap/geow) for parsing this data. I want to store this data on S3 and parse the data into an RDD...

scala apache-spark amazon-emr osm.pbf

Penmanship asked 23/11, 2016 at 0:11

4

Running steps of EMR in parallel

I am running a Spark job on an EMR cluster. The issue I am facing is that all the EMR jobs triggered are executing as steps (in a queue). Is there any way to make them run in parallel? If not, is there any a...

web-services amazon-web-services apache-spark amazon-emr

Tabethatabib asked 30/3, 2017 at 14:54

2

Parquet column cannot be converted in file, Expected: bigint, Found: INT32

I have a Glue table with a column tlc whose datatype is bigint. I am trying to do the following using PySpark: read the Glue table into a DataFrame; join it with another table; write the re...

apache-spark pyspark amazon-emr parquet aws-glue

Fortuitous asked 24/3, 2020 at 3:34

1

Solved

Why can't an EMR Notebook connect to its cluster when running as the AWS account owner?

I have created an AWS EMR cluster and notebook using default settings. When I open the notebook, the kernel won't launch. I get the message "Workspace is not attached to cluster". The cl...

amazon-emr

Fez asked 4/4, 2022 at 1:21

4

Solved

Any Scala SDK or interface for AWS?

Does anyone know of a Scala SDK for Amazon Web Services? I am particularly interested in the EMR jobs.

scala amazon-web-services emr amazon-emr

Millman asked 6/6, 2013 at 21:35
