aws-glue Questions

2

For our use case we need to load JSON files from an S3 bucket. We are using AWS Glue as our processing tool. But because we will soon be migrating to Amazon EMR, we are already developing our Glue j...
Weinrich asked 24/1, 2023 at 15:38

7

Solved

If I had to perform ETL on a huge dataset (say 1 TB) stored in S3 as CSV files, both an AWS Glue ETL job and AWS EMR steps could be used. So how is AWS Glue different from AWS EMR? And which is the bett...
Circassia asked 7/6, 2020 at 20:19

3

Solved

I have an s3 bucket that I'm trying to crawl and catalog. The format is something like this, where the SQL files are DDL queries (CREATE TABLE statements) that match the schema of the different dat...
Gaikwar asked 15/2, 2018 at 16:55

4

Solved

I'm getting the following error when I try to create a development endpoint for AWS Glue. { "service":"AWSGlue", "statusCode":400, "errorCode":"Validati...
Soudan asked 12/2, 2018 at 19:30

2

Solved

I am trying to use the AWSGlue module in Python, but cannot install the module in the terminal. sh-4.2$ pip install awsglue Collecting awsglue Could not find a version that satisfies the requireme...
Unutterable asked 28/3, 2019 at 15:28

5

Solved

I am trying to use an AWS Glue crawler on an S3 bucket to populate a Glue database. I run the Create Crawler wizard, select my data source (the S3 bucket with the Avro files), have it create the IAM...
Clarence asked 20/8, 2019 at 20:54

4

Solved

I have an AWS Glue job, with max concurrent runs set to 1. The job is currently not running. But when I try to run it, I keep getting the error: "Max concurrent runs exceeded". Deleting a...
Essary asked 18/3, 2021 at 12:3

2

I’m trying to get a list of the tables from a database in my AWS Glue Data Catalog. I’m trying to use boto3. I’m running the code below on AWS, in a SageMaker notebook. It runs forever (like over 30 min...
Heisser asked 7/8, 2019 at 20:1
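
The asker's code is cut off in this excerpt, so the cause of the hang is not known. Purely as a point of comparison, here is a minimal sketch of listing table names with the boto3 Glue client's `get_tables` paginator. The helper names `iter_table_names` and `list_table_names`, the region, and the database name are assumptions made for illustration; the client is passed in so the function can be tried with any object that exposes `get_paginator`.

```python
from typing import Any, Iterator, List


def iter_table_names(glue_client: Any, database: str) -> Iterator[str]:
    """Yield every table name in one Glue Data Catalog database, page by page."""
    # "get_tables" is the boto3 Glue operation that lists tables in a database.
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        # A page without "TableList" is treated as empty.
        for table in page.get("TableList", []):
            yield table["Name"]


def list_table_names(glue_client: Any, database: str) -> List[str]:
    """Collect all table names into a list."""
    return list(iter_table_names(glue_client, database))


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")  # placeholder region
#   print(list_table_names(glue, "my_database"))          # hypothetical database
```

Paginating keeps each response small, which may help when a database holds many tables; it does not address network or permission problems between the notebook and the Glue endpoint.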

9

Solved

How can I implement an optional parameter for an AWS Glue job? I have created a job that currently has a string parameter (an ISO 8601 date string) as an input that is used in the ETL job. I would...
Durango asked 4/9, 2018 at 8:27
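
One possible approach, sketched here rather than taken from the thread: Glue hands job arguments to the script as `--key value` pairs on `sys.argv`, so the standard library's `argparse` can read an argument that may or may not be present. The parameter name `run_date` and the helper `resolve_args` are hypothetical.

```python
import argparse
import sys
from typing import List


def resolve_args(argv: List[str]) -> argparse.Namespace:
    """Parse known job arguments and ignore anything else on the command line."""
    # allow_abbrev=False stops "--run" from silently matching "--run_date".
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--JOB_NAME", default=None)
    # Hypothetical optional parameter: stays None when the caller omits it.
    parser.add_argument("--run_date", default=None)
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Usage inside a job script:
#   args = resolve_args(sys.argv[1:])
#   if args.run_date is None:
#       ...  # fall back to whatever default the job should use
```

Returning `None` for a missing value leaves the choice of default to the caller instead of baking it into the parser.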

8

I'm trying to run code that uses psycopg2 to manipulate a Redshift instance. I have tried importing a wheel file, as I see they are supported in Glue Python jobs. I see the library is installed...
Communistic asked 4/8, 2020 at 11:34

10

Part One: I ran a Glue crawler on a dummy CSV file loaded in S3. It created a table, but when I try to view the table in Athena and query it, it shows "Zero Records returned". But the demo data of ELB in At...

5

First Stack Overflow question here. Hope I do this correctly: I need to use an external Python library in AWS Glue. "Openpyxl" is the name of the library. I follow these directions: https://docs....
Avraham asked 2/10, 2019 at 16:55

1

So I recently started using Glue and PySpark for the first time. The task was to create a Glue job that does the following: load data from Parquet files residing in an S3 bucket; apply a filter to ...
Hoick asked 20/4, 2022 at 13:53

4

Solved

Is there a temporary folder that I can access to hold files temporarily while running processes within AWS Glue? For example, in Lambda we have access to a /tmp directory as long as the process is ...
Symphony asked 12/1, 2018 at 18:29

2

Solved

The documentation on toDF() method specifies that we can pass an options parameter to this method. But it does not specify what those options can be (https://docs.aws.amazon.com/glue/latest/dg/aws-...
Malpighi asked 5/10, 2020 at 19:54

1

I have a table in the AWS Glue catalog in which every column has the string data type, and the files are stored as Parquet files in S3. I want to create a Glue job that will simply read the data in from that cat...
Reduplication asked 8/8, 2019 at 13:28

5

Solved

As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table...
Hymn asked 15/9, 2017 at 13:44

2

I am relatively new to AWS and this may be a less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading in a series of tables that each ...
Blacking asked 13/9, 2018 at 15:8

2

I would like to see the custom logs that I create inside an AWS SageMaker JupyterLab notebook (that uses a Glue development endpoint). I want to see them as the output of a notebook cell. I tried ...
Valentino asked 28/2, 2020 at 12:31

3

I tried converting my Spark DataFrames to DynamicFrames to output them as glueparquet files, but I'm getting the error "'DataFrame' object has no attribute 'fromDF'". My code relies heavily on Spark DataFrames...
Bregma asked 24/11, 2019 at 4:25

2

Is there any way to run Spark SQL queries with a local master against AWS Glue? I launch this code on my local PC: SparkSession.builder() .master("local") .enableHiveSupport() .config("hive.metastore.c...

2

I have the simple script below for AWS Glue. I have a text file with empty cells and a table which accepts NULL values. When I run the Glue job it fails with the exception, "Don't know how to save ...
Mizuki asked 28/11, 2017 at 0:24

1

In the image below we have the same Glue job run with three different configurations in terms of how we write to S3: we used a dynamic frame to write to S3; we used a pure Spark frame to write to S...
Delgado asked 21/12, 2021 at 8:25

2

This is my requirement: I have a crawler and a PySpark job in AWS Glue. I have to set up the workflow using Step Functions. Questions: How can I add the crawler as the first state? What are the paramete...
Superintendency asked 29/1, 2020 at 11:20
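
As a rough sketch only, and not the answer from the thread: one way to put a crawler first is a state machine that starts the crawler, polls until it reports `READY`, and then runs the job. The definition below is built as a Python dict and serialized to Amazon States Language JSON. The resource ARNs are the Step Functions AWS SDK integration for Glue and the Glue `startJobRun.sync` integration as I understand them, so check them against the current documentation. The function name `build_definition`, the state names, and the polling interval are assumptions.

```python
import json
from typing import Any, Dict


def build_definition(crawler_name: str, job_name: str,
                     poll_seconds: int = 30) -> Dict[str, Any]:
    """Build a state machine: start crawler, wait until READY, run the job."""
    return {
        "Comment": "Run a Glue crawler, wait for it to finish, then run a Glue job",
        "StartAt": "StartCrawler",
        "States": {
            # StartCrawler returns immediately, so the crawler is polled below.
            "StartCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "Next": "WaitForCrawler",
            },
            "WaitForCrawler": {
                "Type": "Wait",
                "Seconds": poll_seconds,
                "Next": "GetCrawler",
            },
            "GetCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
                "Parameters": {"Name": crawler_name},
                "Next": "IsCrawlerReady",
            },
            # A crawler that has finished goes back to the READY state.
            "IsCrawlerReady": {
                "Type": "Choice",
                "Choices": [{
                    "Variable": "$.Crawler.State",
                    "StringEquals": "READY",
                    "Next": "RunJob",
                }],
                "Default": "WaitForCrawler",
            },
            # The .sync suffix makes the state wait for the job run to finish.
            "RunJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "End": True,
            },
        },
    }


# Hypothetical crawler and job names, for illustration only.
definition_json = json.dumps(build_definition("my-crawler", "my-job"), indent=2)
```

The resulting JSON string can be pasted into the Step Functions console or passed to whatever deployment tooling is in use. The sketch does not handle crawler failures or add a timeout to the polling loop.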

4

My Athena queries appear to return too few results, and I'm trying to figure out why. Setup: Glue catalogs (118.6 GB in size). Data: stored in S3 in both CSV and JSON format. Athena query: Wh...
Must asked 18/1, 2018 at 19:26
