AWS Glue - Truncate destination postgres table prior to insert

Asked 2/11, 2017 at 17:16 Answered 25/6, 2022 at 3:6

Solved python postgresql pyspark aws-glue

I am trying to truncate a postgres destination table prior to insert, and in general, trying to fire external functions utilizing the connections already created in GLUE.

Has anyone been able to do so?

Depression answered 2/11, 2017 at 17:16 Comment(2)

Possible duplicate of Overwrite MySQL tables with AWS Glue – Narial 18/12, 2017 at 20:15

Did you manage to do this? I also want to do something like this. – Cotenant 25/4, 2018 at 3:34

I've tried the DROP/ TRUNCATE scenario, but have not been able to do it with connections already created in Glue, but with a pure Python PostgreSQL driver, pg8000.

Download the tar of pg8000 from pypi
Create an empty __init__.py in the root folder
Zip up the contents & upload to S3
Reference the zip file in the Python lib path of the job
Set the DB connection details as job params (make sure to prepend all key names with --). Tick the "Server-side encryption" box.

Then you can simply create a connection and execute SQL.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

import pg8000

args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'PW',
    'HOST',
    'USER',
    'DB'
])
# ...
# Create Spark & Glue context

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ...
config_port = 5432
conn = pg8000.connect(
    database=args['DB'], 
    user=args['USER'], 
    password=args['PW'],
    host=args['HOST'],
    port=config_port
)
query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))
cur = conn.cursor()
cur.execute(query)
conn.commit()
cur.close()
conn.close()

Glut answered 22/6, 2018 at 8:50 Comment(4)

Actually , I did the exact steps as you mentioned, but still I am getting "No module named pg8000". Can you help me where I went wrong. P.S I got lib from github.com/tlocke/pg8000 – Housman 24/4, 2019 at 3:6

I'm also getting the same error - No module named pg8000. how did you overcome this? is there anything else need to be added? – Protein 1/5, 2019 at 19:51

While this is good, does this guarantee transaction? Assume there is a data frame write statement after the connection, will the table still be truncated if there is some error while writing the data frame to DB? @Glut – Antichlor 6/7, 2020 at 10:7

To use the pg8000 module with a Glue job is now simpler. Go to 'Security configuration, script libraries, and job parameters (optional)', scroll down to 'Job Parameters' and enter --additional-python-modules for the key and pg8000 for the value.. and you're done :) – Polarize 15/7, 2021 at 20:35


data=spark.sql(sql)
conf = glueContext.extract_jdbc_conf("jdbc-commerce")
data.write \
    .mode('overwrite') \
    .format("jdbc") \
    .option("url", conf['url']) \
    .option("database", 'Pacvue_Commerce') \
    .option("dbtable", "dbo.glue_1") \
    .option("user", conf['user']) \
    .option('truncate','true') \
    .option("password", conf['password']) \
    .save()

glue api not support , but spark api support.

jdbc-commerce is your connection name at crawl. use extract_jdbc_conf to get url、username and password.

Heinrick answered 25/6, 2022 at 3:6 Comment(1)

For this to work, I had to add .option("connschema", "<database>") to the options. – Slenderize 15/2, 2023 at 21:58

After following step (4) of @thenaturalist's response,

sc.addPyFile("/home/glue/downloads/python/pg8000.zip")

import pg8000

worked for me in a development endpoint (zeppelin notebook)

More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

Houstonhoustonia answered 15/5, 2019 at 11:56 Comment(0)

To clarify @thenaturalist instructions for the zip as I still struggled with this

Download the tar.gz of pg8000 from pypi.org and extract.
Zip the contents so you have the below structure

pg8000-1.15.3.zip
|
| -- pg8000 <dir>
    | -- __init__.py
    | -- _version.py <optional>
    | -- core.py

Upload to s3 and then you should be able to just do a simple import pg8000.

NOTE: scramp is also required at the moment so follow the same procedure as above to include the scramp module. But you don't need to import it.

Pitchford answered 3/7, 2020 at 1:42 Comment(0)

Recommended topics

Hot tags