AWS Glue - Truncate destination postgres table prior to insert
Asked Answered
D

4

11

I am trying to truncate a postgres destination table prior to insert, and in general, trying to fire external functions utilizing the connections already created in GLUE.

Has anyone been able to do so?

Depression answered 2/11, 2017 at 17:16 Comment(2)
Possible duplicate of Overwrite MySQL tables with AWS GlueNarial
Did you manage to do this? I also want to do something like this.Cotenant
G
16

I've tried the DROP/ TRUNCATE scenario, but have not been able to do it with connections already created in Glue, but with a pure Python PostgreSQL driver, pg8000.

  1. Download the tar of pg8000 from pypi
  2. Create an empty __init__.py in the root folder
  3. Zip up the contents & upload to S3
  4. Reference the zip file in the Python lib path of the job
  5. Set the DB connection details as job params (make sure to prepend all key names with --). Tick the "Server-side encryption" box.

Then you can simply create a connection and execute SQL.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

import pg8000

args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'PW',
    'HOST',
    'USER',
    'DB'
])
# ...
# Create Spark & Glue context

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ...
config_port = 5432
conn = pg8000.connect(
    database=args['DB'], 
    user=args['USER'], 
    password=args['PW'],
    host=args['HOST'],
    port=config_port
)
query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))
cur = conn.cursor()
cur.execute(query)
conn.commit()
cur.close()
conn.close()
Glut answered 22/6, 2018 at 8:50 Comment(4)
Actually , I did the exact steps as you mentioned, but still I am getting "No module named pg8000". Can you help me where I went wrong. P.S I got lib from github.com/tlocke/pg8000Housman
I'm also getting the same error - No module named pg8000. how did you overcome this? is there anything else need to be added?Protein
While this is good, does this guarantee transaction? Assume there is a data frame write statement after the connection, will the table still be truncated if there is some error while writing the data frame to DB? @GlutAntichlor
To use the pg8000 module with a Glue job is now simpler. Go to 'Security configuration, script libraries, and job parameters (optional)', scroll down to 'Job Parameters' and enter --additional-python-modules for the key and pg8000 for the value.. and you're done :)Polarize
H
3

data=spark.sql(sql)
conf = glueContext.extract_jdbc_conf("jdbc-commerce")
data.write \
    .mode('overwrite') \
    .format("jdbc") \
    .option("url", conf['url']) \
    .option("database", 'Pacvue_Commerce') \
    .option("dbtable", "dbo.glue_1") \
    .option("user", conf['user']) \
    .option('truncate','true') \
    .option("password", conf['password']) \
    .save()

glue api not support , but spark api support.

jdbc-commerce is your connection name at crawl. use extract_jdbc_conf to get url、username and password.

Heinrick answered 25/6, 2022 at 3:6 Comment(1)
For this to work, I had to add .option("connschema", "<database>") to the options.Slenderize
H
1

After following step (4) of @thenaturalist's response,

sc.addPyFile("/home/glue/downloads/python/pg8000.zip")

import pg8000

worked for me in a development endpoint (zeppelin notebook)

More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

Houstonhoustonia answered 15/5, 2019 at 11:56 Comment(0)
P
1

To clarify @thenaturalist instructions for the zip as I still struggled with this


Download the tar.gz of pg8000 from pypi.org and extract.
Zip the contents so you have the below structure

pg8000-1.15.3.zip
|
| -- pg8000 <dir>
    | -- __init__.py
    | -- _version.py <optional>
    | -- core.py

Upload to s3 and then you should be able to just do a simple import pg8000.

NOTE: scramp is also required at the moment so follow the same procedure as above to include the scramp module. But you don't need to import it.

Pitchford answered 3/7, 2020 at 1:42 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.