Pipeline from AWS RDS to S3 using Glue
Asked Answered
S

1

6

I was trying AWS Glue to migrate our current data pipeline from python scripts to AWS Glue . I was able to setup a crawler to pull the schema for the different postgres databases . However, I am facing issues in pulling data from Postgres RDS to S3 tables in Athena .

  • Is there a way to directly pull data from AWS RDS to S3 tables in Athena ?
  • If Yes , then how
  • If not then any better suggestions are always welcome

Thanks in advance !

Sweatbox answered 11/12, 2018 at 3:54 Comment(0)
G
9

You can't pull data from AWS RDS to S3 using Athena. Athena is a query engine over S3 data. To be able to extract data from RDS to S3, you can run a Glue job to read from a particular RDS table and create S3 dump in parquet format which will create another external table pointing to S3 data. Then you can query that S3 data using Athena. A sample code snippet to read from RDS using Glue catalog and write parquet in S3 will look like below. There are some Glue predefined template which you can use to experiment. Start with a small table first. Please let me know if it worked out for you or any further questions/issues.

datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="postgresql", connection_options = 
{"url": "jdbc-url/database",
"user": "user_name",
"password": "password",
"dbtable": "table_name"},
transformation_ctx = "datasource0")

   datasink4 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://aws-glue-tpcds-parquet/"+ tableName + "/"}, format = "parquet", transformation_ctx = "datasink4")
Geum answered 13/12, 2018 at 11:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.