How to pipe data from AWS Postgres RDS to S3 (then Redshift)?

I'm using the AWS Data Pipeline service to pipe data from an RDS MySQL database to S3 and then on to Redshift, which works nicely.

However, I also have data living in an RDS PostgreSQL instance which I would like to pipe the same way, but I'm having a hard time setting up the JDBC connection. If this is unsupported, is there a work-around?

"connectionString": "jdbc:postgresql://THE_RDS_INSTANCE:5432/THE_DB”
Twotime answered 6/11, 2014 at 14:21 Comment(0)

This doesn't work yet; AWS hasn't built / released the functionality to connect nicely to Postgres. You can do it in a ShellCommandActivity, though. You can write a little Ruby or Python code to do it and drop that in a script on S3 using scriptUri. You could also just write a psql command to dump the table to a CSV and then pipe that to OUTPUT1_STAGING_DIR with "stage": "true" in that activity node.

Something like this:

{
  "id": "DumpCommand",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "MyEC2Resource" },
  "stage": "true",
  "output": { "ref": "S3ForRedshiftDataNode" },
  "command": "PGPASSWORD=password psql -h HOST -U USER -d DATABASE -p 5432 -t -A -F\",\" -c \"select blah_id from blahs\" > ${OUTPUT1_STAGING_DIR}/my_data.csv"
}

I didn't run this to verify because it's a pain to spin up a pipeline :( so double-check the escaping in the command.

  • Pros: super straightforward and requires no additional script files to upload to S3.
  • Cons: not exactly secure. Your DB password will be transmitted over the wire without encryption.

Look into the parameterized templating for Data Pipeline that AWS just launched: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-custom-templates.html. It looks like it will allow encryption of arbitrary parameters.
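As a rough illustration, here is a sketch of the same DumpCommand with the connection details pulled out into parameters. The myRds* parameter names are placeholders, not required names, and how much protection you actually get depends on how the parameter values are supplied, so check the linked docs; treat this as a starting point rather than a verified definition:

{
  "objects": [
    {
      "id": "DumpCommand",
      "type": "ShellCommandActivity",
      "runsOn": { "ref": "MyEC2Resource" },
      "stage": "true",
      "output": { "ref": "S3ForRedshiftDataNode" },
      "command": "PGPASSWORD=#{myRdsPassword} psql -h #{myRdsHost} -U #{myRdsUser} -d #{myRdsDatabase} -p 5432 -t -A -F\",\" -c \"select blah_id from blahs\" > ${OUTPUT1_STAGING_DIR}/my_data.csv"
    }
  ],
  "parameters": [
    { "id": "myRdsHost", "type": "String", "description": "RDS endpoint" },
    { "id": "myRdsUser", "type": "String", "description": "Database user" },
    { "id": "myRdsPassword", "type": "String", "description": "Database password" },
    { "id": "myRdsDatabase", "type": "String", "description": "Database name" }
  ]
}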

Kristopher answered 28/11, 2014 at 17:42 Comment(1)
As a quick update: PostgreSQL now supports SSL, so if you're on a newer version make sure to set ssl=true so encryption is set up.Zinfandel

Nowadays you can define a CopyActivity to extract data from a PostgreSQL RDS instance into S3. In the Data Pipeline interface:

  1. Create a data node of the type SqlDataNode. Specify the table name and select query.
  2. Set up the database connection by specifying the RDS instance ID (the instance ID is in your URL, e.g. your-instance-id.xxxxx.eu-west-1.rds.amazonaws.com), along with the username, password and database name.
  3. Create a data node of the type S3DataNode.
  4. Create a CopyActivity and set the SqlDataNode as input and the S3DataNode as output (a rough sketch of the resulting pipeline definition follows the list).
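For anyone assembling this as a pipeline-definition file instead of clicking through the UI, a minimal sketch of those four objects might look like the following. The IDs, table, bucket path and credentials are placeholders, and the activity still needs an Ec2Resource (here assumed to be MyEC2Resource) to run on; treat it as a starting point, not a verified definition:

{
  "objects": [
    {
      "id": "rds_postgres",
      "type": "RdsDatabase",
      "rdsInstanceId": "your-instance-id",
      "databaseName": "THE_DB",
      "username": "USER",
      "*password": "PASSWORD"
    },
    {
      "id": "SourceTable",
      "type": "SqlDataNode",
      "database": { "ref": "rds_postgres" },
      "table": "blahs",
      "selectQuery": "select blah_id from blahs"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "directoryPath": "s3://your-bucket/postgres-dump/"
    },
    {
      "id": "RdsToS3Copy",
      "type": "CopyActivity",
      "runsOn": { "ref": "MyEC2Resource" },
      "input": { "ref": "SourceTable" },
      "output": { "ref": "S3Output" }
    }
  ]
}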
Controller answered 6/12, 2015 at 17:28 Comment(2)
It's just too bad this isn't available in CA yet.Zinfandel
Can you add a screenshot of what the result should look like? I get a pretty weird-looking structure.Waterrepellent

AWS now allows partners to do near-real-time RDS -> Redshift inserts.

https://aws.amazon.com/blogs/aws/fast-easy-free-sync-rds-to-redshift/

Nicolettenicoli answered 22/1, 2015 at 20:54 Comment(2)
As of 2/25/2015 the above link only supports MySQL -> Redshift and doesn't have support for Postgres, which the OP is asking about.Olympias
Looks like it's been updated since then. From the link: All four data integration solutions discussed above can be used with all RDS database engines (MySQL, SQL Server, PostgreSQL, and Oracle).Methodist
