Amazon Redshift: Copying Data Between Databases

I am looking to copy data between databases on Amazon Redshift. Before this, I was copying data from a Redshift database to a PostgreSQL database hosted on an EC2 instance for analytical purposes. I had a Ruby script that would do it using the dblink extension.

But now, since the data has become unmanageable on the PostgreSQL instance, we have decided to copy it to a separate database on the same Redshift cluster.

Bump! What I found was that dblink is not available for PostgreSQL 8.0 (the version Redshift is based on). Is there a way I can get this working on Redshift?

I don't want to use the COPY command and add a layer of S3 in between, or even Data Pipeline. You see, it's the same cluster, just different databases.

In case there is no other way, should I use Data Pipeline, or COPY to S3 and then back to Redshift? Or is it essentially the same in the end?

P.S. Any help/leads would be appreciated. I'll do the hard work; I just need a heads-up.

Tho answered 1/6, 2015 at 12:50 Comment(0)

It now seems to be possible (since June 3, 2016).

See: https://blogs.aws.amazon.com/bigdata/post/Tx1GQ6WLEWVJ1OX/JOIN-Amazon-Redshift-AND-Amazon-RDS-PostgreSQL-WITH-dblink

In the above article, they give an example of copying from Redshift to another database:

I'll annotate with (postgres cluster) and (redshift cluster) for clarity.

Copying data from Amazon Redshift to RDS PostgreSQL

The SQL to create the table: (postgres cluster)

CREATE TABLE sales_summary (
   saletime timestamp,
   sellerid int,
   sales decimal
);

[...]

The following query shows how to insert data into the tables using the dblink function:

The copy is from (redshift cluster) to (postgres cluster):

INSERT INTO sales_summary
SELECT *
FROM dblink('foreign_server',$REDSHIFT$
    SELECT date_trunc('hours', saletime) AS ts, sellerid, sum(pricepaid) sales
    FROM sales 
    WHERE saletime >= '2008-01-01'
    AND saletime < '2008-02-01'
    GROUP BY ts, sellerid 
    ORDER BY sales 
$REDSHIFT$) AS t1 (saletime timestamp, sellerid int, sales decimal);
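
For reference, the dblink call above assumes a foreign server has already been set up on the PostgreSQL side; a minimal sketch of that setup, with placeholder host, database, user, and password (the linked article walks through the exact steps):

-- Run on the RDS PostgreSQL instance (placeholders throughout)
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;

CREATE SERVER foreign_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'your-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com',
           port '5439', dbname 'yourredshiftdb', sslmode 'require');

CREATE USER MAPPING FOR your_pg_user
  SERVER foreign_server
  OPTIONS (user 'your_redshift_user', password 'your_password');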

Edit: I've used this in production to load tables with many millions of rows daily, and I haven't had a single issue with it. So I would recommend it as the preferred method for loading from Redshift to RDS PostgreSQL (as opposed to COPYing, or worse, loading the data into memory with an ORM).

Scrappy answered 29/8, 2016 at 16:59 Comment(2)
The question was about copying from one Redshift database to another Redshift database on the same cluster, but this answer talks about copying to/from a Postgres database. – Exercitation
@rado. Yes, I didn't read carefully. In this case, UNLOAD then COPY seems to be the preferred way (even though the OP would like something different). – Scrappy

There is no way to access tables from two different databases in the same query.

You should unload the data from one database to S3 using the UNLOAD command, and then load it into a table in the new database using the COPY command.
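
A minimal sketch of that flow, with placeholder bucket, credentials, and table names (none of these come from the original answer):

-- Connected to the source database on the cluster
UNLOAD ('SELECT * FROM yourtable')
TO 's3://your-bucket/yourtable_'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
DELIMITER '|' GZIP;

-- Connected to the target database on the same cluster
COPY yourtable
FROM 's3://your-bucket/yourtable_'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
DELIMITER '|' GZIP;

UNLOAD writes one gzipped file per slice under the prefix, and COPY loads every object sharing that prefix, so both halves run in parallel across the cluster.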

Intensifier answered 2/6, 2015 at 9:5 Comment(1)
How about using AWS Data Pipeline for this? – Tho

You can script this pipeline using Python/psycopg2/boto (a rough sketch is shown below).

  • Python connects to the source PostgreSQL database and opens a data pipe for reading.
  • The data stream is compressed and pumped to S3 using multipart upload.
  • The data is loaded into Redshift from S3 using the COPY command.

I did just that in PostgreSQL-to-Redshift-Data-Loader.
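
A rough sketch of that flow, here assuming psycopg2 and boto3 (rather than the older boto); the hosts, credentials, bucket, and table names are placeholders, and the loader project itself may do this differently:

# Sketch only: hosts, credentials, bucket and table names are placeholders.
import gzip
import boto3
import psycopg2

SRC_DSN = "host=your-pg-host dbname=sourcedb user=pg_user password=pg_pass"                  # placeholder
DST_DSN = "host=your-redshift-host port=5439 dbname=targetdb user=rs_user password=rs_pass"  # placeholder
BUCKET, KEY = "your-bucket", "staging/yourtable.csv.gz"                                      # placeholder

# 1) Stream the source table out of PostgreSQL as gzipped CSV.
with psycopg2.connect(SRC_DSN) as src, gzip.open("/tmp/yourtable.csv.gz", "wt", encoding="utf-8") as f:
    with src.cursor() as cur:
        cur.copy_expert("COPY yourtable TO STDOUT WITH CSV", f)

# 2) Push the file to S3; boto3 switches to multipart upload automatically for large files.
boto3.client("s3").upload_file("/tmp/yourtable.csv.gz", BUCKET, KEY)

# 3) COPY from S3 into Redshift.
with psycopg2.connect(DST_DSN) as dst, dst.cursor() as cur:
    cur.execute(
        "COPY yourtable FROM 's3://" + BUCKET + "/" + KEY + "' "
        "CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>' "
        "CSV GZIP;"
    )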

Freetown answered 9/9, 2016 at 15:10 Comment(0)

Here's the complete shell script (change the export step from MySQL to PostgreSQL):

s3bucket=s3://yours3bucket/yourfolder

# Upload the exported, gzipped table file to S3
aws s3 cp /tmp/yourtable.txt.gz $s3bucket/yourtable.txt.gz

# Verify the object now exists in the bucket
aws s3 ls $s3bucket/yourtable.txt.gz
if [ "$?" = "0" ]; then
   echo "File successfully uploaded to S3 bucket."
else
   echo "Upload to S3 bucket failed."
fi
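
The script assumes /tmp/yourtable.txt.gz already exists. One way to produce it from PostgreSQL (a sketch; the table name and connection variables here are placeholders) is a psql \copy piped through gzip:

psql -U $postgres_user -h $postgres_host -p $postgres_port $postgres_database \
  -c "\copy yourtable TO STDOUT WITH (DELIMITER '|')" \
  | gzip > /tmp/yourtable.txt.gz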

After your file has been exported from PostgreSQL and uploaded to Amazon S3, you can just do this:

psql -U$postgres_user -h $postgres_host -p $postgres_port $redshift_database -c \
"copy yourdatabase.yourtable from '$s3bucket/yourtable.txt.gz' credentials \
'aws_access_key_id=ajgaslgkshtasj;aws_secret_access_key=shfs/34fiethajthaslthksthask' \
delimiter '|' gzip removequotes ESCAPE ACCEPTINVCHARS ACCEPTANYDATE;"
Designate answered 4/2, 2017 at 11:38 Comment(0)

If you have tables in two different clusters, it is better to go with S3 in the middle. The advantage is that if the data is huge, copying it from S3 to Redshift will be much faster than copying it from your local client to Redshift, because the AWS services load it in parallel.

So the steps are:

1) Download the data as a text file from SQL Workbench/J -> Database Explorer -> Table -> right click -> Export Data -> Type (Text) & Field delimiter (, comma).

2) Rename the file extension to .csv and upload it to the specified folder in your S3 bucket.

3) Now log in to your database (via SQL Workbench/J) and run the command below.

copy fact_transaction from 's3://bucket_name/folder_name'
CREDENTIALS 'aws_access_key_id=XXXXXXX;aws_secret_access_key=YYYYYYYY'
delimiter ','
csv
region 'us-east-1';
Undesigning answered 8/2, 2018 at 13:18 Comment(0)
