How to write data to Redshift that is a result of a dataframe created in Python?

I have a dataframe in Python. Can I write this data to Redshift as a new table? I have successfully created a db connection to Redshift and am able to execute simple sql queries. Now I need to write a dataframe to it.

Tasso answered 15/7, 2016 at 18:33 Comment(1)
Maybe you can provide the code you currently have to make it easier to provide an answer.Linolinocut
64

You can use to_sql to push data to a Redshift database. I've been able to do this using a connection to my database through a SQLAlchemy engine. Just be sure to set index = False in your to_sql call. The table will be created if it doesn't exist, and you can specify if you want your call to replace the table, append to the table, or fail if the table already exists.

from sqlalchemy import create_engine
import pandas as pd

conn = create_engine('postgresql://username:[email protected]:5439/yourdatabase')

df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11},{'A':'bar', 'B':'blue', 'C': 20}])

df.to_sql('your_table', conn, index=False, if_exists='replace')

Note that you may need to pip install psycopg2 in order to connect to Redshift through SQLAlchemy.
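If the inserts are slow, to_sql also takes a chunksize argument and, in newer pandas versions, method='multi', which batches many rows into each INSERT statement; for example (the values here are illustrative):

df.to_sql('your_table', conn, index=False, if_exists='append',
          chunksize=1000, method='multi')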

to_sql Documentation

Holmium answered 26/9, 2016 at 18:24 Comment(8)
does if_exists='replace' work for you? it doesn't do anything for meAdah
Yes, if the table already exists then the contents of the table are replaced with the contents of the dataframe.Holmium
@Andrew, does the to_sql method in pandas take advantage of Redshift's MPP architecture? I noticed that it takes a bit of time to copy a DF with 22K rowsEnchilada
@lollerskates, no, it doesn't, as far as I know. Pandas doesn't even know it's communicating with Redshift specifically since you pass the connection/engine object into the method. One possible reason that the insert is slow is because when pandas submits the SQL command, it's not doing one insert with 22K records; it's inserting each record individually. I've been able to speed things up by monkey patching pandas so that it will do a bulk insert, as explained here: github.com/pandas-dev/pandas/issues/8953Holmium
@Holmium I have trouble configuring your connection string "postgresql://username:[email protected]:5439/yourdatabase". Can you please provide more detail on how to replace each element with my credentials?Ln
How do I add data types while using to_sql, e.g. JSON data for the Redshift SUPER type?Sylvanus
If you need to add the dataframe to an existing table you can use if_exists ='append'Jaramillo
I get an error: AttributeError: 'Engine' object has no attribute 'cursor'Tinstone
17
import pandas_redshift as pr

pr.connect_to_redshift(dbname = <dbname>,
                       host = <host>,
                       port = <port>,
                       user = <user>,
                       password = <password>)

pr.connect_to_s3(aws_access_key_id = <aws_access_key_id>,
                 aws_secret_access_key = <aws_secret_access_key>,
                 bucket = <bucket>,
                 subdirectory = <subdirectory>)

# Write the DataFrame to S3 and then to Redshift
pr.pandas_to_redshift(data_frame = data_frame,
                      redshift_table_name = 'gawronski.nba_shots_log')

Details: https://github.com/agawronski/pandas_redshift

Kaneshakang answered 2/8, 2017 at 5:18 Comment(4)
I am getting this error "current transaction is aborted, commands ignored until end of transaction block". Any idea about this?Isiahisiahi
I used the red-panda package and installed it with "pip install red-panda". It was better suited for me and completed the task in 2 minutes.Isiahisiahi
Isn't there any other way to connect to S3? boto3 connects to the S3 bucket itself in AWS Lambda.Bunnell
This is a wrapper over boto3.Kaneshakang
11

I tried using pandas df.to_sql() but it was tremendously slow. It was taking me well over 10 minutes to insert 50 rows. See this open issue (as of writing)

I tried using odo from the blaze ecosystem (as per the recommendations in the issue discussion), but faced a ProgrammingError which I didn't bother to investigate.

Finally what worked:

import psycopg2

# Fill in the blanks for the conn object
conn = psycopg2.connect(user = 'user',
                        password = 'password',
                        host = 'host',
                        dbname = 'db',
                        port = 666)
cursor = conn.cursor()

# Adjust the number of %s placeholders (the ...) to match your number of columns
args_str = b','.join(cursor.mogrify("(%s,%s,...)", x) for x in tuple(map(tuple, np_data)))
# Replace your_table and the column list (a,b,...) with your actual table and column names
# ("table" on its own is a reserved word and causes a syntax error)
cursor.execute("insert into your_table (a,b,...) VALUES " + args_str.decode("utf-8"))

conn.commit()
cursor.close()
conn.close()

Yep, plain old psycopg2. This is for a numpy array, but converting from a df to an ndarray shouldn't be too difficult. This gave me around 3k rows/minute.
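For reference, the np_data array used above can be produced from a dataframe in one line (assuming df holds your data):

np_data = df.to_numpy()  # or df.values on older pandas versions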

However, the fastest solution, as per recommendations from other teammates, is to use the COPY command after dumping the dataframe as a TSV/CSV to an S3 bucket and then copying it over. You should look into this if you're copying really huge datasets. (I will update here if and when I try it out)

Trabzon answered 18/7, 2017 at 17:0 Comment(5)
This way inserts 100k rows in 1 minute.Devereux
Could you explain what needs to be placed in the ... places in your two lines args_str and cursor.execute?Lynnlynna
Hi @JonasPalačionis, that is a placeholder for the number of columns you have in your data. For 2 columns, it would be (%s,%s) and the cursor.execute would be (a,b) assuming your columns are named a and b.Trabzon
I get this error while trying this sol: SyntaxError: syntax error at or near "table" LINE 1: insert into table (id,type,customer,customer_id,generation_d... ^Infare
np_data = df.to_numpy(), but your cursor.execute() is showing a syntax error. Please fix.Bunnell
6

Assuming you have access to S3, this approach should work:

Step 1: Write the DataFrame as a csv to S3 (I use AWS SDK boto3 for this)
Step 2: You know the columns, datatypes, and key/index for your Redshift table from your DataFrame, so you should be able to generate a create table script and push it to Redshift to create an empty table
Step 3: Send a copy command from your Python environment to Redshift to copy data from S3 into the empty table created in step 2

Works like a charm every time.

Step 4: Before your cloud storage folks start yelling at you, delete the csv from S3

If you see yourself doing this several times, wrapping all four steps in a function keeps it tidy.
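A rough sketch of those four steps with boto3 and psycopg2, assuming conn is an open psycopg2 connection to Redshift; the function name, column definitions, and IAM role are placeholders to adapt:

import boto3

def df_to_redshift(df, table, bucket, key, conn, iam_role):
    # Step 1: write the DataFrame as a csv to S3
    s3 = boto3.client('s3')
    s3.put_object(Bucket=bucket, Key=key,
                  Body=df.to_csv(index=False, header=False).encode('utf-8'))

    with conn.cursor() as cur:
        # Step 2: create the empty table (column definitions here are placeholders)
        cur.execute("CREATE TABLE IF NOT EXISTS %s (a VARCHAR(10), b VARCHAR(10), c INT)" % table)
        # Step 3: COPY from S3 into the empty table
        cur.execute("COPY %s FROM 's3://%s/%s' IAM_ROLE '%s' CSV" % (table, bucket, key, iam_role))
    conn.commit()

    # Step 4: delete the csv from S3
    s3.delete_object(Bucket=bucket, Key=key)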

Luigiluigino answered 4/2, 2017 at 23:50 Comment(1)
I use the same solution to dump over a million rows. I chunk them 100k rows at a time into csv files, then use a manifest to import them in one go. The only problem is that I was not able to gzip the files to speed up COPY.Interlanguage
5

I used to rely on pandas to_sql() function, but it is just too slow. I have recently switched to doing the following:

import pandas as pd
import s3fs # great module which allows you to read/write to s3 easily
import sqlalchemy

df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11},{'A':'bar', 'B':'blue', 'C': 20}])

s3 = s3fs.S3FileSystem(anon=False)
filename = 'my_s3_bucket_name/file.csv'
with s3.open(filename, 'w') as f:
    df.to_csv(f, index=False, header=False)

con = sqlalchemy.create_engine('postgresql://username:[email protected]:5439/yourdatabase')
# make sure the schema for mytable exists

# if you want to clear out the existing rows but keep the table, leave the DELETE mytable in
# if you want to only append, just remove the DELETE mytable line

con.execute("""
    DELETE mytable;
    COPY mytable
    from 's3://%s'
    iam_role 'arn:aws:iam::xxxx:role/role_name'
    csv;""" % filename)

The role has to allow Redshift access to S3; see here for more details.

I found that for a 300KB file (12000x2 dataframe) this takes 4 seconds, compared to the 8 minutes I was getting with the pandas to_sql() function.

Geoponics answered 10/1, 2019 at 16:34 Comment(1)
Adding method='multi' to my to_sql() call substantially sped it upCannabin
1

For the purposes of this conversation, Postgres = Redshift. You have two options:

Option 1:

From Pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql

The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL.

Writing DataFrames

Assuming the following data is in a DataFrame data, we can insert it into the database using to_sql().

id  Date    Col_1   Col_2   Col_3
26  2012-10-18  X   25.7    True
42  2012-10-19  Y   -12.4   False
63  2012-10-20  Z   5.73    True

In [437]: data.to_sql('data', engine)

With some databases, writing large DataFrames can result in errors due to packet size limitations being exceeded. This can be avoided by setting the chunksize parameter when calling to_sql. For example, the following writes data to the database in batches of 1000 rows at a time:

In [438]: data.to_sql('data_chunked', engine, chunksize=1000)

Option 2

Or you can simply do your own. If you have a dataframe called data, simply loop over it using iterrows:

for index, row in data.iterrows():

then add each row to your database. I would use copy instead of insert for each row, as it will be much faster.

http://initd.org/psycopg/docs/usage.html#using-copy-to-and-copy-from
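For reference, the plain per-row insert version of that loop might look like the sketch below, assuming a psycopg2 connection conn, a target table named data with the columns from the example above, and id being the dataframe index:

cursor = conn.cursor()
for index, row in data.iterrows():
    # one INSERT per row; works, but copy (linked above) is much faster for large frames
    cursor.execute(
        "INSERT INTO data (id, date, col_1, col_2, col_3) VALUES (%s, %s, %s, %s, %s)",
        (index, row['Date'], row['Col_1'], row['Col_2'], row['Col_3'])
    )
conn.commit()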

Christmann answered 18/7, 2016 at 18:28 Comment(1)
Thanks for your answer, but dataframes create indexes. Redshift doesn't support indexes, so I'm not sure if we can write data to Redshift as a dataframe.Tasso
0

Since none of the answers were able to solve my query, I googled and found the following snippet, which completed the work in 2 minutes. I am using Python 3.8.5 on Windows.

from red_panda import RedPanda
import pandas as pd
df = pd.read_csv('path_to_read_csv_file')
redshift_conf = {
    "user": "username",
    "password": "password",
    "host": "hostname",
    "port": port number in integer,
    "dbname": "dbname",
}

aws_conf = {
    "aws_access_key_id": "<access_key>",
    "aws_secret_access_key": "<secret_key>",
    # "aws_session_token": "temporary-token-if-you-have-one",
}

rp = RedPanda(redshift_conf, aws_conf)
s3_bucket = "bucketname"
s3_path = "subfolder if any" # optional, if you don't have any sub folders
s3_file_name = "filename" # optional, randomly generated if not provided
rp.df_to_redshift(df, "table_name", bucket=s3_bucket, path=s3_path, append=False)

For more info, check out the package on GitHub here.

Isiahisiahi answered 30/11, 2021 at 4:56 Comment(0)
