python pandas to_sql with sqlalchemy: how to speed up exporting to MS SQL?

I have a dataframe with ca. 155,000 rows and 12 columns. If I export it to csv with dataframe.to_csv, the output is an 11MB file (which is produced instantly).

If, however, I export to a Microsoft SQL Server with the to_sql method, it takes between 5 and 6 minutes! No columns are text: only int, float, bool and dates. I have seen cases where ODBC drivers set nvarchar(max) and this slows down the data transfer, but it cannot be the case here.

Any suggestions on how to speed up the export process? Taking 6 minutes to export 11 MBs of data makes the ODBC connection practically unusable.

Thanks!

My code is:

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)
conn = engine.connect()

metadata = MetaData(conn)

my_data_frame.to_sql(TableName,engine)
Laic answered 17/4, 2015 at 17:55 Comment(5)
The only thing I can think of is to export just the structure, i.e. column names and data types but no rows, to SQL, then export the file to CSV and use something like the import/export wizard to append the CSV file to the SQL table. This way I don't have to define all the column types again; this is important because import tools tend to read the first x rows to guess data types, and if the first rows are all NULLs the guess will be wrong. However, the fact remains that the to_sql method is practically unusable other than for tiny tables. Have you experienced this with other databases, too?Laic
I tried the same at home, with a SQL Server Express running on my same PC, and python took 2 minutes to transfer a dataframe of 1 million rows x 12 columns of random number to SQL (size in CSV = 228MB). Not super fast but acceptable. It took 6 minutes (for a much smaller file) on a work PC connecting to a SQL server just a few miles away. Do you know if there is any parameter in pandas, sqlalchemy or pyodbc to speed up the transfer? I connect to that same SQL server a lot with many other tools, and it's never that slow. Thanks!Laic
Anyone? I also verified that the pandas.read_sql_table method is reasonably fast. It's only writing that is slow, even when writing a table with no constraints. Any ideas? I can't be the only one who has ever experienced this, yet I cannot seem to find any documentation on this online... :(Laic
Maybe try breaking it down by chunksize? So, for example, have a for loop go through chunks of 10,000 rows (my_data_frame.to_sql(TableName,engine,chunksize=10000)).Kalbli
Or just export the data to a csv and then use bulk insert (which is very, very fast). You will have to build a format file but it might be worth it. linkKalbli

I recently had the same problem and felt like adding an answer for others. to_sql seems to send an INSERT query for every row, which makes it really slow. But since pandas 0.24.0 there is a method parameter in to_sql() where you can define your own insertion function, or just use method='multi' to tell pandas to pass multiple rows in a single INSERT query, which makes it a lot faster.

Note that your database may have a parameter limit. In that case you also have to define a chunksize.

So the solution should simply look like this:

my_data_frame.to_sql(TableName, engine, chunksize=<yourParameterLimit>, method='multi')

If you do not know your database's parameter limit, just try it without the chunksize parameter. It will either run or give you an error telling you your limit.
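
For illustration, here is a minimal sketch combining both parameters. The connection URL, driver name and table name are placeholders (not part of the original answer), and the 2100 figure is the SQL Server per-statement parameter limit mentioned in the comments below; since method='multi' uses one parameter per cell, the chunk size is capped at that limit divided by the number of columns:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection details - adjust server, database and driver to your setup.
engine = create_engine('mssql+pyodbc://myserver/mydatabase?driver=ODBC+Driver+17+for+SQL+Server')

df = pd.DataFrame({'a': range(1000), 'b': range(1000), 'c': range(1000)})

# method='multi' sends one parameter per cell, so keep each chunk below the
# ~2100-parameter limit by dividing it by the number of columns.
chunksize = 2100 // len(df.columns)

df.to_sql('mytable', engine, if_exists='append', index=False,
          method='multi', chunksize=chunksize)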

Dross answered 19/9, 2019 at 11:53 Comment(8)
I have included method='multi' when loading data to Postgres and it sped up loading like 1000 times :) Data with 900k rows could not complete within 6h, but when I used 'multi' it took 5 min. Thanks for the tip.Maeganmaelstrom
Same. This worked like a charm. Can somebody explain chunksize in the answer?Galvano
@Galvano As said in my answer, sending all rows at once may exceed your database parameter limit and lead to an error. To avoid this you can specify a chunksize. This will divide the insert into chunks of the number of rows you specified in chunksize. If your database has a parameter limit of e.g. 100,000 and your DataFrame has 1 million rows, it would fail unless you add chunksize=100000.Dross
many upload jobs I was doing would completely time-out. Now I added method='multi' and scripts work perfectly. Thanks so much for your post!Sergiosergipe
Why this is not the default param ;-(Bridgework
I had the problem with MS-SQL and it returned "maximum parameters 2100" so I divided 2100 by the number of columns which results in ~56 and took 50 as chunksize. Still an extreme speed-up (the table has ~3000 rows to be inserted)Heterogenous
@Heterogenous This is a relatively low parameter limit. Did you try using 2100 as chunksize? Otherwise: The proper way to load massive data into a database is a bulk insert from a csv file. It may seem strange, but databases are extremely optimized for that. So writing a csv file and loading it into the database can often be much faster. Instead of using the string 'multi' it is possible to write your own function for the insert. One could even implement writing a temporary csv file and loading it into the database.Dross
As mentioned by @Fips, for SQL server you need 2100//len(df.columns) for your chunksize, and the error message you get may be quite cryptic if you don't do this. Also, for my table with a huge number of columns and around 10,000 inserts, there was no speed-up, so your mileage may vary.Snore

The DataFrame.to_sql method generates INSERT statements for your ODBC connector, which are then treated by the ODBC connector as regular inserts.

When this is slow, it is not the fault of pandas.

Saving the output of the DataFrame.to_sql method to a file, then replaying that file over an ODBC connector will take the same amount of time.

The proper way of bulk importing data into a database is to generate a csv file and then use a load command, which in the MS flavour of SQL databases is called BULK INSERT.

For example:

BULK INSERT mydatabase.myschema.mytable
FROM 'mydatadump.csv';

The syntax reference is as follows:

BULK INSERT 
   [ database_name . [ schema_name ] . | schema_name . ] [ table_name | view_name ] 
      FROM 'data_file' 
     [ WITH 
    ( 
   [ [ , ] BATCHSIZE = batch_size ] 
   [ [ , ] CHECK_CONSTRAINTS ] 
   [ [ , ] CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ] 
   [ [ , ] DATAFILETYPE = 
      { 'char' | 'native'| 'widechar' | 'widenative' } ] 
   [ [ , ] FIELDTERMINATOR = 'field_terminator' ] 
   [ [ , ] FIRSTROW = first_row ] 
   [ [ , ] FIRE_TRIGGERS ] 
   [ [ , ] FORMATFILE = 'format_file_path' ] 
   [ [ , ] KEEPIDENTITY ] 
   [ [ , ] KEEPNULLS ] 
   [ [ , ] KILOBYTES_PER_BATCH = kilobytes_per_batch ] 
   [ [ , ] LASTROW = last_row ] 
   [ [ , ] MAXERRORS = max_errors ] 
   [ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ] 
   [ [ , ] ROWS_PER_BATCH = rows_per_batch ] 
   [ [ , ] ROWTERMINATOR = 'row_terminator' ] 
   [ [ , ] TABLOCK ] 
   [ [ , ] ERRORFILE = 'file_name' ] 
    )] 
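
For example, a rough sketch of driving both steps from Python (the driver name, file path and table name here are assumptions; also note that BULK INSERT reads the file on the SQL Server machine, so the path must be visible from the server's side):

import pandas as pd
import pyodbc

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})  # stand-in for my_data_frame

# Dump without header/index so the CSV columns line up with the table columns.
csv_path = r'C:\data\mydatadump.csv'
df.to_csv(csv_path, index=False, header=False)

# Assumed DSN-less connection string; adjust driver/server/authentication.
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;'
    'DATABASE=mydatabase;Trusted_Connection=yes',
    autocommit=True,
)
conn.execute(
    "BULK INSERT mydatabase.myschema.mytable "
    f"FROM '{csv_path}' "
    "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')"
)
conn.close()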
Guzman answered 11/6, 2015 at 9:3 Comment(3)
Those interested in doing a BULK INSERT into SQL Server via Python might also be interested in having a look at my answer to a related question.Cypher
So. Fast...FastCoolish
This isn't a valid solution if data_file isn't a path accessible from the server on which SQL Server is runningDup

You can use this: what makes it faster is the method parameter of pandas' to_sql. I hope this helps.

In my experience, this took the export from seemingly infinite time down to 8 seconds.


import time

import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('test.csv')

conn = create_engine(<connection_string>)

start_time = time.time()
df.to_sql('table_name', conn, method='multi', index=False, if_exists='replace')
print("--- %s seconds ---" % (time.time() - start_time))
Privilege answered 16/1, 2020 at 4:33 Comment(1)
this works as of March 2022Breland

With SQLAlchemy >= 1.3, set fast_executemany=True while creating the engine object. Reference
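
A minimal sketch of what that looks like (the connection URL and driver query parameter are assumptions; my_data_frame and the table name are taken from the question):

from sqlalchemy import create_engine

# fast_executemany is an option of the mssql+pyodbc dialect.
engine = create_engine(
    'mssql+pyodbc://myserver/mydatabase?driver=ODBC+Driver+17+for+SQL+Server',
    fast_executemany=True,
)
my_data_frame.to_sql('mytable', engine, index=False, if_exists='append')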

Tourcoing answered 21/9, 2021 at 7:53 Comment(1)
The "multi" solution that helped everyone else did not help me, but this did. Thank you.Snore

You can use d6tstack, which has fast pandas-to-SQL functionality because it uses native DB import commands. It supports MS SQL, Postgres and MySQL.

import d6tstack.utils

uri_psql = 'postgresql+psycopg2://usr:pwd@localhost/db'
d6tstack.utils.pd_to_psql(df, uri_psql, 'table')
uri_mssql = 'mssql+pymssql://usr:pwd@localhost/db'
d6tstack.utils.pd_to_mssql(df, uri_mssql, 'table', 'schema')  # experimental

Also useful for importing multiple CSVs with data schema changes and/or preprocessing with pandas before writing to the db; see further down in the examples notebook:

import glob
import d6tstack.combine_csv

d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'),
    apply_after_read=apply_fun).to_psql_combine(uri_psql, 'table')
Wethington answered 31/10, 2018 at 21:2 Comment(2)
where do I put user name and password in the connection string? can you give an example please?Delaine
Thanks for this! Ran pd_to_psql on a 30 million row pandas dataframe and it took ~4 minutes.Eliza

Why is pandas.DataFrame.to_sql slow?

When uploading data from pandas to Microsoft SQL Server, most of the time is actually spent converting from pandas into Python objects and then into the representation needed by the MS SQL ODBC driver. One of the reasons pandas is much faster for analytics than basic Python code is that it works on lean native arrays of integers / floats / … that don't have the same overhead as their respective Python counterparts. The to_sql method converts all these lean columns into many individual Python objects and thus doesn't get the usual performance treatment that the other pandas operations enjoy.

Use turbodbc.Cursor.executemanycolumns to speed this up

Given a pandas.DataFrame, you can use turbodbc and pyarrow to insert the data with less conversion overhead than the conversion to Python objects.

import pyarrow as pa
import turbodbc

cursor = …  # cursor to a MS SQL connection initiated with turbodbc
df = …  # the pd.DataFrame to be inserted

# Convert the pandas.DataFrame to a pyarrow.Table; most of the columns
# will be zero-copy and thus this is quite fast.
table = pa.Table.from_pandas(df)

# Insert into the database
cursor.executemanycolumns("INSERT INTO my_table VALUES (?, ?, ?)", table)

Why is this faster?

Instead of the conversion path pd.DataFrame -> collection of Python objects -> ODBC data structures, we take the path pd.DataFrame -> pyarrow.Table -> ODBC data structures. This is more performant because:

  • Most of the columns of a pandas.DataFrame can be converted to columns of the pyarrow.Table without copying. The columns of the table will reference the same memory. So no actual conversion is done.
  • The conversion is done fully in native code with native types. This means that at no stage do we incur the overhead of Python objects, as long as we don't have object-typed columns.
Zincate answered 1/7, 2020 at 7:25 Comment(1)
While an astute analysis, I cannot use this solution since the pyarrow library is so large (50MB). AWS Lambdas have a 250MB limit for all libraries and in our case we're already close to that. IOW, solutions that don't use additional libraries are always preferred. Nonetheless, again, this is a great point to consider when analyzing slow performance.Lafferty

For sqlalchemy >= 1.3, rather than using to_sql()'s method parameter, use fast_executemany=True in sqlalchemy's create_engine(). This should be at least as fast as method="multi" while avoiding T-SQL's limit of 2100 parameter values for a stored procedure, which causes the error seen here.

Credit to Gord Thompson from the same link.

Science answered 5/10, 2022 at 18:59 Comment(1)
It looks like for MS ODBC it's best to leave to_sql as default (not multi) and use fast_executemany in the engine.Shadrach

I was running out of time and memory (more than 18GB allocated for a DataFrame loaded from 120MB CSV) with this line:

df.to_sql('my_table', engine, if_exists='replace', method='multi', dtype={"text_field": db.String(64), "text_field2": db.String(128), "intfield1": db.Integer(), "intfield2": db.Integer(), "floatfield": db.Float()})

Here is the code that helped me to import and track progress of insertions at the same time:

import sqlalchemy as db
engine = db.create_engine('mysql://user:password@localhost:3306/database_name', echo=False)
connection = engine.connect()
metadata = db.MetaData()

my_table = db.Table('my_table', metadata,
              db.Column('text_field', db.String(64), index=True),
              db.Column('text_field2', db.String(128), index=True),
              db.Column('intfield1', db.Integer()),
              db.Column('intfield2', db.Integer()),
              db.Column('floatfield', db.Float())
             )
metadata.create_all(engine)
kw_dict = df.reset_index().sort_values(by="intfield2", ascending=False).to_dict(orient="records")

batch_size=10000
for batch_start in range(0, len(kw_dict), batch_size):
    print("Inserting {}-{}".format(batch_start, batch_start + batch_size))
    connection.execute(my_table.insert(), kw_dict[batch_start:batch_start + batch_size])
Saltillo answered 18/2, 2020 at 23:56 Comment(1)
You should add the chunksize parameter to pd.to_sql, like 5000. As the documentation says for chunksize (int): specify the number of rows in each batch to be written at a time; by default, all rows will be written at once.Prent

Based on this answer - Aseem.

You can use the cursor object's copy_from method to simulate a bulk load. This was tested on Postgres (copy_from is a psycopg2 cursor method); try it with your DB:

import pandas as pd
from io import StringIO  # Python 3; the original answer used the Python 2 StringIO module
from sqlalchemy import create_engine

ServerName = "myserver"
Database = "mydatabase"
TableName = "mytable"

engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database)  # don't forget to add a password if needed

my_data_frame.head(0).to_sql(TableName, engine, if_exists='replace', index=False)  # create an empty table - just for structure
conn = engine.raw_connection()
cur = conn.cursor()
output = StringIO()
my_data_frame.to_csv(output, sep='\t', header=False, index=False)  # an in-memory CSV that will be used for the bulk load
output.seek(0)
cur.copy_from(output, TableName, null="")  # null values become ''; copy_from is a psycopg2 (Postgres) cursor method
conn.commit()
cur.close()
conn.close()
Misdeed answered 16/4, 2019 at 10:2 Comment(0)

My solution to this problem is below, if it helps anyone. From what I've read, pandas' to_sql method loads one record at a time.

You can make a bulk insert statement that loads 1000 lines and commits that transaction instead of committing a single row each time. This increases the speed massively.

import pandas as pd
from sqlalchemy import create_engine
import pymssql
import os

connect_string  = [your connection string]
engine = create_engine(connect_string,echo=False)
connection = engine.raw_connection()
cursor = connection.cursor()

def load_data(report_name):
    # my report_name variable is also my sql server table name so I use that variable to create table name string
    sql_table_name = 'AR_'+str(report_name)
    global chunk # to QC chunks that fail for some reason
    for chunk in pd.read_csv(report_full_path_new,chunksize=1000):
        chunk.replace('\'','\'\'',inplace=True,regex=True) # double up single quotes in the data to escape them in SQL Server
        chunk.fillna('NULL',inplace=True)

        my_data = str(chunk.to_records(index=False).tolist()) # convert data to string 
        my_data = my_data[1:-1] # clean up the ends
        my_data = my_data.replace('\"','\'').replace('\'NULL\'','NULL') # convert blanks to NULLs for SQL Server
        sql_table_name = [your sql server table name]

        sql = """
        INSERT INTO {0} 
        VALUES {1}

         """.format(sql_table_name,my_data)

        cursor.execute(sql)
        # you must call commit() to persist your data if you don't set autocommit to True
        connection.commit()
Spelldown answered 5/6, 2019 at 14:19 Comment(0)

As said in other answers, the reason for the slowdown and/or time-out is that pandas inserts many single rows over and over. The high volume of INSERT commands is slow and/or may overload the target database.

Using method='multi' tells pandas to combine many rows into each INSERT. This is much faster and won't time out as easily.

sqlEngine=create_engine('mysql+mysqlconnector://'+config['user']+':'+config['pass']+'@'+config['host']+'/'+config['dbname'])
dbConnection=sqlEngine.connect()
df.to_sql('table_name',con=dbConnection,method='multi',if_exists='append',index=False)
dbConnection.close()
Sergiosergipe answered 2/3, 2021 at 3:35 Comment(0)

I had a large csv (7m rows, 40 columns) that I was trying to import with speed somewhere near BULK INSERT or bcp. After trying a lot of different methods, I found that the way df.to_sql inserts rows was just not going to be performant enough.

This led me to find pymssql and the bulk_copy command. Although this is not using df.to_sql, I found it was simple enough to convert the df to a list of tuples and import using this method.

I found that the import time improved from ~2 hours using df.to_sql with fast_executemany to ~10 minutes, which is a huge improvement.

Example Below:


import pandas as pd
import pymssql

conn = pymssql.connect(server=server, user=user, password=password, database=database)

df = pd.read_csv('/path/to/file.csv')

# Do some transformations to match the table format
df = df.....

# Convert to a list of tuples
data = [tuple(x) for x in df.to_numpy()]

# Load data into the table
conn.bulk_copy("TableName", data)
conn.commit()

NOTE: You could also autocreate the table using df.to_sql with zero rows instead of transforming to fit an existing table.

Hope this helps someone.

Bellerophon answered 27/2 at 12:5 Comment(4)
Good suggestion, just be aware that pymssql .bulk_copy() cannot deal with nvarchar columns.Cypher
@GordThompson yes you're right, it's another step but a workaround could insert into a temp table then INSERT INTO YourTable SELECT * FROM #TempTable or something.Bellerophon
or use the JSON trick described hereCypher
or the .to_sql() method hereCypher

Probably the pyarrow answer above is best, but for MariaDB, I wrote a wrapper around DataFrame to use executemany and fetchall, which gave me a 300x speedup. This also had the added bonus of not using sqlalchemy at all.

You can use it as normal: df.to_sql(...), or df = read_sql_table(...).

See https://gist.github.com/MichaelCurrie/b5ab978c0c0c1860bb5e75676775b43b
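
For readers who don't want to pull in the gist, a rough, simplified sketch of the raw executemany idea it is based on might look like this (this is not the author's wrapper; the mariadb driver, table and column handling here are assumptions):

import mariadb  # any DB-API 2.0 driver with a fast executemany works similarly
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

conn = mariadb.connect(user='usr', password='pwd', host='localhost', database='db')
cur = conn.cursor()

# Build one parameterised INSERT and let the driver batch all rows in a single call.
placeholders = ', '.join(['?'] * len(df.columns))
columns = ', '.join(df.columns)
cur.executemany(
    f'INSERT INTO mytable ({columns}) VALUES ({placeholders})',
    list(df.itertuples(index=False, name=None)),
)
conn.commit()
conn.close()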

Germanize answered 22/1, 2023 at 14:55 Comment(0)
