pyodbc - very slow bulk insert speed

With this table:

CREATE TABLE test_insert (
    col1 INT,
    col2 VARCHAR(10),
    col3 DATE
)

the following code takes 40 seconds to run:

import pyodbc

from datetime import date


conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
    'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')

rows = []
row = [1, 'abc', date.today()]
for i in range(10000):
    rows.append(row)

cursor = conn.cursor()
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)

conn.commit()

The equivalent code with psycopg2 only takes 3 seconds. I don't think MSSQL is that much slower than PostgreSQL. Any ideas on how to improve the bulk insert speed when using pyodbc?

EDIT: Adding some notes following ghoerz's discovery.

In pyodbc, the flow of executemany is:

  • prepare statement
  • loop for each set of parameters
    • bind the set of parameters
    • execute

In ceODBC, the flow of executemany is:

  • prepare statement
  • bind all parameters
  • execute
Mendel answered 17/4, 2011 at 13:50 Comment(6)
Try using an explicit transaction.Heave
Reading #1064270, it doesn't seem like pyodbc has support for explicit transactions.Mendel
That's not the way I read it. You turn off auto-commit, and have to explicitly call rollback or commit. However, I have no idea if it makes a difference or not, but it would be something I would try myself.Heave
What you described is exactly what my code does. Autocommit is off by default.Mendel
I don't see any reason for this to be slow. What version of SQL Server, and is the installation a standard installation, i.e. no funny configs etc? Like running databases from USB etc? You can also try and attach SQL Profiler to the db and see if you can spot where the inefficiency comes from, but your equivalent code in c# executes in less than 3 seconds on my pc.Ema
Ryk, I think this problem is specific to the Python bindings for MSSQL, and Lasse had the right suggestion, i.e. to get pyodbc to wrap the 10000 INSERTs in one transaction. Unfortunately, I couldn't find a way to do that, and have since changed my code to use BULK INSERT instead (see the sketch below).Mendel
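
The BULK INSERT route mentioned in the comment above might look roughly like the sketch below. The file path and WITH options are assumptions, and BULK INSERT reads the file server-side, so the path must be readable by the SQL Server service account.

from datetime import date

import pyodbc

# Hypothetical path; BULK INSERT reads the file on the SQL Server machine,
# so it must be accessible to the SQL Server service account.
CSV_PATH = r'C:\temp\test_insert.csv'

rows = [(1, 'abc', date.today()) for _ in range(10000)]

# Dump the rows to a comma-separated file matching the table's column order.
with open(CSV_PATH, 'w') as f:
    for col1, col2, col3 in rows:
        f.write('%s,%s,%s\n' % (col1, col2, col3))

conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
    'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')
cursor = conn.cursor()
cursor.execute("BULK INSERT test_insert FROM '" + CSV_PATH + "' "
               "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')")
conn.commit()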

I was having a similar issue with pyODBC inserting into a SQL Server 2008 DB using executemany(). When I ran a profiler trace on the SQL side, pyODBC was creating a connection, preparing the parametrized insert statement, and executing it for one row. Then it would unprepare the statement, and close the connection. It then repeated this process for each row.

I wasn't able to find any solution in pyODBC that didn't do this. I ended up switching to ceODBC for connecting to SQL Server, and it used the parametrized statements correctly.
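
For anyone curious, a minimal sketch of that switch follows, assuming ceODBC accepts the same ODBC connection string and follows the standard DB-API cursor interface; the credentials are placeholders and the rows list mirrors the one in the question.

import ceODBC
from datetime import date

conn = ceODBC.connect('DRIVER={SQL Server Native Client 10.0};'
    'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')

rows = [(1, 'abc', date.today()) for _ in range(10000)]

cursor = conn.cursor()
# ceODBC prepares the statement once and binds the whole parameter array,
# so executemany() does not round-trip to the server for every row.
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()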

Latton answered 29/3, 2012 at 19:43 Comment(1)
Thanks for confirmation and tips. I have filed this as code.google.com/p/pyodbc/issues/detail?id=250Mendel

Trying to insert +2M rows into MSSQL using pyodbc was taking an absurdly long amount of time compared to bulk operations in Postgres (psycopg2) and Oracle (cx_Oracle). I did not have the privileges to use the BULK INSERT operation, but was able to solve the problem with the method below.

Many solutions correctly suggested fast_executemany; however, there are some tricks to using it correctly. First, I noticed that pyodbc was committing after each row when autocommit was set to True in the connect method, so this must be set to False. I also observed a non-linear slowdown when inserting more than ~20k rows at a time, i.e. inserting 10k rows was subsecond, but 50k took upwards of 20s. I assume the transaction log gets quite large and slows the whole thing down. Therefore, you must chunk your inserts and commit after each chunk. I found 5k rows per chunk delivered good performance, but this will obviously depend on many factors (the data, the machine, DB config, etc.).

import pyodbc

CHUNK_SIZE = 5000

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):  # use xrange instead of range on Python 2
        yield lst[i:i + n]

mssql_conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                            server='<SERVER,PORT>',
                            timeout=1,
                            port=<PORT>,
                            uid=<UNAME>, 
                            pwd=<PWD>,
                            TDS_Version=7.2,
                            autocommit=False) #IMPORTANT

mssql_cur = mssql_conn.cursor()
mssql_cur.fast_executemany = True #IMPORTANT

params = [tuple(x) for x in df.values]  # df is a pandas DataFrame holding the rows to insert

stmt = "truncate table <THE TABLE>"
mssql_cur.execute(stmt)
mssql_conn.commit()

stmt = """
INSERT INTO <THE TABLE> (field1...fieldn) VALUES (?,...,?)
"""
for chunk in chunks(params, CHUNK_SIZE): #IMPORTANT
    mssql_cur.executemany(stmt, chunk)
    mssql_conn.commit()
Alaric answered 24/9, 2018 at 8:5 Comment(2)
Thanks a bunch man! Saved my day.Subfusc
Just flagging that for pyodbc.connect, autocommit=False is the default (I think since 2020 or even earlier). And echoing your answer: 1. fast_executemany = True helps, and 2. using 5-25k chunks also appears, anecdotally, to help.Consecutive

I tried both ceODBC and mxODBC and both were also painfully slow. I ended up going with an ADODB connection with help from http://www.ecp.cc/pyado.html. Total run time improved by a factor of 6!

import win32com.client

# dbDIR, dbOut, tblName, fldLST and values are assumed to be defined elsewhere:
# the database directory and file name, the target table, the list of field
# names, and the rows to insert.
comConn = win32com.client.Dispatch(r'ADODB.Connection')
DSN = 'PROVIDER=Microsoft.Jet.OLEDB.4.0;DATA SOURCE=%s%s' % (dbDIR, dbOut)
comConn.Open(DSN)

rs = win32com.client.Dispatch(r'ADODB.Recordset')
rs.Open('[' + tblName + ']', comConn, 1, 3)  # 1 = adOpenKeyset, 3 = adLockOptimistic

for f in values:
    rs.AddNew(fldLST, f)

rs.Update()
Naga answered 5/4, 2012 at 21:3 Comment(0)

pyodbc 4.0.19 added a Cursor#fast_executemany option to help address this issue. See this answer for details.
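
A minimal sketch of how that option is used, reusing the question's table and placeholder credentials (it needs pyodbc 4.0.19 or later and generally works best with Microsoft's ODBC drivers for SQL Server):

import pyodbc
from datetime import date

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')

rows = [(1, 'abc', date.today()) for _ in range(10000)]

cursor = conn.cursor()
cursor.fast_executemany = True  # bind the parameters as an array instead of row by row
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()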

Birthstone answered 1/11, 2017 at 15:31 Comment(0)

I wrote the data to a text file and then invoked the BCP utility. Much, much quicker: from about 20-30 minutes down to a few seconds.
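
That approach might look roughly like the sketch below; the file path, delimiter, table name and credentials are placeholders, and the bcp utility has to be installed on the machine running the script (it ships with the SQL Server command-line tools).

import subprocess
from datetime import date

DATA_FILE = r'C:\temp\test_insert.dat'

rows = [(1, 'abc', date.today()) for _ in range(10000)]

# Write the rows to a pipe-delimited text file.
with open(DATA_FILE, 'w') as f:
    for col1, col2, col3 in rows:
        f.write('%s|%s|%s\n' % (col1, col2, col3))

# Load the file with the bcp command-line utility.
subprocess.run(
    ['bcp', 'test.dbo.test_insert', 'in', DATA_FILE,
     '-S', 'localhost', '-U', 'xxx', '-P', 'yyy',
     '-c',      # character-mode data file
     '-t|'],    # field terminator used when writing the file
    check=True)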

Gamekeeper answered 31/5, 2017 at 19:45 Comment(0)

I was using pypyODBC with Python 3.5 and Microsoft SQL Server Management Studio. A particular table (~70K rows with 40 columns) was taking 112 seconds to INSERT using the .executemany() method with pypyodbc.

With ceODBC it took 4 seconds.

Saritasarkaria answered 26/2, 2017 at 21:51 Comment(0)

I found it insanely slow while uploading ~1.6 million rows (from ~195 CSV files) to a Microsoft SQL Server database table. I did lots of research over the net, but nothing worked for me. After reading the article Why do the queries slow down after a few minutes, when trying to build a DB?, I looked into how the table was structured. My table had 15 columns and all of them had indexes. I deleted all of them, ran my Python script again, and it was a damn miracle: my upload time dropped by 98%.
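
A rough sketch of the same idea that disables and rebuilds the indexes instead of deleting them outright; the connection details, table and index names here are hypothetical.

import pyodbc

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')
cursor = conn.cursor()

# Disable a nonclustered index before loading. Do NOT disable the clustered
# index (or primary key), as that makes the table inaccessible.
cursor.execute("ALTER INDEX ix_big_table_col1 ON dbo.big_table DISABLE")
conn.commit()

# ... run the bulk insert into dbo.big_table here ...

# Rebuild the index once the load is done.
cursor.execute("ALTER INDEX ix_big_table_col1 ON dbo.big_table REBUILD")
conn.commit()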

Hap answered 1/8, 2023 at 11:0 Comment(0)
