Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC

I would like to send a large pandas.DataFrame to a remote server running MS SQL. The way I do it now is by converting a data_frame object to a list of tuples and then sending it off with pyODBC's executemany() function. It goes something like this:

 import pyodbc as pdb

 list_of_tuples = convert_df(data_frame)

 connection = pdb.connect(cnxn_str)

 cursor = connection.cursor()
 cursor.fast_executemany = True
 cursor.executemany(sql_statement, list_of_tuples)
 connection.commit()

 cursor.close()
 connection.close()

I then started to wonder whether things could be sped up (or at least made more readable) by using the data_frame.to_sql() method. I came up with the following solution:

 import sqlalchemy as sa

 engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % cnxn_str)
 data_frame.to_sql(table_name, engine, index=False)

Now the code is more readable, but the upload is at least 150 times slower...

Is there a way to flip the fast_executemany flag when using SQLAlchemy?

I am using pandas-0.20.3, pyODBC-4.0.21 and sqlalchemy-1.1.13.

Mansour answered 28/12, 2017 at 11:22 Comment(0)

EDIT (2019-03-08): Gord Thompson commented below with good news from the SQLAlchemy update logs: since SQLAlchemy 1.3.0, released 2019-03-04, the mssql+pyodbc dialect supports engine = create_engine(sqlalchemy_url, fast_executemany=True). That is, it is no longer necessary to define a function and use @event.listens_for(engine, 'before_cursor_execute'); the function below can be removed, and only the flag needs to be set in the create_engine statement, while still retaining the speed-up.
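
For reference, a minimal sketch of the new-style usage (reusing the cnxn_str, data_frame and table_name placeholders from the question):

from sqlalchemy import create_engine

# SQLAlchemy >= 1.3.0 with the mssql+pyodbc dialect: fast_executemany is
# passed straight to create_engine, so no event listener is needed anymore
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % cnxn_str, fast_executemany=True)
data_frame.to_sql(table_name, engine, index=False)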

Original Post:

Just made an account to post this. I wanted to comment beneath the thread above, as it's a follow-up to the already provided answer. The solution above worked for me with version 17 of the SQL driver, writing to a Microsoft SQL storage from an Ubuntu-based install.

The complete code I used to speed things up significantly (I'm talking a >100x speed-up) is below. This is a turn-key snippet, provided that you alter the connection string with your relevant details. To the poster above, thank you very much for the solution, as I had already been looking for this for quite some time.

import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus


conn =  "DRIVER={ODBC Driver 17 for SQL Server};SERVER=IP_ADDRESS;DATABASE=DataLake;UID=USER;PWD=PASS"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)


@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    print("FUNC call")
    if executemany:
        cursor.fast_executemany = True


table_name = 'fast_executemany_test'
df = pd.DataFrame(np.random.random((10**4, 100)))


s = time.time()
df.to_sql(table_name, engine, if_exists = 'replace', chunksize = None)
print(time.time() - s)

Based on the comments below, I wanted to take some time to explain some limitations of the pandas to_sql implementation and the way the query is handled. There are two things that might cause the MemoryError to be raised, as far as I know:

1) Assuming you're writing to remote SQL storage: when you try to write a large pandas DataFrame with the to_sql method, it converts the entire DataFrame into a list of values. This transformation takes up far more RAM than the original DataFrame does (and on top of that, the old DataFrame still remains in RAM). This list is handed to the final executemany call of your ODBC connector, and I think the connector has some trouble handling such large queries. A way to solve this is to give the to_sql method a chunksize argument (10**5 seems to be around optimal, giving about 600 mbit/s (!) write speeds on a 2-CPU, 7 GB RAM MSSQL storage application from Azure - can't recommend Azure, btw). So the first limitation, the query size, can be circumvented by providing a chunksize argument. However, this still won't let you write a DataFrame of size 10**7 or larger (at least not on the VM I am working with, which has ~55 GB RAM), which is issue number 2.
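
For illustration, a minimal chunked call (same df, engine and table_name as in the snippet above; 10**5 is simply the chunk size that worked well for me):

# write the DataFrame in batches of 10**5 rows so the intermediate
# parameter list handed to executemany stays reasonably small
df.to_sql(table_name, engine, if_exists='replace', chunksize=10**5)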

2) This second issue can be circumvented by breaking up the DataFrame with np.split into chunks of 10**6 rows, which can then be written away iteratively. I will try to make a pull request when I have a solution ready for the to_sql method in the core of pandas itself, so you won't have to do this pre-chunking every time. Anyhow, I ended up writing a function similar (not turn-key) to the following:

import pandas as pd
import numpy as np

def write_df_to_sql(df, **kwargs):
    # np.array_split also copes with row counts that don't divide evenly into chunks
    n_chunks = max(1, df.shape[0] // 10**6)
    for chunk in np.array_split(df, n_chunks):
        chunk.to_sql(**kwargs)
    return True

A more complete example of the above snippet can be viewed here: https://gitlab.com/timelord/timelord/blob/master/timelord/utils/connector.py

It's a class I wrote that incorporates the patch and eases some of the overhead that comes with setting up connections to SQL. I still have to write some documentation. I was also planning on contributing the patch to pandas itself, but I haven't found a nice way to do that yet.

I hope this helps.

Boykin answered 28/12, 2017 at 11:22 Comment(25)
Thanks for this. I'm sure it will help! One more thing to note: pyODBC has to be at 4.0.19 or above.Mansour
my code looks like yours except the engine creation: engine = sa.create_engine('mssql+pyodbc://SERVER/DATABASE?driver=SQL+Server+Native+Client+11.0') . do you think the Native Client has something to do with the disconnect?Pimbley
@j.k. I just wrote some boilerplate code around your solution, which actually does the work, so I have to thank you for the main effort! @ Cameron Taylor: not sure, but it seems likely, I hope you're able to try it with another driver.Boykin
@Boykin This is great. For some reason, I end up getting memory errors now when utilizing this new fast_executemany method as shown above. Would there be any reason this method would be more memory intensive?Marcello
@Marcello I've edited my answer to provide a possible solution to your memory issues. Short answer: to_sql is very memory intensive due to the transformation into a list type (a 700 MB DataFrame seemed to use about 8 GB RAM at peak load with the to_sql method because of this).Boykin
@Boykin That is great work and performance profiling to help understand the issue. I have also posted an answer using the new turbodbc library that might be a good option as well.Marcello
df.to_sql works fine when if_exists is set to 'replace'. When I try to use 'append', the Jupyter notebook crashes. Any solutions?Accustomed
Maybe you're trying to write too large chunks? Not sure, it's hard to judge without a full traceback. Though your question is not really related anymore so I think it's best if you make a new question or look around on SO for similar questions. Good luck.Boykin
It is only 8000 rows and 5 columns. I am using the exact same code you provided. It works fine with 'replace' but not with 'append', How is this not related?Accustomed
I think it is not related as the original question was regarding the speeding up of the method to_sql. You are asking now about an error of an argument in the same method, which isn't related anymore with the original question - afaik. Just trying to adhere to the norms of SO that I normally see. regarding the extra information you've provided now, perhaps the error is raised because the already present table is of a different size and thus cannot be appended to (type error)? Also the last code snippet I've provided was for illustration purposes, you probably need to alter it somewhat.Boykin
@Boykin I posted a question about the memory error here. #49888217Abortionist
Not sure why I haven't shared this before but here is the class I use often for getting dataframes in and out of a SQL database: gitlab.com/timelord/timelord/blob/master/timelord/utils/… Enjoy!Boykin
@Boykin could you please provide an example on how to use this class?Yellowgreen
@Yellowgreen I've updated the class with a proper example. Do note that not every database uses the same driver, and such databases will thus raise an error when using this class. An example of a database that does not use it is PostgreSQL; I haven't found a fast way yet to insert data into PSQL. One way to still use this class in that situation is by explicitly turning the switch off, calling con._init_engine(SET_FAST_EXECUTEMANY_SWITCH=False) after having initialized the class. Good luck.Boykin
@Boykin You are a true gentleman! Thanks!!! I had to add a self._init_engine() call to the _write_df method. However, using "db_type": "mssql+pymssql" is terribly slow. What db_type should I be using for SQL Server?Yellowgreen
@Yellowgreen This class is intended to be used exactly with MSSQL! However, as per my example, use mssql+pyodbc. This will leverage the fast_insert switch that is set to True in the class by default. The class is just a fancy wrapper around both code snippets in my example.Boykin
I've tried pyodbc before. I'm trying it again with your class BUT I keep getting this error: ProgrammingError: ('The SQL contains 26464 parameter markers, but 92000 parameters were supplied', 'HY000'). My df has 91 columns and 1000 rows.Yellowgreen
Have seen that error before but I can't recall the solution. Have you tried updating your libraries? Perhaps that'll do the trick. Iirc anaconda ships with an older pyodbc. Good luck.Boykin
I tried the event-based solution above and got the following error: sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('HY010', '[HY010] [Microsoft][ODBC Driver 11 for SQL Server]Function sequence error (0) (SQLParamData)') [SQL:...... where the ..... is the insert statement generated by the pandas to_sql(). I'm using Python 3.7.0, Windows 7, and SQL Server 12.0.5538.0. Accessing via a named ODBC connection. (APOLOGIES: I posted this comment on another answer as well, just thought this thread seemed active too.)Godwin
I think your ODBC connector is outdated. I remember getting similar issues with this. I think this solution only works for versions above 15. Good luck.Boykin
@Boykin any idea what I can use for Redshift without ODBC? psycopg2 doesn't support fast_executemany.Autoerotic
@Boykin - Since this is the most popular answer by far, please consider updating it to mention that SQLAlchemy 1.3.0, released 2019-03-04, now supports engine = create_engine(sqlalchemy_url, fast_executemany=True) for the mssql+pyodbc dialect. I.e., it is no longer necessary to define a function and use @event.listens_for(engine, 'before_cursor_execute'). Thanks.Weariless
Thanks Gord Thompson for the update! I have set your comment to the top and also made a community wiki article out of my post for future updates.Boykin
I tried to set fast_executemany=True but got a memory error. I am trying to insert into an Azure SQL Data Warehouse. When I set chunksize=1000, it was very slow.Barbarian
#Respect This is the only solution that works for me, speeding up pandas' to_sql() writes to SQL Server. After thousands of tests with many drivers and configurations... Thank you very much!!!Fairway

After contacting the developers of SQLAlchemy, a way to solve this problem has emerged. Many thanks to them for the great work!

One has to use a cursor execution event and check if the executemany flag has been raised. If that is indeed the case, switch the fast_executemany option on. For example:

from sqlalchemy import event

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

More information on execution events can be found here.


UPDATE: Support for pyodbc's fast_executemany was added in SQLAlchemy 1.3.0, so this hack is no longer necessary.

Mansour answered 2/1, 2018 at 17:10 Comment(11)
Thanks so much for doing the legwork on this. Just for clarity's sake, should this decorator and function be declared before instantiating a SQLAlchemy engine?Marcello
You're most welcome. I declare it right after instantiating the engine in the constructor of a class.Mansour
so this removes the need for the pyodbc specific connection code? just need to call to_sql() after this function?Pimbley
i tried just calling to_sql directly after the function, but it didn't speed anything upPimbley
@CameronTaylor I am curious about this myself. I assumed this answer was related to the .to_sql() method too.Marcello
@Marcello I think it is, otherwise sqlalchemy wouldn't be used at all. Just not sure how to piece it togetherPimbley
@CameronTaylor this should indeed allow you to use only SQLA or related pandas methods. After declaring this listener, I just call to_sql() and observe the performance that is close to using pyODBC alone, but with all of the benefits of the engine and ORM of SQLA. What version of pyODBC are you using? Versions older than 4.0.19 do not have this feature.Mansour
Just wanted to let you all know that I have put together an example using the new turbodbc library which should be very fast and manage memory consumption better.Marcello
I tried the event-based solution above and got the following error: sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('HY010', '[HY010] [Microsoft][ODBC Driver 11 for SQL Server]Function sequence error (0) (SQLParamData)') [SQL:...... where the ..... is the insert statement generated by the pandas to_sql(). I'm using Python 3.7.0, Windows 7, and SQL Server 12.0.5538.0. Accessing via a named ODBC connection. Any help appreciated - thanks!Godwin
Hey! Sorry, I might not be the person to help as I'm a casual user of SQL-related libraries... What version of pyodbc are you using? I've found some discussions on the topic here github.com/mkleehammer/pyodbc/issues/371 and here #49888217Mansour
@Mansour - Please consider updating your answer to mention that SQLAlchemy 1.3.0, released 2019-03-04, now supports engine = create_engine(sqlalchemy_url, fast_executemany=True) for the mssql+pyodbc dialect. I.e., it is no longer necessary to define a function and use @event.listens_for(engine, 'before_cursor_execute'). Thanks.Weariless

I ran into the same problem, but using PostgreSQL. They have now just released pandas version 0.24.0, and there is a new parameter in the to_sql function called method, which solved my problem.

from sqlalchemy import create_engine

engine = create_engine(your_options)
data_frame.to_sql(table_name, engine, method="multi")

Upload speed is 100x faster for me. I also recommend setting the chunksize parameter if you are going to send lots of data.
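
For example, a minimal sketch combining both parameters, reusing the engine from above (1000 is just an example chunk size):

# method="multi" packs many rows into each INSERT statement;
# chunksize bounds how many rows go into a single statement
data_frame.to_sql(table_name, engine, method="multi", chunksize=1000)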

Lateritious answered 1/2, 2019 at 9:52 Comment(1)
According to pandas.pydata.org/pandas-docs/stable/user_guide/…, setting method='multi' is likely to slow down insertions on traditional RDBMS's when loading into tables with many columns, but tends to be more useful for environments like Redshift, when dealing with wide tables.Apfelstadt

I just wanted to post this full example as an additional, high-performance option for those who can use the new turbodbc library: http://turbodbc.readthedocs.io/en/latest/

There clearly are many options in flux between pandas .to_sql(), triggering fast_executemany through sqlalchemy, using pyodbc directly with tuples/lists/etc., or even trying BULK UPLOAD with flat files.

Hopefully, the following will make life a bit more pleasant as functionality evolves in the current pandas project, or if something like turbodbc integration is included in the future.

import pandas as pd
import numpy as np
from turbodbc import connect, make_options
from io import StringIO

test_data = '''id,transaction_dt,units,measures
               1,2018-01-01,4,30.5
               1,2018-01-03,4,26.3
               2,2018-01-01,3,12.7
               2,2018-01-03,3,8.8'''

df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])

options = make_options(parameter_sets_to_buffer=1000)
conn = connect(driver='{SQL Server}', server='server_nm', database='db_nm', turbodbc_options=options)

test_query = '''DROP TABLE IF EXISTS [db_name].[schema].[test]

                CREATE TABLE [db_name].[schema].[test]
                (
                    id int NULL,
                    transaction_dt datetime NULL,
                    units int NULL,
                    measures float NULL
                )

                INSERT INTO [db_name].[schema].[test] (id,transaction_dt,units,measures)
                VALUES (?,?,?,?) '''

cursor = conn.cursor()
cursor.executemanycolumns(test_query, [df_test['id'].values, df_test['transaction_dt'].values, df_test['units'].values, df_test['measures'].values])
conn.commit()

turbodbc should be VERY fast in many use cases (particularly with numpy arrays). Please observe how straightforward it is to pass the underlying numpy arrays from the dataframe columns as parameters to the query directly. I also believe this helps prevent the creation of intermediate objects that spike memory consumption excessively. Hope this is helpful!

Marcello answered 21/2, 2018 at 19:30 Comment(6)
gonna try this in the next few days then I'll come back with my findingsYellowgreen
@Yellowgreen did this turnout to be helpful for you? it would be great to hear your findings hereMarcello
Hi Pylander! I haven't had the time to try it yet; pretty busy here. For now I'm using a tool from the company to ingest data, but I need this badly for the next projects, for ingesting massive data into SQL Server. The major downside I see is that my dfs have 240 columns each. When using pd.to_sql, I don't need to worry about every column. Then again, pd.to_sql is really slow, to the point of being prohibitive. Using turbodbc may be my solution, but having to manually type every one of those 240 columns seems suboptimal to me (as there are a lot of different dfs to be ingested).Yellowgreen
I got it working: so cool! made me so excited that I wrote a 'blog' on it on my github: githubYellowgreen
@Yellowgreen That's great! I'm glad you found it worthwhile for your needs in the end and thanks for the linking your nice demo post. It should help promote this answer and raise the profile of the turbodbc project to folks looking for solutions too.Marcello
Can this work with pandas to_sql?Swordtail

It seems that pandas 0.23.0 and 0.24.0 use multi-values inserts with PyODBC, which prevents fast executemany from helping: a single INSERT ... VALUES ... statement is emitted per chunk. The multi-values insert chunks are an improvement over the old, slow executemany default, but at least in simple tests the fast executemany method still prevails, not to mention that it requires no manual chunksize calculations, as multi-values inserts do. Forcing the old behaviour can be done by monkeypatching, if no configuration option is provided in the future:

import pandas.io.sql

def insert_statement(self, data, conn):
    return self.table.insert(), data

pandas.io.sql.SQLTable.insert_statement = insert_statement

The future is here: at least in the master branch, the insert method can be controlled using the keyword argument method= of to_sql(). It defaults to None, which forces the executemany approach. Passing method='multi' results in using the multi-values insert. It can even be used to implement DBMS-specific approaches, such as PostgreSQL COPY.
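
For example, a sketch of such a DBMS-specific callable, in the style of the PostgreSQL COPY recipe from the pandas documentation (this assumes a psycopg2-backed engine; the function name is only illustrative):

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    """to_sql method callable that streams rows to PostgreSQL via COPY."""
    # pandas hands us a SQLAlchemy connection; grab the raw psycopg2 connection
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

# usage: df.to_sql('some_table', engine, method=psql_insert_copy)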

Freytag answered 23/8, 2018 at 17:59 Comment(3)
The pandas developers went back and forth on this issue for a while, but eventually they seemed to back away from the multi-row insert approach, at least for a mssql+pyodbc SQLAlchemy engine. pandas 0.23.4 does indeed let fast_executemany do its thing.Weariless
Haven't checked what is the current situation, but it was put back in in version 0.24.0. Edit: it's still there at least in master branch, but it is controllable now: github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1157. Seems like passing to_sql(..., method=None) should force the executemany approach.Marquetry
...And None is the default.Marquetry

As pointed out by @Pylander

Turbodbc is the best choice for data ingestion, by far!

I got so excited about it that I wrote a 'blog' on it on my github and medium: please check https://medium.com/@erickfis/etl-process-with-turbodbc-1d19ed71510e

for a working example and comparison with pandas.to_sql

Long story short,

with turbodbc I've got 10000 lines (77 columns) in 3 seconds

with pandas.to_sql I've got the same 10000 lines (77 columns) in 198 seconds...

And here is what I'm doing in full detail

The imports:

import sqlalchemy
import pandas as pd
import numpy as np
import turbodbc
import time

Load and treat some data - Substitute my sample.pkl for yours:

df = pd.read_pickle('sample.pkl')

df.columns = df.columns.str.strip()  # remove white spaces around column names
df = df.applymap(str.strip) # remove white spaces around values
df = df.replace('', np.nan)  # map nans, to drop NAs rows and columns later
df = df.dropna(how='all', axis=0)  # remove rows containing only NAs
df = df.dropna(how='all', axis=1)  # remove columns containing only NAs
df = df.replace(np.nan, 'NA')  # turbodbc hates null values...

Create the table using sqlAlchemy

Unfortunately, turbodbc requires a lot of overhead, with a lot of manual SQL labor, both for creating the tables and for inserting data into them.

Fortunately, Python is pure joy and we can automate this process of writing sql code.

The first step is creating the table which will receive our data. However, creating the table by writing SQL code manually can be problematic if your table has more than a few columns. In my case, very often the tables have 240 columns!

This is where sqlAlchemy and pandas still can help us: pandas is bad for writing a large number of rows (10000 in this example), but what about just 6 rows, the head of the table? This way, we automate the process of creating the tables.

Create sqlAlchemy connection:

mydb = 'someDB'

def make_con(db):
    """Connect to a specified db."""
    database_connection = sqlalchemy.create_engine(
        'mssql+pymssql://{0}:{1}@{2}/{3}'.format(
            myuser, mypassword,
            myhost, db
            )
        )
    return database_connection

pd_connection = make_con(mydb)

Create table on SQL Server

Using pandas + sqlAlchemy, but just for preparing room for turbodbc as previously mentioned. Please note the df.head() here: we are using pandas + sqlAlchemy to insert only 6 rows of our data. This runs pretty fast and is done to automate the table creation.

table = 'testing'
df.head().to_sql(table, con=pd_connection, index=False)

Now that the table is already in place, let’s get serious here.

Turbodbc connection:

def turbo_conn(mydb):
    """Connect to a specified db - turbo."""
    database_connection = turbodbc.connect(
                                            driver='ODBC Driver 17 for SQL Server',
                                            server=myhost,
                                            database=mydb,
                                            uid=myuser,
                                            pwd=mypassword
                                        )
    return database_connection

Preparing the SQL commands and data for turbodbc. Let's automate this code creation by being creative:

def turbo_write(mydb, df, table):
    """Use turbodbc to insert data into sql."""
    start = time.time()
    # open a turbodbc connection using the helper defined above
    connection = turbo_conn(mydb)

    # preparing columns
    colunas = '('
    colunas += ', '.join(df.columns)
    colunas += ')'

    # preparing value place holders
    val_place_holder = ['?' for col in df.columns]
    sql_val = '('
    sql_val += ', '.join(val_place_holder)
    sql_val += ')'

    # writing sql query for turbodbc
    sql = f"""
    INSERT INTO {mydb}.dbo.{table} {colunas}
    VALUES {sql_val}
    """

    # writing array of values for turbodbc
    valores_df = [df[col].values for col in df.columns]

    # cleans the previous head insert
    with connection.cursor() as cursor:
        cursor.execute(f"delete from {mydb}.dbo.{table}")
        connection.commit()

    # inserts data, for real
    with connection.cursor() as cursor:
        try:
            cursor.executemanycolumns(sql, valores_df)
            connection.commit()
        except Exception:
            connection.rollback()
            print('something went wrong')

    stop = time.time() - start
    return print(f'finished in {stop} seconds')

Writing data using turbodbc - I’ve got 10000 lines (77 columns) in 3 seconds:

turbo_write(mydb, df.sample(10000), table)

Pandas method comparison - I’ve got the same 10000 lines (77 columns) in 198 seconds…

table = 'pd_testing'

def pandas_comparison(df, table):
    """Load data using pandas."""
    start = time.time()
    df.to_sql(table, con=pd_connection, index=False)
    stop = time.time() - start
    return print(f'finished in {stop} seconds')

pandas_comparison(df.sample(10000), table)

Environment and conditions

Python 3.6.7 :: Anaconda, Inc.
TURBODBC version ‘3.0.0’
sqlAlchemy version ‘1.2.12’
pandas version ‘0.23.4’
Microsoft SQL Server 2014
user with bulk operations privileges

Please check https://erickfis.github.io/loose-code/ for updates in this code!

Yellowgreen answered 7/11, 2018 at 19:2 Comment(1)
I also found pandas slow, but for one project I solved it using a different approach. I had the data in multiple files (13 columns), but 1 million rows in total. Instead I used MySQL INFILE with the files stored locally, calling it from Python using threading. I was able to import 1 million rows in ~20 seconds.Folsom

SQL Server INSERT performance: pyodbc vs. turbodbc

When using to_sql to upload a pandas DataFrame to SQL Server, turbodbc will definitely be faster than pyodbc without fast_executemany. However, with fast_executemany enabled for pyodbc, both approaches yield essentially the same performance.

Test environments:

[venv1_pyodbc]
pyodbc 4.0.25

[venv2_turbodbc]
turbodbc 3.0.0
sqlalchemy-turbodbc 0.1.0

[common to both]
Python 3.6.4 64-bit on Windows
SQLAlchemy 1.3.0b1
pandas 0.23.4
numpy 1.15.4

Test code:

# for pyodbc
engine = create_engine('mssql+pyodbc://sa:whatever@SQL_panorama', fast_executemany=True)
# for turbodbc
# engine = create_engine('mssql+turbodbc://sa:whatever@SQL_panorama')

# test data
num_rows = 10000
num_cols = 100
df = pd.DataFrame(
    [[f'row{x:04}col{y:03}' for y in range(num_cols)] for x in range(num_rows)],
    columns=[f'col{y:03}' for y in range(num_cols)]
)

t0 = time.time()
df.to_sql("sqlalchemy_test", engine, if_exists='replace', index=None)
print(f"pandas wrote {num_rows} rows in {(time.time() - t0):0.1f} seconds")

Tests were run twelve (12) times for each environment, discarding the single best and worst times for each. Results (in seconds):

   rank  pyodbc  turbodbc
   ----  ------  --------
      1    22.8      27.5
      2    23.4      28.1
      3    24.6      28.2
      4    25.2      28.5
      5    25.7      29.3
      6    26.9      29.9
      7    27.0      31.4
      8    30.1      32.1
      9    33.6      32.5
     10    39.8      32.9
   ----  ------  --------
average    27.9      30.0
Weariless answered 1/1, 2019 at 20:29 Comment(0)

Just wanted to add to @J.K.'s answer.

If you are using this approach:

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

And you are getting this error:

"sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('HY010', '[HY010] [Microsoft][SQL Server Native Client 11.0]Function sequence error (0) (SQLParamData)') [SQL: 'INSERT INTO ... (...) VALUES (?, ?)'] [parameters: ((..., ...), (..., ...)] (Background on this error at: http://sqlalche.me/e/dbapi)"

Encode your string values like this: 'yourStringValue'.encode('ascii')

This will solve your problem.
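
For a whole DataFrame, a minimal sketch of that encoding step might look like this (assuming your object/string columns contain ASCII-only text; df, table_name and engine are placeholders):

# encode every string (object) column to ASCII bytes before calling to_sql
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.encode('ascii')

df.to_sql(table_name, engine, index=False)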

Adrianeadrianna answered 22/2, 2019 at 1:36 Comment(0)

I just modified the engine line, which helped me speed up the insertion 100 times.

Old Code -

import json
import maya
import time
import pandas
import pyodbc
import pandas as pd
from sqlalchemy import create_engine

retry_count = 0
retry_flag = True

df = pandas.read_excel('test.xlsx', sheet_name='test')
print("Read Ok")

engine = create_engine("mssql+pyodbc://server_name/db_name?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server")

while retry_flag and retry_count < 5:
  try:
    df.to_sql("table_name",con=engine,if_exists="replace",index=False,chunksize=5000,schema="dbo")
    retry_flag = False
  except:
    retry_count = retry_count + 1
    time.sleep(30)

Modified engine line -

From -

engine = create_engine("mssql+pyodbc://server_name/db_name?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server")

to -

engine = create_engine("mssql+pyodbc://server_name/db_name?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server", fast_executemany=True)

Ask me any query related to Python-to-SQL connectivity; I will be happy to help you.

Pytlik answered 30/3, 2020 at 6:48 Comment(0)
