I was looking for ways to speed up pushing a dataframe to SQL Server and stumbled upon an approach here. This approach blew me away in terms of speed. Using the normal to_sql took almost 2 hours, while this script finished in 12.54 seconds to push a 100k row x 100 column df.
After testing the code below with a sample df, I attempted to use a df with many different datatypes (ints, strings, floats, Booleans). However, I was sad to see a memory error. So I started reducing the size of my df to see what the limitations were, and I noticed that if my df had any strings then I wasn't able to load it to SQL Server. I am having trouble isolating the issue further. The script below is taken from the question in the link, but I added a tiny df with strings. Any suggestions on how to rectify this issue would be great!
import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus
import pyodbc

conn = "DRIVER={SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)

# Turn on pyodbc's fast_executemany for the bulk INSERTs issued by to_sql
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    print("FUNC call")
    if executemany:
        cursor.fast_executemany = True

table_name = 'fast_executemany_test'

# Tiny df with string columns -- this is the case that fails for me
df1 = pd.DataFrame({'col1': ['tyrefdg', 'ertyreg', 'efdgfdg'],
                    'col2': ['tydfggfdgrefdg', 'erdfgfdgfdgfdgtyreg', 'edfgfdgdfgdffdgfdg']})

s = time.time()
df1.to_sql(table_name, engine, if_exists='replace', chunksize=None)
print(time.time() - s)
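As a side note, one workaround I have seen suggested (I have not verified it on my setup) is to give each string column an explicit length through the dtype argument of to_sql, so the columns are not created as varchar(max), which is reported to interact badly with fast_executemany. The column names and lengths below are just the ones from this sample df:

from sqlalchemy.types import VARCHAR

# Suggested workaround (unverified): declare explicit lengths for the string
# columns so to_sql does not create them as varchar(max).
df1.to_sql(
    table_name,
    engine,
    if_exists='replace',
    chunksize=None,
    dtype={'col1': VARCHAR(255), 'col2': VARCHAR(255)},
)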
… to_sql statement. So I just load the data to SQL Server and change the datatypes there, in case anyone is getting that same error. – Carpeting
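For anyone following that comment, a minimal sketch of changing the datatypes afterwards from the same engine might look like this (the table and column names are just the ones from the example above, and the target types are assumptions):

from sqlalchemy import text

# Load the data first, then retype the string columns directly in SQL Server
with engine.begin() as connection:
    connection.execute(text(
        "ALTER TABLE fast_executemany_test ALTER COLUMN col1 VARCHAR(255)"
    ))
    connection.execute(text(
        "ALTER TABLE fast_executemany_test ALTER COLUMN col2 VARCHAR(255)"
    ))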