As pointed out by @Pylander
Turbodbc is the best choice for data ingestion, by far!
I got so excited about it that I wrote a 'blog' on it on my github and medium:
please check https://medium.com/@erickfis/etl-process-with-turbodbc-1d19ed71510e
for a working example and comparison with pandas.to_sql
Long story short,
with turbodbc
I've got 10000 lines (77 columns) in 3 seconds
with pandas.to_sql
I've got the same 10000 lines (77 columns) in 198 seconds...
And here is what I'm doing in full detail
The imports:
import sqlalchemy
import pandas as pd
import numpy as np
import turbodbc
import time
Load and treat some data - Substitute my sample.pkl for yours:
df = pd.read_pickle('sample.pkl')
df.columns = df.columns.str.strip() # remove white spaces around column names
df = df.applymap(str.strip) # remove white spaces around values
df = df.replace('', np.nan) # map nans, to drop NAs rows and columns later
df = df.dropna(how='all', axis=0) # remove rows containing only NAs
df = df.dropna(how='all', axis=1) # remove columns containing only NAs
df = df.replace(np.nan, 'NA') # turbodbc hates null values...
Create the table using sqlAlchemy
Unfortunately, turbodbc requires a lot of overhead with a lot of sql manual labor, for creating the tables and for inserting data on it.
Fortunately, Python is pure joy and we can automate this process of writing sql code.
The first step is creating the table which will receive our data. However, creating the table manually writing sql code can be problematic if your table has more than a few columns. In my case, very often the tables have 240 columns!
This is where sqlAlchemy and pandas still can help us: pandas is bad for writing a large number of rows (10000 in this example), but what about just 6 rows, the head of the table? This way, we automate the process of creating the tables.
Create sqlAlchemy connection:
mydb = 'someDB'
def make_con(db):
"""Connect to a specified db."""
database_connection = sqlalchemy.create_engine(
'mssql+pymssql://{0}:{1}@{2}/{3}'.format(
myuser, mypassword,
myhost, db
)
)
return database_connection
pd_connection = make_con(mydb)
Create table on SQL Server
Using pandas + sqlAlchemy, but just for preparing room for turbodbc as previously mentioned. Please note that df.head() here: we are using pandas + sqlAlchemy for inserting only 6 rows of our data. This will run pretty fast and is being done to automate the table creation.
table = 'testing'
df.head().to_sql(table, con=pd_connection, index=False)
Now that the table is already in place, let’s get serious here.
Turbodbc connection:
def turbo_conn(mydb):
"""Connect to a specified db - turbo."""
database_connection = turbodbc.connect(
driver='ODBC Driver 17 for SQL Server',
server=myhost,
database=mydb,
uid=myuser,
pwd=mypassword
)
return database_connection
Preparing sql comands and data for turbodbc. Let’s automate this code creation being creative:
def turbo_write(mydb, df, table):
"""Use turbodbc to insert data into sql."""
start = time.time()
# preparing columns
colunas = '('
colunas += ', '.join(df.columns)
colunas += ')'
# preparing value place holders
val_place_holder = ['?' for col in df.columns]
sql_val = '('
sql_val += ', '.join(val_place_holder)
sql_val += ')'
# writing sql query for turbodbc
sql = f"""
INSERT INTO {mydb}.dbo.{table} {colunas}
VALUES {sql_val}
"""
# writing array of values for turbodbc
valores_df = [df[col].values for col in df.columns]
# cleans the previous head insert
with connection.cursor() as cursor:
cursor.execute(f"delete from {mydb}.dbo.{table}")
connection.commit()
# inserts data, for real
with connection.cursor() as cursor:
try:
cursor.executemanycolumns(sql, valores_df)
connection.commit()
except Exception:
connection.rollback()
print('something went wrong')
stop = time.time() - start
return print(f'finished in {stop} seconds')
Writing data using turbodbc - I’ve got 10000 lines (77 columns) in 3 seconds:
turbo_write(mydb, df.sample(10000), table)
Pandas method comparison - I’ve got the same 10000 lines (77 columns) in 198 seconds…
table = 'pd_testing'
def pandas_comparisson(df, table):
"""Load data using pandas."""
start = time.time()
df.to_sql(table, con=pd_connection, index=False)
stop = time.time() - start
return print(f'finished in {stop} seconds')
pandas_comparisson(df.sample(10000), table)
Environment and conditions
Python 3.6.7 :: Anaconda, Inc.
TURBODBC version ‘3.0.0’
sqlAlchemy version ‘1.2.12’
pandas version ‘0.23.4’
Microsoft SQL Server 2014
user with bulk operations privileges
Please check https://erickfis.github.io/loose-code/ for updates in this code!