Why would I get a memory error with fast_executemany on a tiny df?

I was looking for ways to speed up pushing a dataframe to SQL Server and stumbled upon an approach here. It blew me away in terms of speed: plain to_sql took almost 2 hours, while this script finished in 12.54 seconds to push a 100k row x 100 column df.

So after testing the code below with a sample df, I tried a df with many different datatypes (int, string, float, Boolean). However, I was sad to see a memory error. I started reducing the size of my df to see what the limitations were, and noticed that if my df contained any strings then I couldn't load it to SQL Server. I am having trouble isolating the issue further. The script below is taken from the question in the link, except that I added a tiny df with strings. Any suggestions on how to rectify this issue would be great!

import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus
import pyodbc

conn =  "DRIVER={SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)


@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
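    # Before each statement runs, switch on pyodbc's fast_executemany for
    # executemany calls (the bulk INSERT issued by to_sql)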
    print("FUNC call")
    if executemany:
        cursor.fast_executemany = True


table_name = 'fast_executemany_test'
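# A tiny all-string df is enough to reproduce the MemoryError described above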
df1 = pd.DataFrame({'col1':['tyrefdg','ertyreg','efdgfdg'],
                   'col2':['tydfggfdgrefdg','erdfgfdgfdgfdgtyreg','edfgfdgdfgdffdgfdg']
                   })



s = time.time()
df1.to_sql(table_name, engine, if_exists = 'replace', chunksize = None)
print(time.time() - s)
Carpeting answered 17/4, 2018 at 21:24 Comment(0)

I was able to reproduce your issue using pyodbc 4.0.23. The MemoryError was related to your use of the ancient

DRIVER={SQL Server}

Further testing using

DRIVER=ODBC Driver 11 for SQL Server

also failed, with

Function sequence error (0) (SQLParamData)

which was related to an existing pyodbc issue on GitHub. I posted my findings here.

That issue is still under investigation. In the meantime you might be able to proceed by

  • using a newer ODBC driver like DRIVER=ODBC Driver 13 for SQL Server, and
  • running pip install pyodbc==4.0.22 to use an earlier version of pyodbc (a combined sketch follows below).
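
For reference, a minimal sketch of that combination, reusing the placeholder credentials from the question (SERVER_IP, DB_NAME, USER_ID, PWD); the driver name has to match one that is actually installed on your machine, and the before_cursor_execute listener from the question stays the same:

# after: pip install pyodbc==4.0.22
from urllib.parse import quote_plus
from sqlalchemy import create_engine

conn = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quote_plus(conn)))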
Terr answered 20/4, 2018 at 1:3 Comment(2)
Thanks! pyodbc 4.0.22 didn't work for me but pyodbc 4.0.19 worked fine. Also, I noticed that the script throws an error when adding datatypes to the to_sql statement, so I just load the data to SQL Server and change the datatypes there, in case anyone is getting that same error.Carpeting
Thanks @MartinBobak. It's such a strange error! I had to fiddle around with the metadata and it worked, but it was very annoying. I also found the performance boost amazing.Krummhorn

I was facing this issue too.

Solved by changing:

conn =  "DRIVER={SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"

To this:

conn =  "DRIVER={ODBC Driver 17 for SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
Leveller answered 16/9, 2021 at 16:41 Comment(0)

I have bumped into this exact issue a bunch of times, and I'm not seeing the answer that always solves it for me, so here we go. It turns out that passing the variable types is very important when using fast_executemany=True.

In particular, you will run into a memory error with string fields if you do not pass a maximum number of characters for them to to_sql. What seems to be going on is that without a maximum length, sqlalchemy devotes as much memory as it can to each row in the column, so with a sizeable amount of data you quickly run out of memory on your machine. The fix is to pass a dictionary to the dtype parameter.

Here is a snippet to give you an idea of what I'm talking about:

import sqlalchemy
from sqlalchemy import create_engine

conn = 'mssql+pyodbc://server/db?driver=SQL+SERVER'
engine = create_engine(conn, fast_executemany=True)
typing = {
    'col1': sqlalchemy.types.String(5),
    'col2': sqlalchemy.types.String(5),
}
df.to_sql('Table', engine, dtype=typing)

I've also found that for large database loads passing a chunksize parameter helps, because otherwise my server times out, but that is a problem for another answer, on another day.
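
For what it's worth, a one-liner sketch of that, reusing the typing dictionary above; the 1,000-row chunk size is an arbitrary choice:

df.to_sql('Table', engine, dtype=typing, chunksize=1000)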

Mat answered 10/1, 2023 at 0:2 Comment(2)
This helped me find my solution. Instead of using text as the datatype for one of the columns, I used varchar(200). Of course, this required deleting the table and creating a new one, which I was able to do.Cariotta
Glad this helped out. Sucks that you had to delete the whole table instead of dropping just the offending columns.Mat

I experienced this issue with 32-bit Python; switching my interpreter to 64-bit resolved the memory error. Beyond that, I would recommend chunking the data you process: establish a threshold, process a chunk of data once that threshold is met, and iterate until all of the data has been processed, as sketched below.
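
A minimal chunking sketch under these assumptions: df is a pandas DataFrame, engine is an existing SQLAlchemy engine, the target table name is hypothetical, and the 10,000-row threshold is arbitrary:

chunk_rows = 10_000  # threshold: rows pushed per iteration
for start in range(0, len(df), chunk_rows):
    chunk = df.iloc[start:start + chunk_rows]
    # append each slice so previously loaded chunks are kept
    chunk.to_sql('my_table', engine, if_exists='append', index=False)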

Adur answered 20/4, 2021 at 1:17 Comment(0)
