How to transform a pandas DataFrame for insertion via an executemany() statement?
I have a fairly big pandas DataFrame - 50 or so headers and a few hundred thousand rows of data - and I'm looking to transfer this data to a database using the ceODBC module. Previously I was using pyodbc with a simple execute statement in a for loop, but this was taking ridiculously long (about 1,000 records per 10 minutes)...

I'm now trying a new module and want to use executemany(), although I'm not quite sure what is meant by the sequence of parameters in:

    cursor.executemany("""insert into table.name(a, b, c, d, e, f)
                          values(?, ?, ?, ?, ?, ?)""", sequence_of_parameters)

Should it look like a single flat list working through each value in turn, like

    ['asdas', '1', '2014-12-01', 'true', 'asdasd', 'asdas', '2', 
'2014-12-02', 'true', 'asfasd', 'asdfs', '3', '2014-12-03', 'false', 'asdasd']
  • where this is an example of three rows

or what is the format that's needed?

As a related question, how can I go about converting a regular pandas DataFrame to this format?

Thanks!

Gloze answered 29/4, 2015 at 8:44 Comment(3)
How are the dfs related? Are they all subsets of the same larger df? You could merge/concat them all and write to the DB in one go, perhaps. – Marieann
Hi Ed. It's all one singular dataframe; I assumed they had to be split into subsets of all the different headers to work within the query... so 'a', 'b', 'c' and so on are there to signify different column headers for the dataframe. – Marseilles
Did you try pymssql? (I don't have experience of whether it is faster than pyodbc.) With that you can just use the to_sql pandas method (the driver has to be supported by sqlalchemy to be able to use this with pandas). – Reading

I managed to figure this out in the end. If you have a pandas DataFrame that you want to write to a database using ceODBC (the module I used), the code is:

(with all_data as the DataFrame) Map the DataFrame values to strings and store each row as a tuple in a list of tuples:

# convert every column to string and strip whitespace
for r in all_data.columns.values:
    all_data[r] = all_data[r].map(str)
    all_data[r] = all_data[r].map(str.strip)
tuples = [tuple(x) for x in all_data.values]
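Replying to the map(str) concern in the comments: a more direct variant (a sketch of my own, not the code above) converts the whole frame in one shot and builds the tuples with itertuples. The toy frame and column names here are purely illustrative:

```python
import pandas as pd

# toy frame standing in for all_data
df = pd.DataFrame({'a': [1, 2], 'b': ['x ', ' y']})

# one-shot string conversion plus strip, instead of a per-column loop
clean = df.astype(str).apply(lambda s: s.str.strip())

# itertuples(name=None) already yields plain tuples in column order
tuples = list(clean.itertuples(index=False, name=None))
# → [('1', 'x'), ('2', 'y')]
```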

For the list of tuples, change all null-value signifiers - which were captured as strings in the conversion above - into None, which can be passed to the end database as a true null. This was an issue for me; it might not be for you.

# string representations of nulls produced by the str conversion above
string_list = ['NaT', 'nan', 'NaN', 'None']

def remove_wrong_nulls(x):
    # replace every occurrence of the signifier strings with None
    for r in range(len(x)):
        for i, e in enumerate(tuples):
            for j, k in enumerate(e):
                if k == x[r]:
                    temp = list(tuples[i])
                    temp[j] = None
                    tuples[i] = tuple(temp)

remove_wrong_nulls(string_list)
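The triple-nested loop works but scans every tuple once per signifier string. A vectorized sketch (my variant, not part of the original answer) swaps the nulls for real None values inside pandas before the tuples are built, assuming you can do it before any string conversion:

```python
import numpy as np
import pandas as pd

# toy frame with a NaN and a None standing in for the real data
df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})

# cast to object first so numeric columns can hold None, then
# replace every missing value with None in one pass
clean = df.astype(object).where(df.notna(), None)
records = list(clean.itertuples(index=False, name=None))
# → [(1.0, 'x'), (None, None)]
```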

Create a connection to the database:

cnxn=ceODBC.connect('DRIVER={SOMEODBCDRIVER};DBCName=XXXXXXXXXXX;UID=XXXXXXX;PWD=XXXXXXX;QUIETMODE=YES;', autocommit=False)
cursor = cnxn.cursor()

Define a function to turn the list of tuples into new_list, which further indexes the list of tuples into chunks of 1000. This was necessary for me because the SQL query passed to my database could not exceed 1 MB:

def chunks(l, n):
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

new_list = chunks(tuples, 1000)
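A quick sanity check of the chunking logic, with toy sizes instead of 1000 (the helper is repeated here so the snippet runs on its own):

```python
def chunks(l, n):
    # split l into consecutive slices of at most n items
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

batches = chunks(list(range(5)), 2)
# → [[0, 1], [2, 3], [4]]
```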

Define your query:

query = """insert into XXXXXXXXXXXX("XXXXXXXXXX", "XXXXXXXXX", "XXXXXXXXXXX") values(?,?,?)"""

Run through new_list, which holds the tuples in groups of 1000, and call executemany() on each group. Follow this by committing and closing the connection, and that's it :)

for batch in new_list:
    cursor.executemany(query, batch)
cnxn.commit()
cnxn.close()
Gloze answered 13/5, 2015 at 11:9 Comment(1)
map(str) seems a little heavy-handed; is there a more direct way? – Russel

You can try this:

cursor.executemany(sql_str, your_dataframe.values.tolist())

Hope it helps.
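To expand a little: df.values.tolist() turns each row into a list of values, which executemany() accepts as its sequence of parameters. A sketch (table and column names illustrative) that also builds sql_str so the placeholder count always matches the column count:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# one qmark placeholder per column
placeholders = ', '.join('?' * len(df.columns))
sql_str = 'insert into table.name({}) values({})'.format(
    ', '.join(df.columns), placeholders)
# → 'insert into table.name(a, b) values(?, ?)'

params = df.values.tolist()  # [[1, 'x'], [2, 'y']]
```

One caveat: df.values.tolist() passes missing values through as float('nan'), so depending on the driver you may still need the null handling described in the other answer.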

Rhymester answered 12/5, 2015 at 8:37 Comment(1)
Could you please elaborate your answer, adding a little more description about the solution you provide? – Hirudin

Might be a little late to answer this question, but maybe it can still help someone. executemany() is not implemented by many ODBC drivers. One of the ones that does have it is MySQL. When they refer to a sequence of parameters they mean:

parameters=[{'name':'Jorge', 'age':22, 'sex':'M'}, 
            {'name':'Karen', 'age':25, 'sex':'F'}, 
            {'name':'James', 'age':29, 'sex':'M'}]

and for a query statement it would look something like:

SQL = """INSERT IGNORE INTO WORKERS (NAME, AGE, SEX)
         VALUES (%(name)s, %(age)s, %(sex)s)"""

Which it looks like you got there. A couple of things I want to point out in case they help: pandas has a to_sql function that inserts into a database if you provide it the connector object, and it chunks the data as well.
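For reference, a minimal to_sql sketch; it uses sqlite3 here purely so the example is self-contained (for other databases you would normally pass an SQLAlchemy engine, and the table and column names are illustrative):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'name': ['Jorge', 'Karen'], 'age': [22, 25]})

conn = sqlite3.connect(':memory:')
# chunksize=... would batch the inserts for large frames
df.to_sql('workers', conn, index=False)

rows = conn.execute('select name, age from workers').fetchall()
# → [('Jorge', 22), ('Karen', 25)]
```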

To rapidly create a sequence of parameters from a pandas dataframe I found the following two methods helpful:

# creates list of dict, list of parameters
# REF: https://groups.google.com/forum/#!topic/pydata/qna3Z3WmVpM
parameters = [df.iloc[line, :].to_dict() for line in range(len(df))]

# Cleaner Way
parameters = df.to_dict(orient='records')
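To see the dict-shaped sequence of parameters in action without a MySQL server, here is a sketch against sqlite3, whose named-parameter style is :name rather than %(name)s (the table is created inline so the snippet runs on its own):

```python
import sqlite3

parameters = [{'name': 'Jorge', 'age': 22, 'sex': 'M'},
              {'name': 'Karen', 'age': 25, 'sex': 'F'},
              {'name': 'James', 'age': 29, 'sex': 'M'}]

conn = sqlite3.connect(':memory:')
conn.execute('create table workers (name text, age int, sex text)')

# one dict per row; keys are matched to the :name placeholders
conn.executemany('insert into workers values (:name, :age, :sex)',
                 parameters)

count = conn.execute('select count(*) from workers').fetchone()[0]
# → 3
```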
Deste answered 9/8, 2015 at 17:20 Comment(3)
Using your last method of creating the list of dicts causes the column names to be sorted alphabetically inside each dict. Any idea how to sort those manually? – Anett
You mean the keys in the dictionaries are sorted alphabetically? Dictionary keys aren't guaranteed to be ordered. If you want them ordered you'll need an OrderedDict: df.to_dict(orient='records', into=OrderedDict). – Deste
Hi, I need a little more explanation here. I have a list (the "parameters" list you define above); now I have to store these values in a MySQL table named "User". Can you show me how to store this into a table with columns name, age, and sex in Flask? – Brew

Generalizing a bit to ensure the same column order in the query and the dataframe:

columns = ','.join(df.columns)
values = ','.join([':{:d}'.format(i + 1) for i in range(len(df.columns))])

sql = 'INSERT INTO table.name({columns}) VALUES ({values})'
cursor.executemany(sql.format(columns=columns, values=values), df.values.tolist())
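As a sanity check of the generated statement (the :1, :2 numbered binds are the style used by drivers such as cx_Oracle; substitute ? or %s as your own driver requires):

```python
import pandas as pd

# empty frame: only the column names matter for building the SQL
df = pd.DataFrame(columns=['a', 'b', 'c'])

columns = ','.join(df.columns)
values = ','.join([':{:d}'.format(i + 1) for i in range(len(df.columns))])

sql = 'INSERT INTO table.name({columns}) VALUES ({values})'
sql = sql.format(columns=columns, values=values)
# → 'INSERT INTO table.name(a,b,c) VALUES (:1,:2,:3)'
```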
Byrle answered 19/2, 2021 at 2:25 Comment(0)
