How to get autoincrement values for a column after uploading a Pandas dataframe to a MySQL database

I have a Pandas DataFrame (called df), which I would like to upload to a MySQL database. The dataframe has columns [A, B, C] and the table in the database has columns [ID, A, B, C]. The ID column in the database is the auto-incrementing primary key.

I can upload the dataframe to the database using the df.to_sql('table_name', engine) command. However, this does not give me any information about the values that the database assigned to the ID column of the incoming data. The only way I have of getting this information is by querying the database using the values for columns A, B, C:

select 
ID, A, B, C 
from db_table 
where (A, B, C) in ((x1, y1, z1), (x2, y2, z2), ...) 

However, this query takes a very long time when I am inserting a lot of data.

Is there a simpler and quicker way of getting the values that the database assigned to the ID column of the incoming data?

Edit 1: I can assign the ID column myself, as per user3364098's answer below. However, my job is part of a pipeline that is run in parallel. If I assign the ID column myself, there is a chance that I may assign the same ID values to different dataframes that are uploaded at the same time. This is why I would like to delegate the ID assignment to the database.
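
To make the collision concrete, here is a minimal sketch in plain Python with no database involved. The helper next_ids and the value 100 are hypothetical; the sketch only mimics two parallel workers that both read the same max(ID) before either of them has inserted anything.

```python
def next_ids(current_max, n_rows):
    """Return the n_rows consecutive IDs that follow current_max."""
    return list(range(current_max + 1, current_max + n_rows + 1))

# Assumption: both workers query max(ID) before either one has written rows,
# so they both see the same value.
max_seen_by_worker_1 = 100
max_seen_by_worker_2 = 100

ids_1 = next_ids(max_seen_by_worker_1, 3)  # IDs for a 3-row dataframe
ids_2 = next_ids(max_seen_by_worker_2, 2)  # IDs for a 2-row dataframe

# IDs that both workers would try to insert; each one would violate
# the primary key on the second insert.
collisions = sorted(set(ids_1) & set(ids_2))
print(collisions)
```

Any non-empty overlap means the second upload would fail or clash, which is why the read of max(ID) and the insert have to happen as one protected step.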

Solution: I ended up assigning the ID column myself, and issuing a lock on the table while uploading the data in order to guarantee that no other process uploads data with the same id value. Basically:

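try:
    # Take a write lock so no other process can insert while IDs are assigned.
    engine.execute('lock tables `table_name` write')
    # ifnull() makes the query return 0 when the table is still empty.
    max_id_query = 'select ifnull(max(ID), 0) FROM `table_name`'
    max_id = int(pd.read_sql_query(max_id_query, engine).iloc[0, 0])
    # Give the new rows the next len(df) consecutive IDs.
    df['ID'] = range(max_id + 1, max_id + len(df) + 1)
    df.to_sql('table_name', engine, if_exists='append', index=False)
finally:
    # Always release the lock, even if the upload fails.
    engine.execute('unlock tables')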
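
The same read-max-then-insert pattern can be tried locally without a MySQL server. The sketch below is only an analogy: it uses Python's built-in sqlite3 module, with SQLite's BEGIN IMMEDIATE standing in for MySQL's lock tables ... write, and plain tuples standing in for the dataframe rows. The function name insert_with_ids is hypothetical.

```python
import sqlite3

def insert_with_ids(conn, rows):
    """Insert (A, B, C) rows and return the IDs assigned to them."""
    cur = conn.cursor()
    # BEGIN IMMEDIATE takes SQLite's write lock up front, so no other
    # connection can write between reading max(ID) and inserting.
    cur.execute("BEGIN IMMEDIATE")
    try:
        cur.execute("SELECT IFNULL(MAX(ID), 0) FROM db_table")
        max_id = cur.fetchone()[0]
        ids = list(range(max_id + 1, max_id + len(rows) + 1))
        cur.executemany(
            "INSERT INTO db_table (ID, A, B, C) VALUES (?, ?, ?, ?)",
            [(new_id, *row) for new_id, row in zip(ids, rows)],
        )
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    return ids

# isolation_level=None: autocommit mode, transactions are managed by hand.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE db_table (ID INTEGER PRIMARY KEY, A, B, C)")

first = insert_with_ids(conn, [(1, 2, 3), (4, 5, 6)])
second = insert_with_ids(conn, [(7, 8, 9)])
print(first, second)
```

The caller gets the assigned IDs back directly, so no follow-up query on (A, B, C) is needed. Locking behavior in MySQL differs in detail, so treat this as an illustration of the idea rather than a drop-in replacement.
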
Tyson answered 6/11, 2014 at 1:55 Comment(0)
A
11

You can assign the IDs yourself:

import pandas as pd
df['ID'] = pd.read_sql_query('select ifnull(max(id),0)+1 from db_table',cnx).iloc[0,0]+range(len(df))

Here cnx is your database connection. After that, upload df as usual.

Anticosti answered 6/11, 2014 at 13:8 Comment(3)
If your database is SQL Server, use isnull instead of ifnull: w3schools.com/sql/sql_isnull.asp - Boccherini
If 'ID' is not a column you want to insert, you can assign the right-hand side of the above answer to a variable first and only then assign it to a pandas column. I did this on Postgres: ids = pd.read_sql_query('select COALESCE(max(id),0)+1 from poi', conn).iloc[0,0]+range(len(pois)) # pois['id'] = ids - Litho
For PostgreSQL: pd.read_sql_query('select COALESCE(max(id), 0)+1 as max_id from table_name', dbConnection) - Damnatory
A
-1
import pandas as pd
df['ID'] = pd.read_sql_query('select MAX(ID)+1 from db_table',cnx).iloc[0,0] + range(len(df))
Anatomist answered 8/8, 2016 at 8:4 Comment(1)
Could you explain what your code does, and why it is the solution to OP's problem? - Jubilation
