Pandas to_sql make index unique

I have been reading about pandas to_sql solutions for not adding duplicate records to a database. I am working with csv files of logs; each time I upload a new log file, I read the data and make some changes with pandas, creating a new dataframe. Then I execute to_sql('Logs', con=db.engine, if_exists='append', index=True). With the if_exists arg I make sure that each dataframe created from a new file is appended to the existing database. The problem is it keeps adding duplicate values. I want to make sure that if a file which has already been uploaded is uploaded again by mistake, it won't be appended to the database. I want to do this directly when creating the database, without resorting to a workaround like just checking whether the filename has been used before.

I am working with Flask-SQLAlchemy.

Thank you.

Zootechnics answered 2/9, 2017 at 18:11 Comment(4)
There's a recent discussion about adding upserts to Pandas here. TL;DR - currently considered out of scope for Pandas as it gets tricky to remain database-agnostic. (Replacing an entry with its duplicate is a kind of upsert.)Teetotalism
Is there a way not to replace the entry but simply ignore the dataframe when it is a duplicate? The log files are generated each month. Really I only care that a dataframe which has already been added to the database is not re-appended in case someone uploads the same file twice by mistake. I saw in another post that a possible solution was to use sqlite3.IntegrityError, but that didn't work for me.Zootechnics
For future readers: A solution I've used for a couple of years now, although slow, works nicely: iterate over the DataFrame (yes, I know ...) and try to insert each row using to_sql. In the except block, test whether '1062' is in the error output, as this indicates a duplicate (a sketch follows this comment thread).Convery
You can also let the database engine do its job of checking for uniqueness by creating the table and specifying the unique constraints in its schema before using pandas to_sql.Walkerwalkietalkie
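
A minimal sketch of the row-by-row approach from the comment above, assuming a MySQL backend (error code 1062 is MySQL's duplicate-entry error; the connection string and helper name are illustrative):

from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/test")  # illustrative DSN

def insert_skipping_duplicates(df, table):
    # insert one row at a time; rows that violate a unique constraint are skipped
    for _, row in df.iterrows():
        try:
            row.to_frame().T.to_sql(table, con=engine, if_exists='append', index=False)
        except Exception as e:
            if '1062' in str(e):  # MySQL's duplicate-entry error code
                continue          # duplicate row: skip it
            raise                 # anything else is a real failure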

Your best bet is to catch duplicates by setting up your index as a primary key, and then using try/except to catch uniqueness violations. You mentioned another post that suggested watching for IntegrityError exceptions, and I agree that's the best approach. You can combine that with a de-duplication function to make sure your table updates run smoothly.

Demonstrating the problem

Here's a toy example:

import sqlite3

import pandas as pd
from sqlalchemy import create_engine

# make a database, 'test', and a table, 'foo'.
conn = sqlite3.connect("test.db")
c = conn.cursor()
# id is a primary key.  this will be the index column imported from to_sql().
c.execute('CREATE TABLE foo (id integer PRIMARY KEY, foo integer NOT NULL);')
conn.commit()
# use the sqlalchemy engine.
engine = create_engine('sqlite:///test.db')

pd.read_sql("pragma table_info(foo)", con=engine)

   cid name     type  notnull dflt_value  pk
0    0   id  integer        0       None   1
1    1  foo  integer        1       None   0

Now, two example data frames, df and df2:

data = {'foo':[1,2,3]}
df = pd.DataFrame(data)
df
   foo
0    1
1    2
2    3

data2 = {'foo':[3,4,5]}
df2 = pd.DataFrame(data2, index=[2,3,4])
df2
   foo
2    3       # this row is a duplicate of df.iloc[2,:]
3    4
4    5

Move df into table foo:

df.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')

pd.read_sql('foo', con=engine)
   id  foo
0   0    1
1   1    2
2   2    3

Now, when we try to append df2, we catch the IntegrityError:

try:
    df2.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
# use the generic Exception, both IntegrityError and sqlite3.IntegrityError caused trouble.
except Exception as e: 
    print("FAILURE TO APPEND: {}".format(e))

Output:

FAILURE TO APPEND: (sqlite3.IntegrityError) UNIQUE constraint failed: foo.id [SQL: 'INSERT INTO foo (id, foo) VALUES (?, ?)'] [parameters: ((2, 3), (3, 4), (4, 5))]
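
As an aside, recent SQLAlchemy versions surface this as sqlalchemy.exc.IntegrityError, wrapping the driver's sqlite3.IntegrityError, so a narrower catch is possible. A sketch, under the assumption that your SQLAlchemy version does this wrapping (the generic Exception above is the conservative choice):

from sqlalchemy.exc import IntegrityError

try:
    df2.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
except IntegrityError as e:
    # SQLAlchemy wraps the DBAPI error, so the message still names sqlite3.IntegrityError
    print("FAILURE TO APPEND: {}".format(e))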

Suggested Solution

On IntegrityError, you can pull the existing table data, remove the duplicate entries from your new data, and then retry the append. Use apply() for this (DataFrame.apply calls the function once per column, passing each column in as a Series):

def append_db(data):
    try:
        data.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
        return 'Success'
    except Exception as e:
        print("Initial failure to append: {}\n".format(e))
        print("Attempting to rectify...")
        # pull existing ids and drop the incoming rows that collide with them
        existing = pd.read_sql('foo', con=engine)
        to_insert = data.reset_index().rename(columns={'index': 'id'})
        mask = ~to_insert.id.isin(existing.id)
        try:
            to_insert.loc[mask].to_sql('foo', con=engine, index=False, if_exists='append')
            print("Successful deduplication.")
            return 'Success after dedupe'
        except Exception as e2:
            print("Could not rectify duplicate entries.\n{}".format(e2))
            return 'Failed to rectify'

df2.apply(append_db)

Output:

Initial failure to append: (sqlite3.IntegrityError) UNIQUE constraint failed: foo.id [SQL: 'INSERT INTO foo (id, foo) VALUES (?, ?)'] [parameters: ((2, 3), (3, 4), (4, 5))]

Attempting to rectify...
Successful deduplication.

foo    Success after dedupe
dtype: object
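
If you would rather have duplicate rows skipped silently instead of handled in a try/except, newer pandas (0.24 and later, so more recent than this answer) lets to_sql take a method callable. A minimal sketch for SQLite, reusing the foo table from above (insert_or_ignore is an illustrative helper name):

from sqlalchemy import insert

def insert_or_ignore(pd_table, conn, keys, data_iter):
    # pd_table.table is the underlying SQLAlchemy Table object
    rows = [dict(zip(keys, row)) for row in data_iter]
    # SQLite's INSERT OR IGNORE silently drops rows that violate the primary key
    conn.execute(insert(pd_table.table).prefix_with("OR IGNORE"), rows)

df2.to_sql('foo', con=engine, index=True, index_label='id',
           if_exists='append', method=insert_or_ignore)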
Teetotalism answered 2/9, 2017 at 21:23 Comment(3)
Thank you for your reply, however in the post that mentioned IntegrityError as a solution it didn't require any extra steps. After all, I really want to avoid having to create temporary databases. I am working with Flask-SQLAlchemy; at first I thought that by also defining the Model and setting my index there as primary key it would work, but it didn't (I guess the table in models and the table created by pandas are different after all). Isn't there a way to set my primary key directly with pandas, or a solution with SQLAlchemy?Zootechnics
You cannot set schema details with Pandas, although you can specify an existing schema with the schema argument. You did not specify why catching the IntegrityError didn't work, which is why I demonstrated a solution where it does work. This solution does use SQLAlchemy... I'm afraid I'm a bit unclear as to exactly what your problem is. Please consider updating your original post with a proper MCVE.Teetotalism
Please have a look at the MCVE guidelines - the example code you posted isn't minimal, complete, or verifiable. If you correctly specify a primary key in your table, you will not get duplicates, but you will get an error when you use to_sql. The form of my solution doesn't create a temporary database, but it does check against existing entries to find duplicates. I'm not sure you'll be able to get around that step.Teetotalism
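
For readers who, like the asker, are on Flask-SQLAlchemy: a minimal sketch of pre-declaring the table so the database enforces uniqueness before to_sql runs. The Log model and its columns are illustrative, and this only helps if the DataFrame's columns and index label line up with the declared table:

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///test.db'
db = SQLAlchemy(app)

class Log(db.Model):
    __tablename__ = 'Logs'
    id = db.Column(db.Integer, primary_key=True)  # to_sql's index maps onto this column
    message = db.Column(db.String)                # illustrative payload column

with app.app_context():
    db.create_all()

# Appending now targets the declared table, so a duplicate id raises an
# IntegrityError instead of being silently inserted:
# df.to_sql('Logs', con=db.engine, if_exists='append', index=True, index_label='id')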
