pandasuniqueprimary-keyflask-sqlalchemypandas-to-sql

Pandas to_sql make index unique


I have been readin about pandas to_sql solutions to not add duplicate records to a database. I am working with csv files of logs, each time i upload a new log file i then read the data and make some changes with pandas creating a new dataframe. Then i execute to_sql('Logs',con = db.engine, if_exists = 'append', index=True). With the if_exists arg i make sure each time the new created dataframe from the new file is appended to the existing database. The problem is it keeps adding duplicating values. I want to make sure that if a file which has already been uploaded is by mistake uploaded again it won't be appended to the database. I want to try do this directly when creating the database withouth finding a workaround like just checking if the filename has been used before.

I am working with flask-sqlalchemy.

Thank you.


Solution

  • Your best bet is to catch duplicates by setting up your index as a primary key, and then using try/except to catch uniqueness violations. You mentioned another post that suggested watching for IntegrityError exceptions, and I agree that's the best approach. You can combine that with a de-deuplication function to make sure your table updates run smoothly.

    Demonstrating the problem

    Here's a toy example:

    from sqlalchemy import *
    import sqlite3
    
    # make a database, 'test', and a table, 'foo'.
    conn = sqlite3.connect("test.db")
    c = conn.cursor()
    # id is a primary key.  this will be the index column imported from to_sql().
    c.execute('CREATE TABLE foo (id integer PRIMARY KEY, foo integer NOT NULL);')
    # use the sqlalchemy engine.
    engine = create_engine('sqlite:///test.db')
    
    pd.read_sql("pragma table_info(foo)", con=engine)
    
       cid name     type  notnull dflt_value  pk
    0    0   id  integer        0       None   1
    1    1  foo  integer        1       None   0
    

    Now, two example data frames, df and df2:

    data = {'foo':[1,2,3]}
    df = pd.DataFrame(data)
    df
       foo
    0    1
    1    2
    2    3
    
    data2 = {'foo':[3,4,5]}
    df2 = pd.DataFrame(data2, index=[2,3,4])
    df2
       foo
    2    3       # this row is a duplicate of df.iloc[2,:]
    3    4
    4    5
    

    Move df into table foo:

    df.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
    
    pd.read_sql('foo', con=engine)
       id  foo
    0   0    1
    1   1    2
    2   2    3
    

    Now, when we try to append df2, we catch the IntegrityError:

    try:
        df2.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
    # use the generic Exception, both IntegrityError and sqlite3.IntegrityError caused trouble.
    except Exception as e: 
        print("FAILURE TO APPEND: {}".format(e))
    

    Output:

    FAILURE TO APPEND: (sqlite3.IntegrityError) UNIQUE constraint failed: foo.id [SQL: 'INSERT INTO foo (id, foo) VALUES (?, ?)'] [parameters: ((2, 3), (3, 4), (4, 5))]
    

    Suggested Solution

    On IntegrityError, you can pull the existing table data, remove the duplicate entries of your new data, and then retry the append statement. Use apply() for this:

    def append_db(data):
        try:
            data.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
            return 'Success'
        except Exception as e:
            print("Initial failure to append: {}\n".format(e))
            print("Attempting to rectify...")
            existing = pd.read_sql('foo', con=engine)
            to_insert = data.reset_index().rename(columns={'index':'id'})
            mask = ~to_insert.id.isin(existing.id)
            try:
                to_insert.loc[mask].to_sql('foo', con=engine, index=False, if_exists='append')
                print("Successful deduplication.")
            except Exception as e2:
                "Could not rectify duplicate entries. \n{}".format(e2)
            return 'Success after dedupe'
    
    df2.apply(append_db)
    

    Output:

    Initial failure to append: (sqlite3.IntegrityError) UNIQUE constraint failed: foo.id [SQL: 'INSERT INTO foo (id, foo) VALUES (?, ?)'] [parameters: ((2, 3), (3, 4), (4, 5))]
    
    Attempting to rectify...
    Successful deduplication.
    
    foo    Success after dedupe
    dtype: object