pythonpandasdataframemergeanti-join

Anti-Join Pandas


I have two tables and I would like to append them so that only all the data in table A is retained and data from table B is only added if its key is unique (Key values are unique in table A and B however in some cases a Key will occur in both table A and B).

I think the way to do this will involve some sort of filtering join (anti-join) to get values in table B that do not occur in table A then append the two tables.

I am familiar with R and this is the code I would use to do this in R.

library("dplyr")

## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")

## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)

How would I achieve this in python?


Solution

  • Consider the following dataframes

    TableA = pd.DataFrame(np.random.rand(4, 3),
                          pd.Index(list('abcd'), name='Key'),
                          ['A', 'B', 'C']).reset_index()
    TableB = pd.DataFrame(np.random.rand(4, 3),
                          pd.Index(list('aecf'), name='Key'),
                          ['A', 'B', 'C']).reset_index()
    

    TableA
    

    enter image description here


    TableB
    

    enter image description here

    This is one way to do what you want

    Method 1

    # Identify what values are in TableB and not in TableA
    key_diff = set(TableB.Key).difference(TableA.Key)
    where_diff = TableB.Key.isin(key_diff)
    
    # Slice TableB accordingly and append to TableA
    TableA.append(TableB[where_diff], ignore_index=True)
    

    enter image description here

    Method 2

    rows = []
    for i, row in TableB.iterrows():
        if row.Key not in TableA.Key.values:
            rows.append(row)
    
    pd.concat([TableA.T] + rows, axis=1).T
    

    Timing

    4 rows with 2 overlap

    Method 1 is much quicker

    enter image description here

    10,000 rows 5,000 overlap

    loops are bad

    enter image description here