pythonpandasisin

isin not working for searching a subset of columns for unique values


I have a data frame that have repetitions of some multiple column value combinations. I'm trying to extract the columns that have more than two repetitions. I have the following code

d = {'main_id': [1, 1, 1, 2, 2, 3, 3, 3], 'type': [1, 1, 1, 1, 1, 2, 2, 2], 'instant': [1, 2, 3, 1, 2, 1, 2, 3]}
df = pd.DataFrame(data=d)

uniqueCombinations = df.groupby(['main_id', 'type']).size().reset_index().rename(columns={0: 'Count'})
uniqueCombinations = uniqueCombinations.loc[uniqueCombinations['Count'] >= 3]
uniqueCombinations = uniqueCombinations.reset_index()
compare1 = df[['main_id', 'type']]
compare2 = uniqueCombinations[['main_id', 'type']]
mask = (compare1.isin(compare2)).all(axis=1)

What I do is to extract the rows that have [1,1] and [3,2] in the main_id and type columns in the df data frame. It's only matching only one row. Where am I doing wrong?


Solution

  • Your compare1 and compare2 datasets include the index of each row as a column. So a value in compare1 is only going to match up with a value in compare2 if it is in the same position in the dataset as well as having the two column values match. You can see this in the debugger. Here's the state of these variables just before the line containing the isin:

    compare1: [0 1 1] [1 1 1] [2 1 1] [3 2 1] [4 2 1] [5 3 2] [6 3 2] [7 3 2]
     
    compare2: [0 1 1] [1 3 2]
    

    So what you need to do is modify your code so that the first column value, the index of each row, is not part of the comparison.