I have a data frame that have repetitions of some multiple column value combinations. I'm trying to extract the columns that have more than two repetitions. I have the following code
d = {'main_id': [1, 1, 1, 2, 2, 3, 3, 3], 'type': [1, 1, 1, 1, 1, 2, 2, 2], 'instant': [1, 2, 3, 1, 2, 1, 2, 3]}
df = pd.DataFrame(data=d)
uniqueCombinations = df.groupby(['main_id', 'type']).size().reset_index().rename(columns={0: 'Count'})
uniqueCombinations = uniqueCombinations.loc[uniqueCombinations['Count'] >= 3]
uniqueCombinations = uniqueCombinations.reset_index()
compare1 = df[['main_id', 'type']]
compare2 = uniqueCombinations[['main_id', 'type']]
mask = (compare1.isin(compare2)).all(axis=1)
What I do is to extract the rows that have [1,1]
and [3,2]
in the main_id
and type
columns in the df data frame. It's only matching only one row. Where am I doing wrong?
Your compare1
and compare2
datasets include the index of each row as a column. So a value in compare1
is only going to match up with a value in compare2
if it is in the same position in the dataset as well as having the two column values match. You can see this in the debugger. Here's the state of these variables just before the line containing the isin
:
compare1: [0 1 1] [1 1 1] [2 1 1] [3 2 1] [4 2 1] [5 3 2] [6 3 2] [7 3 2]
compare2: [0 1 1] [1 3 2]
So what you need to do is modify your code so that the first column value, the index of each row, is not part of the comparison.