A contains columns x, y1, y2, y3 & y4. I am interested in studying the {y1,y2,y3,y4} sequence w.r.t x. To find unique {y1,y2,y3,y4} sequence occurring for each x, I do the following:
B = pd.DataFrame()
for x_temp in A['x'].unique():
B = pd.concat([B, A[A['x'] == x_temp][['x','y1','y2','y3','y4']]])
B = B.drop_duplicates().sort_values(by=['x','y1','y2','y3','y4'])
del x_temp
I want to introduce a new column called 'count' in B, which contains # of unique {y1,y2,y3,y4} that occurred for that specific x in A.
B['count'] = A.apply(lambda row: (A['y1'] == row['y1']) & (A['y2'] == row['y2']) & (A['y3'] == row['y3']) & (A['y4'] == row['y4']), axis=1).sum()
This works, however, it doesn't work if A or B has missing values. I want it to treat missing values also as a unique value.
Example:
A = pd.DataFrame({'x':['1','1','1','1','2','2','2','2','2','1'],
'y1':['1','2','2',np.nan,'2',np.nan,'2','2','2','1'],
'y2':['2','1','2','2',np.nan,np.nan,'2','2','1','2'],
'y3':['1','1',np.nan,'2',np.nan,'2',np.nan,np.nan,'1','1'],
'y4':['2','2','2',np.nan,np.nan,'1','2','2','2','2']})
B = pd.DataFrame()
for x_temp in A['x'].unique():
B = pd.concat([B, A[A['x'] == x_temp][['x','y1','y2','y3','y4']]])
B = B.drop_duplicates().sort_values(by=['x','y1','y2','y3','y4'])
del x_temp
B['count'] = A.apply(lambda row: (A['y1'] == row['y1']) & (A['y2'] == row['y2']) & (A['y3'] == row['y3']) & (A['y4'] == row['y4']), axis=1).sum()
print(B)
x y1 y2 y3 y4 count
0 1 1 2 1 2 2
1 1 2 1 1 2 2
2 1 2 2 NaN 2 0
3 1 NaN 2 2 NaN 0
8 2 2 1 1 2 2
6 2 2 2 NaN 2 0
4 2 2 NaN NaN NaN 0
5 2 NaN NaN 2 1 0
Assuming you want to count the values and then get the sum across the different xs, you could use:
cols = ['x','y1','y2','y3','y4']
out = (A[cols].value_counts(dropna=False, sort=False)
.reset_index(name='count')
.sort_values(by=cols, na_position='last')
.assign(count=lambda x: x.groupby(cols[1:], dropna=False)
['count'].transform('sum')
)
)
Output:
x y1 y2 y3 y4 count
0 1 1 2 1 2 2
1 1 2 1 1 2 2
2 1 2 2 NaN 2 3
3 1 NaN 2 2 NaN 1
4 2 2 1 1 2 2
5 2 2 2 NaN 2 3
6 2 2 NaN NaN NaN 1
7 2 NaN NaN 2 1 1
If you want to set the rows with NaNs as 0:
cols = ['x','y1','y2','y3','y4']
out = (A[cols].value_counts(dropna=False, sort=False)
.reset_index(name='count')
.sort_values(by=cols, na_position='last')
.assign(count=lambda x: x.groupby(cols[1:])
['count'].transform('sum')
.fillna(0).convert_dtypes()
)
)
Output:
x y1 y2 y3 y4 count
0 1 1 2 1 2 2
1 1 2 1 1 2 2
2 1 2 2 NaN 2 0
3 1 NaN 2 2 NaN 0
4 2 2 1 1 2 2
5 2 2 2 NaN 2 0
6 2 2 NaN NaN NaN 0
7 2 NaN NaN 2 1 0
If you don't want to sum across the xs:
cols = ['x','y1','y2','y3','y4']
out = (A[cols].value_counts(dropna=False, sort=False)
.reset_index(name='count')
.sort_values(by=cols, na_position='last')
)
Output:
x y1 y2 y3 y4 count
0 1 1 2 1 2 2
1 1 2 1 1 2 1
2 1 2 2 NaN 2 1
3 1 NaN 2 2 NaN 1
4 2 2 1 1 2 1
5 2 2 2 NaN 2 2
6 2 2 NaN NaN NaN 1
7 2 NaN NaN 2 1 1