Suppose we have a DataFrame:
data = {'person_id': ['person_a', 'person_a', 'person_b','person_b', 'person_c','person_c'],
'categorical_data': ['new', 'new', 'ok', 'bad', 'new', 'bad']}
df = pd.DataFrame(data)
person_id categorical_data
0 person_a new
1 person_a new
2 person_b ok
3 person_b bad
4 person_c new
5 person_c bad
I want to expand the categorical data into multiple columns with counts of each category.
We can group by the person id to get counts:
count_categories = df.groupby('person_id')['categorical_data'].value_counts().reset_index(name='count')
person_id categorical_data count
0 person_a new 2
1 person_b bad 1
2 person_b ok 1
3 person_c bad 1
4 person_c new 1
Then I tried this to create the new columns:
pivoted = count_categories.set_index(['person_id','categorical_data']).unstack('categorical_data')
count
categorical_data bad new ok
person_id
person_a NaN 2.0 NaN
person_b 1.0 NaN 1.0
person_c 1.0 1.0 NaN
This is the form I want, but I'm confused by the MultiIndexing
How can I get rid of the index, or is there a better way to do this? trying reset index yields:
pivoted.reset_index()
person_id count
categorical_data bad new ok
0 person_a NaN 2.0 NaN
1 person_b 1.0 NaN 1.0
2 person_c 1.0 1.0 NaN
Code
use crosstab
out = pd.crosstab(df['person_id'], df['categorical_data'])
out
categorical_data bad new ok
person_id
person_a 0 2 0
person_b 1 0 1
person_c 1 1 0
Or
out1 = (pd.crosstab(df['person_id'], df['categorical_data'])
.reset_index()
.rename_axis(None, axis=1)
)
out1
person_id bad new ok
0 person_a 0 2 0
1 person_b 1 0 1
2 person_c 1 1 0
I don't know your exact desired output, if it's not both out
and out1
, plz provide it.