I am working with some complicated patient health data and would like to simplify categorizing the types of insurance patients use. However, some patients use multiple insurance types to pay for a medical visit (ie Medicare and supplemental insurance). I have already categorized insurance as either private or public and would now like to categorize if patients are using more than one insurance type.
My current data set looks like below:
import pandas as pd
data = {'patient_ID': [1,1,1,2,2,3,3,3,4,4,4,4],
'Public_Private': ['Public', 'Public', 'Public',
'Private', 'Public','Private', 'Private',
'Private', 'Private', 'Public', 'Private',
'Public']}
df = pd.DataFrame(data)
df
patient_ID | Public_Private |
---|---|
1 | Public |
1 | Private |
2 | Private |
2 | Private |
2 | Private |
If the types of insurance differ for a given patient ID, in a new column, I want to return "Mixed". If the types of insurance are the same, I want to return the original value. For example,
patient_ID | Public_Private | Simple |
---|---|---|
1 | Public | Mixed |
1 | Private | Mixed |
2 | Private | Private |
2 | Private | Private |
2 | Private | Private |
As you can see, the number of rows for each patient_ID will vary. The code will need to be flexible to account for this.
You can group by the patient id, select the public_private column from the groups, and then apply a transformation, which you can populate back to the original rows of the groups in a new column.
import pandas as pd
data = {
'patient_ID': [1, 1, 2, 2, 3, 3],
'Public_Private': ['Public', 'Private',
'Private', 'Private',
'Public', 'Public'],
'Other_info': [1.1, 1.2,
2.1, 2.2,
3.1, 3.2]
}
df = pd.DataFrame(data)
print(df)
# patient_ID Public_Private Other_info
# 1 Public 1.1
# 1 Private 1.2
# 2 Private 2.1
# 2 Private 2.2
# 3 Public 3.1
# 3 Public 3.2
def simple(g_arr: pd.Series):
# ex. g_arr = ['Private', 'Public'] or ['Public', 'Public'], etc.
if len(set(g_arr)) == 1:
return g_arr.iloc[0]
return 'Mixed'
df['Simple'] = df.groupby('patient_ID')['Public_Private'].transform(simple)
print(df)
# patient_ID Public_Private Other_info Simple
# 1 Public 1.1 Mixed
# 1 Private 1.2 Mixed
# 2 Private 2.1 Private
# 2 Private 2.2 Private
# 3 Public 3.1 Public
# 3 Public 3.2 Public