I have a CSV file with multiple categorical columns, but most of these columns contain messy data due to typing mistakes (e.g., 'spciulated', 'SPICULATED', etc. for the category 'spiculated' of the column 'margins'). Is there a standard way to deal with such situations?
To be precise, I would like to read the CSV file directly into a clean DataFrame with a dtype category for the categorical columns, but with all variants collapsed into one category (e.g., each variant of 'spiculated' would be read as 'spiculated'). The spelling variants could be given by a dict, for instance.
Expected solution:
import pandas as pd
FEAT_VALS = {
"margins": {
"spiculated": ["spiculated", "spiiculated", "SPICULATED"],
"circumscribed": ["circumscribed", "cicumscribed"],
}
}
# somehow give FEAT_VALS to read_csv
df = pd.read_csv('test.csv', dtype='category')
df.margins
where test.csv is:
margins
spiculated
spiiculated
SPICULATED
circumscribed
cicumscribed
to obtain:
0 spiculated
1 spiculated
2 spiculated
3 circumscribed
4 circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']
However, without the spelling variants information, I get:
0 spiculated
1 spiiculated
2 SPICULATED
3 circumscribed
4 cicumscribed
Name: margins, dtype: category
Categories (5, object): ['SPICULATED', 'cicumscribed', 'circumscribed', 'spiculated', 'spiiculated']
My current solution: Looks like this
df2 = pd.read_csv('test.csv')
for feat, feat_vals in FEAT_VALS.items():
for enc_val, str_vals in feat_vals.items():
df2.loc[df2[feat].isin(str_vals), feat] = enc_val
df2.margins = df2.margins.astype('category')
You could flip the inner dicts then use .map()
. This isn't as direct as you want, but at least the code using Pandas is cleaner.
for feat, feat_vals in FEAT_VALS.items():
feat_strs = {
str_val: enc_val
for enc_val, str_vals in feat_vals.items()
for str_val in str_vals
}
df[feat] = df[feat].map(feat_strs).astype('category')
# For demo
print(df[feat].cat.categories)
Output:
Index(['circumscribed', 'spiculated'], dtype='object')