I have an application that takes an input spreadsheet filled in by the user. I'm trying to bug fix, and I've just noticed that in one True/False column, they've written FLASE
instead of FALSE
. I'm trying to write in as many workarounds for user error as I can, as the users of this app aren't very technical, so I was wondering if there's a way to convert this column to type bool, but set any typos to False? I appreciate this would also set something like TURE to be false too.
For example:
df = pd.DataFrame({'bool_col':['True', 'Flase', 'False', 'True'], 'foo':[1,2,3,4]})
Running df['bool_col'] = df['bool_col'].astype(bool)
returns True for everything (as they're all non-empty strings), however I would like it to reutrn True, False, False, True.
If you want a generic approach, you could use fuzzy matching, for example with thefuzz
:
from thefuzz import process
def to_bool(s, threshold=60):
bools = [True, False]
choices = list(map(str, bools))
match, score = process.extractOne(s, choices)
d = dict(zip(choices, bools))
if score > threshold:
return d[match]
df['bool'] = df['bool_col'].map(to_bool)
Example:
bool_col foo bool
0 True 1 True
1 Flase 2 False
2 False 3 False
3 True 4 True
4 ture 5 True
5 banana 6 None