pythonpandasboolean

Pandas convert string column to bool but convert typos to false


I have an application that takes an input spreadsheet filled in by the user. I'm trying to bug fix, and I've just noticed that in one True/False column, they've written FLASE instead of FALSE. I'm trying to write in as many workarounds for user error as I can, as the users of this app aren't very technical, so I was wondering if there's a way to convert this column to type bool, but set any typos to False? I appreciate this would also set something like TURE to be false too.

For example:

df = pd.DataFrame({'bool_col':['True', 'Flase', 'False', 'True'], 'foo':[1,2,3,4]})

Running df['bool_col'] = df['bool_col'].astype(bool) returns True for everything (as they're all non-empty strings), however I would like it to reutrn True, False, False, True.


Solution

  • If you want a generic approach, you could use fuzzy matching, for example with thefuzz:

    from thefuzz import process
    
    def to_bool(s, threshold=60):
        bools   = [True, False]
        choices = list(map(str, bools))
        match, score = process.extractOne(s, choices)
        d = dict(zip(choices, bools))
        if score > threshold:
            return d[match]
    
    df['bool'] = df['bool_col'].map(to_bool)
    

    Example:

      bool_col  foo   bool
    0     True    1   True
    1    Flase    2  False
    2    False    3  False
    3     True    4   True
    4     ture    5   True
    5   banana    6   None