
Remove non-ASCII characters from a Polars DataFrame


I have a Polars DataFrame with a mix of Series, which I want to write to a CSV file or upload to a database.

The problem is that the upload fails if any of the Utf8 Series contain non-ASCII characters, because of the database type I'm using. I would like to filter out the non-ASCII characters while leaving everything else intact.

I wrote a function that applies a Python lambda to each string column. It works, but it is slow compared with native Polars expressions, and I was hoping to replace it with a Polars alternative:

import polars as pl

def df_column_clean(df: pl.DataFrame, drop_non_ascii: bool = False) -> pl.DataFrame:
    """
    Takes a Polars DataFrame and performs data cleaning on all columns.
    Currently it only converts string Series to ASCII, but can be expanded in the future.
    """
    if drop_non_ascii:
        df_changes = []
        for col_name, col_type in df.schema.items():
            # Only string columns need cleaning
            if col_type != pl.Utf8:
                continue

            # Remove non-ASCII characters, one Python call per value
            # (map_elements was named apply in older Polars releases)
            df_changes.append(
                pl.col(col_name).map_elements(
                    lambda x: None if x is None else x.encode("ascii", "ignore").decode("ascii"),
                    return_dtype=pl.Utf8,
                    skip_nulls=False,
                )
            )

        if len(df_changes) > 0:
            return df.with_columns(df_changes)
    return df
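
To make the per-value behaviour concrete, here is a minimal standalone sketch of what the lambda does to a single string. The helper name `strip_non_ascii` and the sample strings are only for illustration and are not part of the function above:

```python
from typing import Optional


def strip_non_ascii(text: Optional[str]) -> Optional[str]:
    """Drop every character that cannot be encoded as ASCII; pass None through."""
    if text is None:
        return None
    # errors="ignore" silently discards characters outside the ASCII range
    return text.encode("ascii", "ignore").decode("ascii")


samples = ["Straße", "price: 5€", "plain text", None]
cleaned = [strip_non_ascii(s) for s in samples]
```

Because this runs once per value in Python, it is the part I suspect makes the function slow on large frames.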

Is the method I came up with the best option, or does Polars have a built-in function that can be used to filter out non-ASCII characters?

Thanks in advance


Solution

  • Use .str.replace_all() with a regex that matches non-ASCII characters:

    pl.col(pl.String).str.replace_all(r"[^\p{Ascii}]", "")