I have a Polars DataFrame with a mix of Series that I want to write to a CSV or upload to a database.
The problem is that if any of the UTF8 series contain non-ASCII characters, the write fails due to the DB type I'm using, so I would like to filter out the non-ASCII characters whilst leaving everything else intact.
I created a function that uses a lambda, which does work, but it is slow compared with native Polars functions, so I was hoping to replace it with a Polars alternative:
def df_column_clean(df: pl.DataFrame, drop_non_ascii: bool = False):
    """
    Takes a Polars DataFrame and performs data cleaning on all columns.
    Currently it only converts string series to ASCII but can be expanded in the future.
    """
    if drop_non_ascii:
        df_changes = []
        df_columns = df.schema
        for col_name, col_type in df_columns.items():
            if col_type != pl.Utf8:
                continue
            # Remove non-ASCII characters
            df_changes.append(
                pl.col(col_name).apply(
                    lambda x: None if x is None else x.encode("ascii", "ignore").decode("ascii"),
                    skip_nulls=False,
                )
            )
        if len(df_changes) > 0:
            return df.with_columns(df_changes)
    return df
Is the method I came up with the best option, or does Polars have an inbuilt function that can be used to filter out non-ASCII characters?
Thanks in advance.
You can use .str.replace_all() with a regex that matches non-ASCII characters:
pl.col(pl.String).str.replace_all(r"[^\p{Ascii}]", "")