pythonstringdata-manipulationpython-polars

String manipulation in polars


I have a record in polars which has no header so far. This header should refer to the first row of the record. Before I instantiate this row as header, I want to manipulate the entries.

import polars as pl
# Creating a dictionary with the data
data = {
    "Column_1": ["ID", 4, 4, 4, 4],
    "Column_2": ["LocalValue", "B", "C", "D", "E"],
    "Column_3": ["Data\nField", "Q", "R", "S", "T"],
    "Column_4": [None, None, None, None, None],
    "Column_5": ["Global Value", "G", "H", "I", "J"],
}
# Creating the dataframe
table = pl.DataFrame(data, strict=False)
print(table)
shape: (5, 5)
┌──────────┬────────────┬──────────┬──────────┬──────────────┐
│ Column_1 ┆ Column_2   ┆ Column_3 ┆ Column_4 ┆ Column_5     │
│ ---      ┆ ---        ┆ ---      ┆ ---      ┆ ---          │
│ str      ┆ str        ┆ str      ┆ null     ┆ str          │
╞══════════╪════════════╪══════════╪══════════╪══════════════╡
│ ID       ┆ LocalValue ┆ Data     ┆ null     ┆ Global Value │
│          ┆            ┆ Field    ┆          ┆              │
│ 4        ┆ B          ┆ Q        ┆ null     ┆ G            │
│ 4        ┆ C          ┆ R        ┆ null     ┆ H            │
│ 4        ┆ D          ┆ S        ┆ null     ┆ I            │
│ 4        ┆ E          ┆ T        ┆ null     ┆ J            │
└──────────┴────────────┴──────────┴──────────┴──────────────┘

First, I want to replace line breaks and spaces between words with an underscore. Furthermore I want to fill Camel Cases with an underscore (e.g. TestTest -> Test_Test). Finally, all entries should be lowercase. For this I wrote the following function:

def clean_dataframe_columns(df):
    header = list(df.head(1).transpose().to_series())
    cleaned_headers = []
    for entry in header:
        if entry:
            entry = (
                entry.replace("\n", "_")
                .replace("(?<=[a-z])(?=[A-Z])", "_")
                .replace("\s", "_")
                .to_lowercase()
            )
        else:
            entry = "no_column"
        cleaned_headers.append(entry)
    df.columns = cleaned_headers
    return df

Unfortunately I have the following error. What am I doing wrong?

# AttributeError: 'int' object has no attribute 'replace'

The goal should be this dataframe:

shape: (4, 5)
┌─────┬─────────────┬────────────┬───────────┬──────────────┐
│ id  ┆ local_value ┆ data_field ┆ no_column ┆ global_value │
│ --- ┆ ---         ┆ ---        ┆ ---       ┆ ---          │
│ i64 ┆ str         ┆ str        ┆ f64       ┆ str          │
╞═════╪═════════════╪════════════╪═══════════╪══════════════╡
│ 4   ┆ B           ┆ Q          ┆ null      ┆ G            │
│ 4   ┆ C           ┆ R          ┆ null      ┆ H            │
│ 4   ┆ D           ┆ S          ┆ null      ┆ I            │
│ 4   ┆ E           ┆ T          ┆ null      ┆ J            │
└─────┴─────────────┴────────────┴───────────┴──────────────┘

Solution

  • I solved it on my own with this approach:

    def clean_select_columns(self, df: pl.DataFrame) -> pl.DataFrame:
        """
        Clean columns from a dataframe.
    
        :param df: input Dataframe
        :return: Dataframe with cleaned columns
    
        The function takes a loaded Dataframe and performs the following operations:
    
            Transposes the first row of the dataframe to get the header
            Selects the required columns defined in the list required_columns
            Cleans the header names by:
                1. Replacing special characters with underscores
                2. Converting CamelCase strings to snake_case strings
                3. Converting all columns to lowercase
                4. Naming columns with no names as "no_column_X", where X is a unique integer
                5. Returns the cleaned dataframe.
        """
        header = list(df.head(1).transpose().to_series())
        cleaned_headers = []
        i = 0
        for entry in header:
            if entry:
                entry = (
                    re.sub(r"(?i)([\n ?])", "",
                    re.sub(r"(?<!^)(?=[A-Z][a-z])", "_", entry))
                    .lower()
                )
            else:
                entry = f"no_column_{i}"
            cleaned_headers.append(entry)
            i += 1
        df.columns = cleaned_headers
        return df