pythoncalculated-columnspython-polars

In polars select all column ending with pattern and add new columns without pattern


I have the following dataframe:

import polars as pl
import numpy as np

df = pl.DataFrame({
    "nrs": [1, 2, 3, None, 5],
    "names_A0": ["foo", "ham", "spam", "egg", None],
    "random_A0": np.random.rand(5),
    "A_A2": [True, True, False, False, False],
})
digit = 0

For each column X whose name ends with the string suf =f'_A{digit}', I want to add an identical column to df, whose name is the same as X, but without suf.

In the example, I need to add columns names and random to the original dataframe df, whose content is identical to that of columns names_A0 and random_A0 respectively.


Solution

  • You can you Polars Selectors along with some basic strings operations to accomplish this. Depending on what you how you expect this problem to evolve, you can jump straight to regular expressions, or use polars.selectors.ends_with/string.removesuffix

    String Suffix Operations

    This approach uses

    - polars.selectors.ends_with # find columns ending with string
    - string.removesuffix        # remove suffix from end of string
    

    translating to

    import polars as pl
    from polars import selectors as cs
    import numpy as np
    import re
    from functools import partial
    
    df = pl.DataFrame(
        {
            "nrs": [1, 2, 3, None, 5],
            "names_A0": ["foo", "ham", "spam", "egg", None],
            "random_A0": np.random.rand(5),
            "A_A2": [True, True, False, False, False],
        }
    )
    digit = 0
    suffix = f'_A{digit}'
    
    print(
        # keep original A0 columns
        df.with_columns(
            cs.ends_with(suffix).name.map(lambda s: s.removesuffix(suffix))
        ),
        # shape: (5, 6)
        # ┌──────┬──────────┬───────────┬───────┬───────┬──────────┐
        # │ nrs  ┆ names_A0 ┆ random_A0 ┆ A_A2  ┆ names ┆ random   │
        # │ ---  ┆ ---      ┆ ---       ┆ ---   ┆ ---   ┆ ---      │
        # │ i64  ┆ str      ┆ f64       ┆ bool  ┆ str   ┆ f64      │
        # ╞══════╪══════════╪═══════════╪═══════╪═══════╪══════════╡
        # │ 1    ┆ foo      ┆ 0.713324  ┆ true  ┆ foo   ┆ 0.713324 │
        # │ 2    ┆ ham      ┆ 0.980031  ┆ true  ┆ ham   ┆ 0.980031 │
        # │ 3    ┆ spam     ┆ 0.242768  ┆ false ┆ spam  ┆ 0.242768 │
        # │ null ┆ egg      ┆ 0.528783  ┆ false ┆ egg   ┆ 0.528783 │
        # │ 5    ┆ null     ┆ 0.583206  ┆ false ┆ null  ┆ 0.583206 │
        # └──────┴──────────┴───────────┴───────┴───────┴──────────┘
    
    
        # drop original A0 columns
        df.select(
            ~cs.ends_with(suffix),
            cs.ends_with(suffix).name.map(lambda s: s.removesuffix(suffix))
        ),
        # shape: (5, 4)
        # ┌──────┬───────┬───────┬──────────┐
        # │ nrs  ┆ A_A2  ┆ names ┆ random   │
        # │ ---  ┆ ---   ┆ ---   ┆ ---      │
        # │ i64  ┆ bool  ┆ str   ┆ f64      │
        # ╞══════╪═══════╪═══════╪══════════╡
        # │ 1    ┆ true  ┆ foo   ┆ 0.713324 │
        # │ 2    ┆ true  ┆ ham   ┆ 0.980031 │
        # │ 3    ┆ false ┆ spam  ┆ 0.242768 │
        # │ null ┆ false ┆ egg   ┆ 0.528783 │
        # │ 5    ┆ false ┆ null  ┆ 0.583206 │
        # └──────┴───────┴───────┴──────────┘
    
        sep='\n\n'
    )
    

    Regular Expressions

    Alternatively you can use regular expressions to detect a range of suffix patterns

    - polars.selectors.matches  # find columns matching a pattern
    - re.sub                    # substitute in string based on pattern
    

    We will need to ensure our pattern ends with a '$' to anchor the pattern to the end of the string.

    import polars as pl
    from polars import selectors as cs
    import numpy as np
    import re
    from functools import partial
    
    df = pl.DataFrame(
        {
            "nrs": [1, 2, 3, None, 5],
            "names_A0": ["foo", "ham", "spam", "egg", None],
            "random_A0": np.random.rand(5),
            "A_A2": [True, True, False, False, False],
        }
    )
    digit=0
    suffix = fr'_A{digit}$'
    
    print(
        # keep original A0 columns
        df.with_columns(
            cs.matches(suffix).name.map(lambda s: re.sub(suffix, '', s))
        ),
        # shape: (5, 6)
        # ┌──────┬──────────┬───────────┬───────┬───────┬──────────┐
        # │ nrs  ┆ names_A0 ┆ random_A0 ┆ A_A2  ┆ names ┆ random   │
        # │ ---  ┆ ---      ┆ ---       ┆ ---   ┆ ---   ┆ ---      │
        # │ i64  ┆ str      ┆ f64       ┆ bool  ┆ str   ┆ f64      │
        # ╞══════╪══════════╪═══════════╪═══════╪═══════╪══════════╡
        # │ 1    ┆ foo      ┆ 0.713324  ┆ true  ┆ foo   ┆ 0.713324 │
        # │ 2    ┆ ham      ┆ 0.980031  ┆ true  ┆ ham   ┆ 0.980031 │
        # │ 3    ┆ spam     ┆ 0.242768  ┆ false ┆ spam  ┆ 0.242768 │
        # │ null ┆ egg      ┆ 0.528783  ┆ false ┆ egg   ┆ 0.528783 │
        # │ 5    ┆ null     ┆ 0.583206  ┆ false ┆ null  ┆ 0.583206 │
        # └──────┴──────────┴───────────┴───────┴───────┴──────────┘
    
    
        # drop original A0 columns
        df.select(
            ~cs.matches(suffix),
            cs.matches(suffix).name.map(lambda s: re.sub(suffix, '', s))
        ),
        # shape: (5, 4)
        # ┌──────┬───────┬───────┬──────────┐
        # │ nrs  ┆ A_A2  ┆ names ┆ random   │
        # │ ---  ┆ ---   ┆ ---   ┆ ---      │
        # │ i64  ┆ bool  ┆ str   ┆ f64      │
        # ╞══════╪═══════╪═══════╪══════════╡
        # │ 1    ┆ true  ┆ foo   ┆ 0.713324 │
        # │ 2    ┆ true  ┆ ham   ┆ 0.980031 │
        # │ 3    ┆ false ┆ spam  ┆ 0.242768 │
        # │ null ┆ false ┆ egg   ┆ 0.528783 │
        # │ 5    ┆ false ┆ null  ┆ 0.583206 │
        # └──────┴───────┴───────┴──────────┘
    
        sep='\n\n'
    )