pythonconcatenationpython-polars

merge some columns in a Polars dataframe and duplicate the others


I have a similar problem to how to select all columns from a list in a polars dataframe, but slightly different:

import polars as pl
import numpy as np
import string

rng = np.random.default_rng(42)
nr = 3
letters = list(string.ascii_letters)
uppercase = list(string.ascii_uppercase)
words, groups = [], []
for i in range(nr):
    word = ''.join([rng.choice(letters) for _ in range(rng.integers(3, 20))])
    words.append(word)
    group = rng.choice(uppercase)
    groups.append(group)
df = pl.DataFrame(
    {
        "a_0": np.linspace(0, 1, nr),
        "a_1": np.linspace(1, 2, nr),
        "a_2": np.linspace(2, 3, nr),
        "b_0": np.random.rand(nr),
        "b_1": 2 * np.random.rand(nr),
        "b_2": 3 * np.random.rand(nr),
        "words": words,
        "groups": groups,
    }
)
print(df)  
shape: (3, 8)
┌─────┬─────┬─────┬──────────┬──────────┬──────────┬─────────────────┬────────┐
│ a_0 ┆ a_1 ┆ a_2 ┆ b_0      ┆ b_1      ┆ b_2      ┆ words           ┆ groups │
│ --- ┆ --- ┆ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---             ┆ ---    │
│ f64 ┆ f64 ┆ f64 ┆ f64      ┆ f64      ┆ f64      ┆ str             ┆ str    │
╞═════╪═════╪═════╪══════════╪══════════╪══════════╪═════════════════╪════════╡
│ 0.0 ┆ 1.0 ┆ 2.0 ┆ 0.653892 ┆ 0.234362 ┆ 0.880558 ┆ OIww            ┆ W      │
│ 0.5 ┆ 1.5 ┆ 2.5 ┆ 0.408888 ┆ 0.213767 ┆ 1.833025 ┆ KkeB            ┆ Z      │
│ 1.0 ┆ 2.0 ┆ 3.0 ┆ 0.423949 ┆ 0.646378 ┆ 0.116173 ┆ NLOAgRxAtjWOHuQ ┆ O      │
└─────┴─────┴─────┴──────────┴──────────┴──────────┴─────────────────┴────────┘

I want again to concatenate the columns a_0, a_1,... into a column a, columns b_0, b_1,... into a column b. However, unlike the preceding question, this time a = [a_0; a_1; ...]. I.e., all the elements of a_0 go first, followed by all the elements of a_1, etc. All the columns whose name doesn't end with a _ followed by a digit (in this example, words and groups) must be duplicated enough times to match the length of a. Let nr and nc be the number of rows/columns in df. Then the output dataframe must have m*nr rows (m=3 in this case) and nc-2*(m-1) columns, i.e.

shape: (9, 4)
┌─────────────────┬────────┬─────┬──────────┐
│ words           ┆ groups ┆ a   ┆ b        │
│ ---             ┆ ---    ┆ --- ┆ ---      │
│ str             ┆ str    ┆ f64 ┆ f64      │
╞═════════════════╪════════╪═════╪══════════╡
│ OIww            ┆ W      ┆ 0.0 ┆ 0.653892 │
│ KkeB            ┆ Z      ┆ 0.5 ┆ 0.408888 │
│ NLOAgRxAtjWOHuQ ┆ O      ┆ 1.0 ┆ 0.423949 │
│ OIww            ┆ W      ┆ 1.0 ┆ 0.234362 │
│ KkeB            ┆ Z      ┆ 1.5 ┆ 0.213767 │
│ NLOAgRxAtjWOHuQ ┆ O      ┆ 2.0 ┆ 0.646378 │
│ OIww            ┆ W      ┆ 2.0 ┆ 0.880558 │
│ KkeB            ┆ Z      ┆ 2.5 ┆ 1.833025 │
│ NLOAgRxAtjWOHuQ ┆ O      ┆ 3.0 ┆ 0.116173 │
└─────────────────┴────────┴─────┴──────────┘

How can I do that?


Solution

  • You can extend this answer to your previous question by @jqurious to include index, such as words and groups, as follows:

    (
        df
        .unpivot(index=["words", "groups"])
        .with_columns(pl.col("variable").str.replace("_.*", ""))
        .with_columns(index = pl.int_range(pl.len()).over("variable"))
        .pivot(on="variable", index=["index", "words", "groups"], values="value")
        .drop("index")
    )
    

    Explanation

    Consider a simplified dataset with n = 2 and just three a_* / b_* columns.

    shape: (2, 8)
    ┌─────┬─────┬─────┬──────────┬──────────┬──────────┬──────────────────┬────────┐
    │ a_0 ┆ a_1 ┆ a_2 ┆ b_0      ┆ b_1      ┆ b_2      ┆ words            ┆ groups │
    │ --- ┆ --- ┆ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---              ┆ ---    │
    │ f64 ┆ f64 ┆ f64 ┆ f64      ┆ f64      ┆ f64      ┆ str              ┆ str    │
    ╞═════╪═════╪═════╪══════════╪══════════╪══════════╪══════════════════╪════════╡
    │ 0.0 ┆ 1.0 ┆ 2.0 ┆ 0.285304 ┆ 1.261851 ┆ 0.295949 ┆ VUvcCgzrycGaKSve ┆ I      │
    │ 1.0 ┆ 2.0 ┆ 3.0 ┆ 0.460023 ┆ 1.89468  ┆ 1.042234 ┆ GXFVckCws        ┆ O      │
    └─────┴─────┴─────┴──────────┴──────────┴──────────┴──────────────────┴────────┘
    

    Using index in the initial pl.DataFrame.unpivot adds the content of the words / groups columns to each row after the unpivot. We also ensure that the index column is created within each group defined by words, groups, and variable.

    (
        df
        .unpivot(index=["words", "groups"])
        .with_columns(pl.col("variable").str.replace("_.*", ""))
        .with_columns(index = pl.int_range(pl.len()).over("variable"))
        .sort("words", "groups", "variable")
    )
    
    shape: (12, 5)
    ┌──────────────────┬────────┬──────────┬──────────┬───────┐
    │ words            ┆ groups ┆ variable ┆ value    ┆ index │
    │ ---              ┆ ---    ┆ ---      ┆ ---      ┆ ---   │
    │ str              ┆ str    ┆ str      ┆ f64      ┆ i64   │
    ╞══════════════════╪════════╪══════════╪══════════╪═══════╡
    │ GXFVckCws        ┆ O      ┆ a        ┆ 1.0      ┆ 0     │
    │ GXFVckCws        ┆ O      ┆ a        ┆ 2.0      ┆ 1     │
    │ GXFVckCws        ┆ O      ┆ a        ┆ 3.0      ┆ 2     │
    │ GXFVckCws        ┆ O      ┆ b        ┆ 0.460023 ┆ 0     │
    │ GXFVckCws        ┆ O      ┆ b        ┆ 1.89468  ┆ 1     │
    │ GXFVckCws        ┆ O      ┆ b        ┆ 1.042234 ┆ 2     │
    │ VUvcCgzrycGaKSve ┆ I      ┆ a        ┆ 0.0      ┆ 0     │
    │ VUvcCgzrycGaKSve ┆ I      ┆ a        ┆ 1.0      ┆ 1     │
    │ VUvcCgzrycGaKSve ┆ I      ┆ a        ┆ 2.0      ┆ 2     │
    │ VUvcCgzrycGaKSve ┆ I      ┆ b        ┆ 0.285304 ┆ 0     │
    │ VUvcCgzrycGaKSve ┆ I      ┆ b        ┆ 1.261851 ┆ 1     │
    │ VUvcCgzrycGaKSve ┆ I      ┆ b        ┆ 0.295949 ┆ 2     │
    └──────────────────┴────────┴──────────┴──────────┴───────┘
    

    Then, the final pivot does no longer group by just index but also the words and groups columns.