I'm trying to restructure a dataframe by combining values from multiple pairs of columns into a single column for each pair. For example, I've got a column called age_pre
and a column called age_post
, and I want to combine these two columns into one column called age
. Same with the exposure_pre
and exposure_post
columns.
df = pl.from_repr("""
┌─────┬─────────┬──────────┬──────────────┬───────────────┐
│ id ┆ age_pre ┆ age_post ┆ exposure_pre ┆ exposure_post │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
╞═════╪═════════╪══════════╪══════════════╪═══════════════╡
│ 123 ┆ 55 ┆ 56 ┆ 0.49 ┆ 0.51 │
│ 123 ┆ 56 ┆ 57 ┆ 0.49 ┆ 0.51 │
│ 456 ┆ 49 ┆ 50 ┆ 0.8 ┆ 0.2 │
│ 456 ┆ 50 ┆ 51 ┆ 0.8 ┆ 0.2 │
└─────┴─────────┴──────────┴──────────────┴───────────────┘
""")
My desired output would look like this
shape: (8, 3)
┌─────┬─────┬──────────┐
│ id ┆ age ┆ exposure │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪══════════╡
│ 123 ┆ 55 ┆ 0.49 │
│ 123 ┆ 56 ┆ 0.51 │
│ 123 ┆ 56 ┆ 0.49 │
│ 123 ┆ 57 ┆ 0.51 │
│ 456 ┆ 49 ┆ 0.8 │
│ 456 ┆ 50 ┆ 0.2 │
│ 456 ┆ 50 ┆ 0.8 │
│ 456 ┆ 51 ┆ 0.2 │
└─────┴─────┴──────────┘
I tried creating multiple unpivoted dataframes and using .hstack()
to create a dataframe with my desired output, put this seems really inefficient. How can I do this with the Polars syntax?
You can use pl.concat_list
then .explode
.
out = df.with_columns(
pl.concat_list("age_pre", "age_post").alias("age"),
pl.concat_list("exposure_pre", "exposure_post").alias("exposure"),
).select("id", "age", "exposure").explode("age", "exposure")
print(out)
shape: (8, 3)
┌─────┬─────┬──────────┐
│ id ┆ age ┆ exposure │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪══════════╡
│ 123 ┆ 55 ┆ 0.49 │
│ 123 ┆ 56 ┆ 0.51 │
│ 123 ┆ 56 ┆ 0.49 │
│ 123 ┆ 57 ┆ 0.51 │
│ 456 ┆ 49 ┆ 0.8 │
│ 456 ┆ 50 ┆ 0.2 │
│ 456 ┆ 50 ┆ 0.8 │
│ 456 ┆ 51 ┆ 0.2 │
└─────┴─────┴──────────┘