Coming from R I am remaking some exercises that helped me a lot. So Trying to recreate this R code:
wide_data <- read_csv('https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/life-expectancy-and-fertility-two-countries-example.csv')
new_tidy_data <- pivot_longer(wide_data, `1960`:`2015`, names_to = "year", values_to = "fertility")
The data looks like this (I don't know how to paste the output) but has 113 columns: First is country, and then 1960_fertility 1960_life_expectancy 1961_fertility 1961_life_expectancy ..... 2015_fertility 2015_life_expectancy
and 2 rows germany, south Korea
expected result:
head(new_tidy_data)
#> # A tibble: 6 × 3
#> country year fertility
#> <chr> <chr> <dbl>
#> 1 Germany 1960 2.41
#> 2 Germany 1961 2.44
#> 3 Germany 1962 2.47
#> 4 Germany 1963 2.49
#> 5 Germany 1964 2.49
#> # ℹ 1 more row
So far my code looks like this:
import polars as pl
import polars.selectors as cs
df = pl.read_csv('https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/life-expectancy-and-fertility-two-countries-example.csv')
df.pivot()
Thanks!!
In polars you've got pivot
which makes things wider and then unpivot
to make them longer.
Unpivot won't split up the single columns with a delimitator into two columns for you, you've got to do that yourself.
That looks like this...
(
df
.unpivot(index='country', value_name='fertility')
.with_columns(
pl.col('variable').str.splitn('_',2).struct.rename_fields(['year','var'])
)
.unnest('variable')
.filter(pl.col('var')=='fertility')
.drop('var')
.sort('country')
)
The way polars expressions work is that for every input there is just one output but there's a type called a struct which can be nested with as many columns as is needed. In that way we splitn
the variable column into the year part and the var part. We can convert that struct into two regular columns with unnest
which is dispatched at the df level rather than as an expression.
On second though, since you want to filter for only fertility anyway, you can pre-filter and rename the columns before you unpivot like this:
(
df
.select(
pl.col("^(country|.+fertility)$")
.name.map(lambda x: x.replace("_fertility", "")))
.unpivot(index='country', variable_name='year', value_name='fertility')
.sort('country')
)
shape: (112, 3)
┌─────────────┬──────┬───────────┐
│ country ┆ year ┆ fertility │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═════════════╪══════╪═══════════╡
│ Germany ┆ 1960 ┆ 2.41 │
│ Germany ┆ 1961 ┆ 2.44 │
│ Germany ┆ 1962 ┆ 2.47 │
│ Germany ┆ 1963 ┆ 2.49 │
│ Germany ┆ 1964 ┆ 2.49 │
│ … ┆ … ┆ … │
│ South Korea ┆ 2011 ┆ 1.29 │
│ South Korea ┆ 2012 ┆ 1.3 │
│ South Korea ┆ 2013 ┆ 1.32 │
│ South Korea ┆ 2014 ┆ 1.34 │
│ South Korea ┆ 2015 ┆ 1.36 │
└─────────────┴──────┴───────────┘
If you wanted a column for fertility and life_expectancy, you'd need to combine the first approach with a pivot at the end like this:
(
df
.unpivot(index='country')
.with_columns(
pl.col('variable').str.splitn('_',2).struct.rename_fields(['year','var'])
)
.unnest('variable')
.pivot(on='var', index=['country','year'])
.sort('country')
)
shape: (112, 4)
┌─────────────┬──────┬───────────┬─────────────────┐
│ country ┆ year ┆ fertility ┆ life_expectancy │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞═════════════╪══════╪═══════════╪═════════════════╡
│ Germany ┆ 1960 ┆ 2.41 ┆ 69.26 │
│ Germany ┆ 1961 ┆ 2.44 ┆ 69.85 │
│ Germany ┆ 1962 ┆ 2.47 ┆ 70.01 │
│ Germany ┆ 1963 ┆ 2.49 ┆ 70.1 │
│ Germany ┆ 1964 ┆ 2.49 ┆ 70.66 │
│ … ┆ … ┆ … ┆ … │
│ South Korea ┆ 2011 ┆ 1.29 ┆ 80.6 │
│ South Korea ┆ 2012 ┆ 1.3 ┆ 80.7 │
│ South Korea ┆ 2013 ┆ 1.32 ┆ 80.9 │
│ South Korea ┆ 2014 ┆ 1.34 ┆ 80.9 │
│ South Korea ┆ 2015 ┆ 1.36 ┆ 81.0 │
└─────────────┴──────┴───────────┴─────────────────┘