pythonpivotpython-polars

Pivot a dataframe using polars.pivot() (like pivot_longer in R)


Coming from R I am remaking some exercises that helped me a lot. So Trying to recreate this R code:

wide_data <- read_csv('https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/life-expectancy-and-fertility-two-countries-example.csv')

new_tidy_data <- pivot_longer(wide_data, `1960`:`2015`, names_to = "year", values_to = "fertility")

The data looks like this (I don't know how to paste the output) but has 113 columns: First is country, and then 1960_fertility 1960_life_expectancy 1961_fertility 1961_life_expectancy ..... 2015_fertility 2015_life_expectancy

and 2 rows germany, south Korea

expected result:

head(new_tidy_data)
#> # A tibble: 6 × 3
#>   country year  fertility
#>   <chr>   <chr>     <dbl>
#> 1 Germany 1960       2.41
#> 2 Germany 1961       2.44
#> 3 Germany 1962       2.47
#> 4 Germany 1963       2.49
#> 5 Germany 1964       2.49
#> # ℹ 1 more row

So far my code looks like this:

import polars as pl
import polars.selectors as cs

df = pl.read_csv('https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/life-expectancy-and-fertility-two-countries-example.csv')
df.pivot()

Thanks!!


Solution

  • In polars you've got pivot which makes things wider and then melt (think of icicles) to make them longer.

    Melt won't split up the single columns with a delimitator into two columns for you, you've got to do that yourself.

    That looks like this...

    (
        df
        .melt('country', value_name='fertility')
        .with_columns(
            pl.col('variable').str.splitn('_',2).struct.rename_fields(['year','var'])
        )
        .unnest('variable')
        .filter(pl.col('var')=='fertility')
        .drop('var')
        .sort('country')
    )
    

    The way polars expressions work is that for every input there is just one output but there's a type called a struct which can be nested with as many columns as is needed. In that way we splitn the variable column into the year part and the var part. We can convert that struct into two regular columns with unnest which is dispatched at the df level rather than as an expression.

    On second though, since you want to filter for only fertility anyway, you can pre-filter and rename the columns before you melt like this:

    (
        df
        .select(
            pl.col("^(country|.+fertility)$")
            .name.map(lambda x: x.replace("_fertility", "")))
        .melt('country', variable_name='year', value_name='fertility')
        .sort('country')
    )
    shape: (112, 3)
    ┌─────────────┬──────┬───────────┐
    │ country     ┆ year ┆ fertility │
    │ ---         ┆ ---  ┆ ---       │
    │ str         ┆ str  ┆ f64       │
    ╞═════════════╪══════╪═══════════╡
    │ Germany     ┆ 1960 ┆ 2.41      │
    │ Germany     ┆ 1961 ┆ 2.44      │
    │ Germany     ┆ 1962 ┆ 2.47      │
    │ Germany     ┆ 1963 ┆ 2.49      │
    │ Germany     ┆ 1964 ┆ 2.49      │
    │ …           ┆ …    ┆ …         │
    │ South Korea ┆ 2011 ┆ 1.29      │
    │ South Korea ┆ 2012 ┆ 1.3       │
    │ South Korea ┆ 2013 ┆ 1.32      │
    │ South Korea ┆ 2014 ┆ 1.34      │
    │ South Korea ┆ 2015 ┆ 1.36      │
    └─────────────┴──────┴───────────┘
    

    Lastly

    If you wanted a column for fertility and life_expectancy, you'd need to combine the first approach with a pivot at the end like this:

    (
        df
        .melt('country')
        .with_columns(
            pl.col('variable').str.splitn('_',2).struct.rename_fields(['year','var'])
        )
        .unnest('variable')
        .pivot(
            values='value', index=['country','year'], 
            columns='var', aggregate_function='first')
        .sort('country')
    shape: (112, 4)
    ┌─────────────┬──────┬───────────┬─────────────────┐
    │ country     ┆ year ┆ fertility ┆ life_expectancy │
    │ ---         ┆ ---  ┆ ---       ┆ ---             │
    │ str         ┆ str  ┆ f64       ┆ f64             │
    ╞═════════════╪══════╪═══════════╪═════════════════╡
    │ Germany     ┆ 1960 ┆ 2.41      ┆ 69.26           │
    │ Germany     ┆ 1961 ┆ 2.44      ┆ 69.85           │
    │ Germany     ┆ 1962 ┆ 2.47      ┆ 70.01           │
    │ Germany     ┆ 1963 ┆ 2.49      ┆ 70.1            │
    │ Germany     ┆ 1964 ┆ 2.49      ┆ 70.66           │
    │ …           ┆ …    ┆ …         ┆ …               │
    │ South Korea ┆ 2011 ┆ 1.29      ┆ 80.6            │
    │ South Korea ┆ 2012 ┆ 1.3       ┆ 80.7            │
    │ South Korea ┆ 2013 ┆ 1.32      ┆ 80.9            │
    │ South Korea ┆ 2014 ┆ 1.34      ┆ 80.9            │
    │ South Korea ┆ 2015 ┆ 1.36      ┆ 81.0            │
    └─────────────┴──────┴───────────┴─────────────────┘
    )