rtidyr

Pivot a range of columns based on a character string in R


Assume a data frame like this:

df <- 
data.frame(
  total_income        = c(100, 500, 2000),
  ten_twenty_k        = c(90, 480, 300),
  twenty_one_thirty_k = c(10, 20, 1700),
  total_age           = c(1000, 2000, 3000),
  age_15_lower        = c(10, 20, 50),
  age_16_30           = c(900, 500, 100),
  age_31_50           = c(90, 1300, 2400),
  total_language      = c(200, 3000, 1500),
  english             = c(100, 1000, 600),
  french              = c(50, 1000, 500),
  spanish             = c(25, 100, 100),
  german              = c(10, 900, 150),
  russian             = c(15, 1000, 150)
) |> 
  tibble::rowid_to_column("index")

df

  index total_income ten_twenty_k twenty_one_thirty_k total_age age_15_lower age_16_30 age_31_50
1     1          100           90                  10      1000           10       900        90
2     2          500          480                  20      2000           20       500      1300
3     3         2000          300                1700      3000           50       100      2400
  total_language english french spanish german russian
1            200     100     50      25     10      15
2           3000    1000   1000     100    900    1000
3           1500     600    500     100    150     150


I'd like to pivot this data frame based on the columns that start with "total_". Basically, so it'd look something like this.

   index group_income count_income group_age    count_age group_language count_language
   <int> <chr>               <dbl> <chr>            <dbl> <chr>                   <dbl>
 1     1 total_income          100 total_age         1000 total_language            200
 2     1 total_income          100 total_age         1000 english                   100
 3     1 total_income          100 total_age         1000 french                     50
 4     1 total_income          100 total_age         1000 spanish                    25
 5     1 total_income          100 total_age         1000 german                     10
 6     1 total_income          100 total_age         1000 russian                    15
 7     1 total_income          100 age_15_lower        10 total_language            200
 8     1 total_income          100 age_15_lower        10 english                   100
 9     1 total_income          100 age_15_lower        10 french                     50
10     1 total_income          100 age_15_lower        10 spanish                    25

Afterwards, I would split the data frame into separate tables based on observational unit (i.e., separate objects/tables for income, age, language) and use distinct() from dplyr to remove the duplicate rows caused by level lengths in some variables being longer than others.

I know a tedious way of achieving this for a variable like income is using pivot_longer()like this:

library(tidyr)
library(dplyr)

df_income <-
df |>  
pivot_longer(
    cols      = 2:4,
    names_to  = "group_income",
    values_to = "count_income"
  ) |> 
  pivot_longer(
    cols      = 2:5,
    names_to  = "group_age",
    values_to = "count_age"
  ) |> 
    pivot_longer(
      cols      = 2:7,
      names_to  = "group_language",
      values_to = "count_language"
    ) |>
  select(
    1:3
  ) |> 
  distinct(income, count_income)

df_income

  index group_income        count_income
  <int> <chr>                      <dbl>
1     1 total_income                 100
2     1 ten_twenty_k                  90
3     1 twenty_one_thirty_k           10
4     2 total_income                 500
5     2 ten_twenty_k                 480
6     2 twenty_one_thirty_k           20
7     3 total_income                2000
8     3 ten_twenty_k                 300
9     3 twenty_one_thirty_k         1700

However, I know there's a more efficient way of doing this using regex within pivot_longer(). Fortunately, all the variables that need to be pivoted together follow the column that starts with "total_". However, I'm not quite sure how to tell R to pivot the columns that follow a "total_" column, including that "total_" column itself, but stop when the next "total_" column appears and do the same thing. Does anybody have suggestions on how to do this pivoting?

Please let me know if I can clarify anything. Thank you!


Solution

  • Since you want as your output three different data frames coming from three different groupings of data with different numbers of columns, it doesn't make a lot of sense to try to pivot first then split the output. You can more easily split your columns into groupings with regular expressions and pivot each inside lapply

    library(tidyverse)
    
    split(names(df)[-1], cumsum(grepl("total_", names(df)[-1]))) |>
      lapply(function(x) {
        df[c("index", x)] |> 
          pivot_longer(-1) |>
          set_names(c("index", 
                      paste0(c("group_", "count_"), gsub("total_", "", x[1]))))
      }) 
    

    This gives you all three of your desired output tables in a list:

    #> $`1`
    #> # A tibble: 9 x 3
    #>   index group_income        count_income
    #>   <int> <chr>                      <dbl>
    #> 1     1 total_income                 100
    #> 2     1 ten_twenty_k                  90
    #> 3     1 twenty_one_thirty_k           10
    #> 4     2 total_income                 500
    #> 5     2 ten_twenty_k                 480
    #> 6     2 twenty_one_thirty_k           20
    #> 7     3 total_income                2000
    #> 8     3 ten_twenty_k                 300
    #> 9     3 twenty_one_thirty_k         1700
    #> 
    #> $`2`
    #> # A tibble: 12 x 3
    #>    index group_age    count_age
    #>    <int> <chr>            <dbl>
    #>  1     1 total_age         1000
    #>  2     1 age_15_lower        10
    #>  3     1 age_16_30          900
    #>  4     1 age_31_50           90
    #>  5     2 total_age         2000
    #>  6     2 age_15_lower        20
    #>  7     2 age_16_30          500
    #>  8     2 age_31_50         1300
    #>  9     3 total_age         3000
    #> 10     3 age_15_lower        50
    #> 11     3 age_16_30          100
    #> 12     3 age_31_50         2400
    #> 
    #> $`3`
    #> # A tibble: 18 x 3
    #>    index group_language count_language
    #>    <int> <chr>                   <dbl>
    #>  1     1 total_language            200
    #>  2     1 english                   100
    #>  3     1 french                     50
    #>  4     1 spanish                    25
    #>  5     1 german                     10
    #>  6     1 russian                    15
    #>  7     2 total_language           3000
    #>  8     2 english                  1000
    #>  9     2 french                   1000
    #> 10     2 spanish                   100
    #> 11     2 german                    900
    #> 12     2 russian                  1000
    #> 13     3 total_language           1500
    #> 14     3 english                   600
    #> 15     3 french                    500
    #> 16     3 spanish                   100
    #> 17     3 german                    150
    #> 18     3 russian                   150