Assume a data frame like this:
df <-
data.frame(
total_income = c(100, 500, 2000),
ten_twenty_k = c(90, 480, 300),
twenty_one_thirty_k = c(10, 20, 1700),
total_age = c(1000, 2000, 3000),
age_15_lower = c(10, 20, 50),
age_16_30 = c(900, 500, 100),
age_31_50 = c(90, 1300, 2400),
total_language = c(200, 3000, 1500),
english = c(100, 1000, 600),
french = c(50, 1000, 500),
spanish = c(25, 100, 100),
german = c(10, 900, 150),
russian = c(15, 1000, 150)
) |>
tibble::rowid_to_column("index")
df
index total_income ten_twenty_k twenty_one_thirty_k total_age age_15_lower age_16_30 age_31_50
1 1 100 90 10 1000 10 900 90
2 2 500 480 20 2000 20 500 1300
3 3 2000 300 1700 3000 50 100 2400
total_language english french spanish german russian
1 200 100 50 25 10 15
2 3000 1000 1000 100 900 1000
3 1500 600 500 100 150 150
I'd like to pivot this data frame based on the columns that start with "total_". Basically, so it'd look something like this.
index group_income count_income group_age count_age group_language count_language
<int> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 1 total_income 100 total_age 1000 total_language 200
2 1 total_income 100 total_age 1000 english 100
3 1 total_income 100 total_age 1000 french 50
4 1 total_income 100 total_age 1000 spanish 25
5 1 total_income 100 total_age 1000 german 10
6 1 total_income 100 total_age 1000 russian 15
7 1 total_income 100 age_15_lower 10 total_language 200
8 1 total_income 100 age_15_lower 10 english 100
9 1 total_income 100 age_15_lower 10 french 50
10 1 total_income 100 age_15_lower 10 spanish 25
Afterwards, I would split the data frame into separate tables based on observational unit (i.e., separate objects/tables for income, age, language) and use distinct()
from dplyr
to remove the duplicate rows caused by level lengths in some variables being longer than others.
I know a tedious way of achieving this for a variable like income is using pivot_longer()
like this:
library(tidyr)
library(dplyr)
df_income <-
df |>
pivot_longer(
cols = 2:4,
names_to = "group_income",
values_to = "count_income"
) |>
pivot_longer(
cols = 2:5,
names_to = "group_age",
values_to = "count_age"
) |>
pivot_longer(
cols = 2:7,
names_to = "group_language",
values_to = "count_language"
) |>
select(
1:3
) |>
distinct(income, count_income)
df_income
index group_income count_income
<int> <chr> <dbl>
1 1 total_income 100
2 1 ten_twenty_k 90
3 1 twenty_one_thirty_k 10
4 2 total_income 500
5 2 ten_twenty_k 480
6 2 twenty_one_thirty_k 20
7 3 total_income 2000
8 3 ten_twenty_k 300
9 3 twenty_one_thirty_k 1700
However, I know there's a more efficient way of doing this using regex within pivot_longer()
. Fortunately, all the variables that need to be pivoted together follow the column that starts with "total_"
. However, I'm not quite sure how to tell R to pivot the columns that follow a "total_"
column, including that "total_"
column itself, but stop when the next "total_"
column appears and do the same thing. Does anybody have suggestions on how to do this pivoting?
Please let me know if I can clarify anything. Thank you!
Since you want as your output three different data frames coming from three different groupings of data with different numbers of columns, it doesn't make a lot of sense to try to pivot first then split the output. You can more easily split your columns into groupings with regular expressions and pivot each inside lapply
library(tidyverse)
split(names(df)[-1], cumsum(grepl("total_", names(df)[-1]))) |>
lapply(function(x) {
df[c("index", x)] |>
pivot_longer(-1) |>
set_names(c("index",
paste0(c("group_", "count_"), gsub("total_", "", x[1]))))
})
This gives you all three of your desired output tables in a list:
#> $`1`
#> # A tibble: 9 x 3
#> index group_income count_income
#> <int> <chr> <dbl>
#> 1 1 total_income 100
#> 2 1 ten_twenty_k 90
#> 3 1 twenty_one_thirty_k 10
#> 4 2 total_income 500
#> 5 2 ten_twenty_k 480
#> 6 2 twenty_one_thirty_k 20
#> 7 3 total_income 2000
#> 8 3 ten_twenty_k 300
#> 9 3 twenty_one_thirty_k 1700
#>
#> $`2`
#> # A tibble: 12 x 3
#> index group_age count_age
#> <int> <chr> <dbl>
#> 1 1 total_age 1000
#> 2 1 age_15_lower 10
#> 3 1 age_16_30 900
#> 4 1 age_31_50 90
#> 5 2 total_age 2000
#> 6 2 age_15_lower 20
#> 7 2 age_16_30 500
#> 8 2 age_31_50 1300
#> 9 3 total_age 3000
#> 10 3 age_15_lower 50
#> 11 3 age_16_30 100
#> 12 3 age_31_50 2400
#>
#> $`3`
#> # A tibble: 18 x 3
#> index group_language count_language
#> <int> <chr> <dbl>
#> 1 1 total_language 200
#> 2 1 english 100
#> 3 1 french 50
#> 4 1 spanish 25
#> 5 1 german 10
#> 6 1 russian 15
#> 7 2 total_language 3000
#> 8 2 english 1000
#> 9 2 french 1000
#> 10 2 spanish 100
#> 11 2 german 900
#> 12 2 russian 1000
#> 13 3 total_language 1500
#> 14 3 english 600
#> 15 3 french 500
#> 16 3 spanish 100
#> 17 3 german 150
#> 18 3 russian 150