rtidyr

pivot_longer with many variables and inconsistent names


I'm working on a large wide dataset and I need to convert it to several datasets in long format. How can I use pivot_longer to do this? The problem is that the variable names include counts at the end of their names, but in an inconsistent way (and the sets of variables do not have the same length). E.g. sometimes counts range from *01 to *88, sometimes counts range from *1 to *26, and sometimes the *1 is completely left out.

I'll include a script that generates a strongly simplified version of what my data looks like below. In reality, it includes several thousands of variables, with a few more inconsistencies.

library(tidyr)
set.seed(1234)
m <- matrix(sample(1:10, 75, replace = T), 5, 15)
id <- 1:5
d <- cbind(id, as.data.frame(m))
names(d) <- c("id", "X1_hg", "X1_hg2", "X1_hg3", "X1_hg4", "cuphg_m01", 
              "cuphg_m02", "cuphg_m03", "cuphg_m04", "cuphg_m05", "cuphg_v01",
              "cuphg_v02", "cuphg_v03", "cuphg_v04", "cuphg_v05", "cuphg_v06")

The result I'm looking for is a long data frame with colums: id, index (or something similar), X1_hg, cuphg_m, cuphg_v.

Any help is much appreciated!

I tried to search for a way to explicitly tell pivot_longer what sets of variables to combine from the wide data, preferably with some kind of efficient way to address them (e.g. X1:X10 or starts_with("X")). However, I was unable to make that work.

Maybe there is a way of renaming the variables efficiently, and then use pivot_longer on the new variable names with consistent counts, but I also wouldn't know how to accomplish that (other than renaming them one at a time, which is highly undesirable).

I have reviewed other questions on SO on pivot_longer with multiple variables, but I don't see a way to make solutions with names_prefix, names_sep, or names_pattern work.


Solution

  • library(tidyverse)
    d %>%
       rename(X1_hg1 = X1_hg) %>%
       pivot_longer(-id, names_to = c(".value", "index"),
                    names_pattern = "(.*?)0?(\\d+)$")
    
    # A tibble: 30 × 5
          id index X1_hg  cuphg_m  cuphg_v
       <int> <chr> <int>    <int>    <int>
     1     1 1        10        3        9
     2     1 2         6        8       10
     3     1 3        10        3        6
     4     1 4         4        2        8
     5     1 5        NA        8        5
     6     1 6        NA       NA        4
     7     2 1         6        4        8
     8     2 2         4        4        6
     9     2 3         6        6        9
    10     2 4         5        5        9
    # ℹ 20 more rows
    # ℹ Use `print(n = ...)` to see more rows