rregextidyrlong-format-data

names_pattern in tidyr pivot_longer for multiple variables nested in the column names


I have a data frame which I need to put into long format. Here is a simplified version of my data frame:

df <- data.frame("UserID" = c(1:3), "ProdID" = c(1, 1, 2), "A_Q_FaultID_C_1" = c(3, 2, 4), "A_Q_FaultID_C_2" = c(8, 1, 2), "A_Q_FaultID_C_3" = c(4, 3, 9), "A_Q_FaultYear_C_1" = c(1999, 2018, 2012), "A_Q_FaultYear_C_2" = c(2002, 2008, 2024), "A_Q_FaultYear_C_3" = c(2015, 2015, 1997))  

The idea is to end up with a data frame that looks like this by pivoting longer based on the variable names with the User and Prod IDs as keys:

df <- data.frame("UserID" = c(1:3, 1:3, 1:3), "ProdID" = c(1, 1, 2, 1, 1, 2, 1, 1, 2), "FaultID" = c(3, 2, 4, 8, 1, 2, 4, 3, 9), "FaultYear" = c(1999, 2018, 2012, 2002, 2008, 2024, 2015, 2015, 1997))  

In reality, the data frame has many more rows, and many more columns within each pivot group, but the general naming convention for the columns I want to pivot longer is the same, i.e. they begin with "A_Q_" and end with "C[numeric string]". Note that the numeric string isn't necessarily just a single digit.

Here is my attempt using the tidyr pivot_longer function:

df <- df %>%
  pivot_longer(
    cols = 3:8,
    names_to = c("FaultID", "FaultYear"),
    names_pattern = "[a-zA-Z_]+"
  )

This gives the error: "regex should define 2 groups; 0 found.", however I tested this regex using the column names and str_match and it extracts everything except the trailing digits, uniquely identifying two groups:

str_match(names(df)[3:8], "[a-zA-Z_]+")

     [,1]              
[1,] "A_Q_FaultID_C_"  
[2,] "A_Q_FaultID_C_"  
[3,] "A_Q_FaultID_C_"  
[4,] "A_Q_FaultYear_C_"
[5,] "A_Q_FaultYear_C_"
[6,] "A_Q_FaultYear_C_"

What am I doing wrong?


Solution

  • For this situation, it's easier to use names_to = ".value", where ".value" is a special "sentinel" that will pivot to multiple columns based on names extracted from cols. Then, in names_pattern, you need to include a regex pattern with a capture group (i.e., "()") that extracts the portion of the column names that defines the new columns.

    df %>%
      pivot_longer(
        cols = A_Q_FaultID_C_1:A_Q_FaultYear_C_3,
        names_to = ".value",
        names_pattern = "(Fault[a-zA-Z]+)"
      )
    # # A tibble: 9 × 4
    #   UserID ProdID FaultID FaultYear
    #    <int>  <dbl>   <dbl>     <dbl>
    # 1      1      1       3      1999
    # 2      1      1       8      2002
    # 3      1      1       4      2015
    # 4      2      1       2      2018
    # 5      2      1       1      2008
    # 6      2      1       3      2015
    # 7      3      2       4      2012
    # 8      3      2       2      2024
    # 9      3      2       9      1997
    

    PS - I strongly recommend using a tidyselect helper for the cols argument (for example, A_Q_FaultID_C_1:A_Q_FaultYear_C_3 or starts_with("A_Q_")) instead of a numeric range like 3:8. The latter is more error-prone (e.g., if the number or order of columns changes, or just from miscounting).