I have a data frame which I need to put into long format. Here is a simplified version of my data frame:
df <- data.frame("UserID" = c(1:3), "ProdID" = c(1, 1, 2), "A_Q_FaultID_C_1" = c(3, 2, 4), "A_Q_FaultID_C_2" = c(8, 1, 2), "A_Q_FaultID_C_3" = c(4, 3, 9), "A_Q_FaultYear_C_1" = c(1999, 2018, 2012), "A_Q_FaultYear_C_2" = c(2002, 2008, 2024), "A_Q_FaultYear_C_3" = c(2015, 2015, 1997))
The idea is to end up with a data frame that looks like this by pivoting longer based on the variable names with the User and Prod IDs as keys:
df <- data.frame("UserID" = c(1:3, 1:3, 1:3), "ProdID" = c(1, 1, 2, 1, 1, 2, 1, 1, 2), "FaultID" = c(3, 2, 4, 8, 1, 2, 4, 3, 9), "FaultYear" = c(1999, 2018, 2012, 2002, 2008, 2024, 2015, 2015, 1997))
In reality, the data frame has many more rows, and many more columns within each pivot group, but the general naming convention for the columns I want to pivot longer is the same, i.e. they begin with "A_Q_" and end with "C[numeric string]". Note that the numeric string isn't necessarily just a single digit.
Here is my attempt using the tidyr pivot_longer function:
df <- df %>%
pivot_longer(
cols = 3:8,
names_to = c("FaultID", "FaultYear"),
names_pattern = "[a-zA-Z_]+"
)
This gives the error: "regex
should define 2 groups; 0 found.", however I tested this regex using the column names and str_match and it extracts everything except the trailing digits, uniquely identifying two groups:
str_match(names(df)[3:8], "[a-zA-Z_]+")
[,1]
[1,] "A_Q_FaultID_C_"
[2,] "A_Q_FaultID_C_"
[3,] "A_Q_FaultID_C_"
[4,] "A_Q_FaultYear_C_"
[5,] "A_Q_FaultYear_C_"
[6,] "A_Q_FaultYear_C_"
What am I doing wrong?
For this situation, it's easier to use names_to = ".value"
, where ".value"
is a special "sentinel" that will pivot to multiple columns based on names extracted from cols
. Then, in names_pattern
, you need to include a regex pattern with a capture group (i.e., "()"
) that extracts the portion of the column names that defines the new columns.
df %>%
pivot_longer(
cols = A_Q_FaultID_C_1:A_Q_FaultYear_C_3,
names_to = ".value",
names_pattern = "(Fault[a-zA-Z]+)"
)
# # A tibble: 9 × 4
# UserID ProdID FaultID FaultYear
# <int> <dbl> <dbl> <dbl>
# 1 1 1 3 1999
# 2 1 1 8 2002
# 3 1 1 4 2015
# 4 2 1 2 2018
# 5 2 1 1 2008
# 6 2 1 3 2015
# 7 3 2 4 2012
# 8 3 2 2 2024
# 9 3 2 9 1997
PS - I strongly recommend using a tidyselect helper for the cols
argument (for example, A_Q_FaultID_C_1:A_Q_FaultYear_C_3
or starts_with("A_Q_")
) instead of a numeric range like 3:8
. The latter is more error-prone (e.g., if the number or order of columns changes, or just from miscounting).