r, dataframe, tidyr, data-cleaning

Pad column numbers while using separate_wider_delim in R


I have a data frame in R with a column of strings that I want to split into multiple columns using separate_wider_delim() from the tidyr package.

I want the numeric suffix of each new column name to be zero-padded to two digits (i.e. '01' instead of '1').

However, the new column names in the resulting data frame end in a single-digit number.

Does anyone know how to pad these numbers when using separate_wider_delim()?

Below is example code showing what I have tried so far, along with my desired output.

Code:

library(tidyr)

# Example data
df <- data.frame(
  Group = c("A", "B", "C"),
  fruits_selected = c(
    "'Apple'+'Banana'+'Cherry'",
    "'Peach'+'Banana'+'Apple'",
    "'Orange'+'Banana'+'Cherry'"
  )
)

# Split the strings in the "fruits_selected" column into multiple columns
df2 <- df %>%
  separate_wider_delim(fruits_selected, delim = "+", names_sep = "_")

Current output:

#Current output of the result
print(df2)

#>   Group fruits_selected_1 fruits_selected_2 fruits_selected_3
#>   <chr> <chr>             <chr>             <chr>
#> 1 A     'Apple'           'Banana'          'Cherry'
#> 2 B     'Peach'           'Banana'          'Apple'
#> 3 C     'Orange'          'Banana'          'Cherry'

Desired Output:

print(df2)
#>   Group fruits_selected_01 fruits_selected_02 fruits_selected_03
#>   <chr> <chr>              <chr>              <chr>
#> 1 A     'Apple'            'Banana'           'Cherry'
#> 2 B     'Peach'            'Banana'           'Apple'
#> 3 C     'Orange'           'Banana'           'Cherry'

Thank you so much for your assistance!


Solution

  • You could use the names_repair argument of tidyr::separate_wider_delim() along with a little regular expression magic.

    In this example, sub() performs a single find-and-replace on each column name. It looks for the pattern fruits_selected_(\\d), where () denotes a "capture group" and \\d matches a single digit [0-9]. If the pattern is found, it is replaced by fruits_selected_0\\1, where \\1 inserts whatever was matched by the first (and, in this example, only) capture group.

    library(tidyr)
    
    data.frame(
      Group = c("A","B","C"),
      fruits_selected = c(
        "'Apple'+'Banana'+'Cherry'",
        "'Peach'+'Banana'+'Apple'",
        "'Orange'+'Banana'+'Cherry'"
      )
    ) %>%
      separate_wider_delim(
        fruits_selected, 
        delim = "+", 
        names_sep = "_",
        names_repair = ~ sub("fruits_selected_(\\d)", "fruits_selected_0\\1", .)
      )
    #> # A tibble: 3 × 4
    #>   Group fruits_selected_01 fruits_selected_02 fruits_selected_03
    #>   <chr> <chr>              <chr>              <chr>             
    #> 1 A     'Apple'            'Banana'           'Cherry'          
    #> 2 B     'Peach'            'Banana'           'Apple'           
    #> 3 C     'Orange'           'Banana'           'Cherry'
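
The substitution can also be tried on its own, outside of tidyr, to see what it does to each name. The sketch below uses a made-up vector of column names (nms is an assumption, not part of the answer) and adds a two-digit suffix to show one caveat: the answer's pattern is not anchored, so a name such as fruits_selected_10 would pick up an extra zero. An anchored variant, `_(\\d)$`, is one possible way to pad only names that end in a single digit.

```r
# Hypothetical column names, including one with a two-digit suffix
nms <- c("Group", "fruits_selected_1", "fruits_selected_2", "fruits_selected_10")

# Pattern from the answer: pads the first digit that follows the prefix
unanchored <- sub("fruits_selected_(\\d)", "fruits_selected_0\\1", nms)

# Anchored variant: pads only a single digit at the very end of the name
anchored <- sub("_(\\d)$", "_0\\1", nms)

print(unanchored)
#> [1] "Group"               "fruits_selected_01"  "fruits_selected_02"
#> [4] "fruits_selected_010"
print(anchored)
#> [1] "Group"              "fruits_selected_01" "fruits_selected_02"
#> [4] "fruits_selected_10"
```

With nine or fewer new columns, as in the question, both patterns give the same result.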
    

    Created on 2024-07-12 with reprex v2.1.0.9000
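
If the number of new columns or the padding width may vary, the suffix could instead be padded numerically. The following is a minimal base R sketch, not part of the original answer: pad_suffix is a hypothetical helper name, and it assumes the number to pad is whatever follows the last underscore in a name.

```r
# Hypothetical helper: zero-pad a trailing "_<number>" in each name to `width` digits
pad_suffix <- function(nms, width = 2) {
  # Only touch names that end in an underscore followed by digits
  has_num <- grepl("_\\d+$", nms)
  # Split those names into the stem and the trailing number
  stem <- sub("_\\d+$", "", nms[has_num])
  num <- as.integer(sub(".*_(\\d+)$", "\\1", nms[has_num]))
  # Rebuild each name with the number left-padded with zeros
  nms[has_num] <- paste0(stem, "_", formatC(num, width = width, flag = "0"))
  nms
}

nms <- c("Group", "fruits_selected_1", "fruits_selected_10")
print(pad_suffix(nms))
#> [1] "Group"              "fruits_selected_01" "fruits_selected_10"
print(pad_suffix(nms, width = 3))
#> [1] "Group"               "fruits_selected_001" "fruits_selected_010"
```

Because names_repair accepts a function, passing names_repair = pad_suffix to separate_wider_delim() in place of the formula should give the same padded names, although that combination was not run as part of this sketch.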
