rstringtexttidyversesplitstackshape

Transforming string column to specific data.frame


Desired Output

Need the following output

df2 <-
  data.frame(
    v1 = c(1100001, 1100002, 1100003, 1100004, 1100005)
  , v2 = c("A R", "W R", "A K", "M", "A C")
  , v3 = c("P", "G P", "G P", "P", "P")
  , v4 = c(110, 161, 129, 132, "Absent")
  , v5 = c(55, 80.5, 64.5, 66,  "Absent")
    )
df2

       v1  v2  v3     v4     v5
1 1100001 A R   P    110     55
2 1100002 W R G P    161   80.5
3 1100003 A K G P    129   64.5
4 1100004   M   P    132     66
5 1100005 A C   P Absent Absent

Here is my original data.frame.

df1 <-
  structure(list(value = c(
"1100001     A R                P             110    55", 
"1100002     W R                 G P 161    80.5", 
"1100003     A K                  G P 129    64.5", 
"1100004     M                      P             132    66",
"1100005     A C                     P             Absent    Absent"
)), row.names = c(NA, -5L), class = c("data.frame")
)

df1

                                                              value
1            1100001     A R                P             110    55
2                   1100002     W R                 G P 161    80.5
3                  1100003     A K                  G P 129    64.5
4        1100004     M                      P             132    66
5 1100005     A C                     P             Absent    Absent

Used cSplit function from splitstackshape, but could not manage to get the required output. Any points.

library(splitstackshape)
cSplit(indt = df1, splitCols = "value", sep = " ")

   value_1 value_2 value_3 value_4 value_5 value_6 value_7
1: 1100001       A       R       P     110      55      NA
2: 1100002       W       R       G       P     161    80.5
3: 1100003       A       K       G       P     129    64.5
4: 1100004       M       P     132      66      NA      NA
5: 1100005       A       P  Absent  Absent      NA      NA

Solution

  • 1.Assuming the rule for separation are: a) more than one space b) numbers follwing a letter and space are to be split

    2.We make sure all soon to be varibles are separated by more than one space (using gsub to replace one space between a letter and a number with two spaces)

    3.Then we use tidyr::separate to separate the string into variables using two or more spaces as deliminiator

    library(dplyr)
    library(tidyr)
    df1 %>% 
      mutate(value = gsub("([A-z])( )([0-9])", "\\1  \\3", value)) %>% 
      separate(value, c(paste0("v", 1:5)),"[ ]{2,}")
    

    Returns:

           v1  v2  v3     v4     v5
    1 1100001 A R   P    110     55
    2 1100002 W R G P    161   80.5
    3 1100003 A K G P    129   64.5
    4 1100004   M   P    132     66
    5 1100005   A   P Absent Absent
    

    Edit

    Regarding the new constraint that was not included in the original example (only one space between a number and a letter):

    Suggested solution:

    We just repeat the command for adding an extra space with "reversed" regex. So that any single space between a number and a letter will get an extra space and will in then be separated by the separate call

    df1 %>% 
      mutate(value = gsub("([A-z])( )([0-9])", "\\1  \\3", value)) %>% 
      mutate(value = gsub("([0-9])( )([A-z])", "\\1  \\3", value)) %>% 
      separate(value, c(paste0("v", 1:5)),"[ ]{2,}")