rdataframevariablesdplyrsplit

How to split a column into two columns to separate two values in a cell when some cells don't contain two values to split using dplyr in R


Issue

I want to split a column named 'Date' in my data frame into two columns called 'Date' and 'Survey_Number'

The column 'Date' either contains one date '08/04/2012' or a date and survey number '24/04/2024 [S2]' (see below).

I can split the cell using dplyr (code below); however, some cells don't contain two values to split, and I'm finding the date and survey number tend to be altered. Additionally, if only one survey took place that day, the survey number associated with this date would be zero in the new column 'Survey_Number'.

I've tried using the separate function with dplyr and the str_split_fixed function with stringr.

What can I try next?

R-Code

#Split the cell 'Date'to create two columns called 'Date' and 'Survey Number'

library(dplyr)
library(tidyr)

# Split name column into Date and Survey_Number

Df <- Df %>% separate(Date, c('Date', 'Survey_Number'))
Df

Altered Dates and Survey Numbers

  Df$Survey_Number
      [1] "01" "02" "02" "02" "02" "02" "02" "02" "02" "03" "04" "04" "04" "04" "04" "04" "04" "04" "04" "04" "04" "04" "04" "04" "04" "04" "04"
     [28] "05" "05" "05" "05" "05" "05" "05" "05" "05" "05" "05" "05" "05" "05" "05" 

  Df$Date
  [1] "08" "06" "07" "12" "13" "14" "15" "20" "22" "23" "08" "15" "15" "16" "16" "20" "21" "21" "23" "24" "24" "27" "28" "29" "29" "29" "30"
 [28] "02" "03" "03" "04" "04" "05" "05" "05" "08" "09" "10" "13" "23" "26" "26" "27" 

Expected Output

enter image description here

Data frame

 structure(list(Survey_No = 1:20, Date = c("08/01/2012", "06/02/2012", 
    "07/02/2012", "12/02/2012", "13/02/2012", "14/02/2012", "15/02/2012", 
    "20/02/2012", "22/02/2012", "23/03/2012", "08/04/2012", "15/04/2012 [S1]", 
    "15/04/2012 [S2]", "16/04/2012 [S1]", "16/04/2012 [S2]", "20/04/2012", 
    "21/04/2012 [S1]", "21/04/2012 [S2]", "23/04/2012", "24/04/2024 [S1]"
    )), row.names = c(NA, -20L), class = "data.frame")

Solution

  • You can do it with gsub() which will search/replace exactly the pattern in your text you tell it to look for.

    Here's an example:

    df <- data.frame(Survey_No=1:20,Date=as.factor(c('08/01/2012','06/02/2012','07/02/2012',
              '12/02/2012','13/02/2012','14/02/2012','15/02/2012','20/02/2012','22/02/2012',
              '23/03/2012','08/04/2012','15/04/2012 [S1]','15/04/2012 [S2]','16/04/2012 [S1]',
              '16/04/2012 [S2]','20/04/2012','21/04/2012 [S1]','21/04/2012 [S2]','23/04/2012',
              '24/04/2024 [S1]')))
    
    df$Date <- as.character(df$Date) # Later steps will treat date as a character, convert it from a factor
    df$date_only <- gsub('\\s.*','',df$Date)
    df$survnum_only <- ifelse(grepl('\\s',df$Date),gsub('.*\\s','',df$Date),NA)
    

    The line df$date_only <- gsub('\\s.*','',df$Date) will match a space (\\s) and any characters that come after it (.*) and replace them with nothing, giving you a date-only column.

    The line df$survnum_only <- ifelse(grepl('\\s',df$Date),gsub('.*\\s','',df$Date),NA) first checks if df$Date contains a space (grepl('\\s',df$Date)) -- if it doesn't, it sets survnum_only to NA. In rows where df$Date does contain a space, it uses gsub('.*\\s','',df$Date) to remove any characters before the space plus the space itself (.*\\s).

    Before doing any of that, it converts the Date variable from a factor (the data type it had in your original example) to a character variable, to make sure that functions that expect a character variable will be able to process it correctly.