rdataframeextractcolumnsorting

How can I extract characters from one column and add it into another existing column in R?


I would like to extract information from one column and substitute it in another existing column. So this is what happened. I have a variable that identifies country_year, that I divided into two columns, country and year. So, just for example:

id           Country    year   
AUS_1999     AUS        1999    
CAN_1999     CAN        1999    
AUS_2000     AUS        2000     
CAN_2000     CAN        2000    
BELS1999     BELS1999   NA

In the example, notice that the fifth observation was not separated by an "_", because the code that I used to separate the id column. It ended up with a missing value in the year column and the country column is also wrong. There are a few of these observations in my data frame. How can I correct it for all these observations, by extracting information from the id column and adding to existing columns that I already created (country and year)?

I tried to be as clear as possible, let me know if you need more information.


Solution

  • We could use a regex lookaround to separate the column 'id'

    library(dplyr)
    library(tidyr)
    df1 %>%
       separate(id, into = c("Country", "year"), 
           sep = "_|(?<=[A-Z])(?=\\d)", remove = FALSE)
    

    -output

            id Country year
    1 AUS_1999     AUS 1999
    2 CAN_1999     CAN 1999
    3 AUS_2000     AUS 2000
    4 CAN_2000     CAN 2000
    5 BELS1999    BELS 1999
    

    Or with extract

    df1 %>% 
      extract(id, into = c("Country", "year"), "^([A-Z]+)_?(\\d+)", remove = FALSE)
            id Country year
    1 AUS_1999     AUS 1999
    2 CAN_1999     CAN 1999
    3 AUS_2000     AUS 2000
    4 CAN_2000     CAN 2000
    5 BELS1999    BELS 1999
    

    Or in base R, insert a _ where there are none between the uppercase letter and a digit to read it with read.table into two columns

    cbind(df1, read.table(text = sub("([A-Z])(\\d)", "\\1_\\2", df1$id), 
       header = FALSE, sep = "_", col.names = c("Country", "year")))
    

    -output

            id Country year
    1 AUS_1999     AUS 1999
    2 CAN_1999     CAN 1999
    3 AUS_2000     AUS 2000
    4 CAN_2000     CAN 2000
    5 BELS1999    BELS 1999
    

    data

    df1 <- structure(list(id = c("AUS_1999", "CAN_1999", "AUS_2000", "CAN_2000", 
    "BELS1999")), row.names = c(NA, -5L), class = "data.frame")