I would like to extract information from one column and substitute it in another existing column. So this is what happened. I have a variable that identifies country_year, that I divided into two columns, country and year. So, just for example:
id Country year
AUS_1999 AUS 1999
CAN_1999 CAN 1999
AUS_2000 AUS 2000
CAN_2000 CAN 2000
BELS1999 BELS1999 NA
In the example, notice that the fifth observation was not separated by an "_", because the code that I used to separate the id column. It ended up with a missing value in the year column and the country column is also wrong. There are a few of these observations in my data frame. How can I correct it for all these observations, by extracting information from the id column and adding to existing columns that I already created (country and year)?
I tried to be as clear as possible, let me know if you need more information.
We could use a regex lookaround to separate
the column 'id'
library(dplyr)
library(tidyr)
df1 %>%
separate(id, into = c("Country", "year"),
sep = "_|(?<=[A-Z])(?=\\d)", remove = FALSE)
-output
id Country year
1 AUS_1999 AUS 1999
2 CAN_1999 CAN 1999
3 AUS_2000 AUS 2000
4 CAN_2000 CAN 2000
5 BELS1999 BELS 1999
Or with extract
df1 %>%
extract(id, into = c("Country", "year"), "^([A-Z]+)_?(\\d+)", remove = FALSE)
id Country year
1 AUS_1999 AUS 1999
2 CAN_1999 CAN 1999
3 AUS_2000 AUS 2000
4 CAN_2000 CAN 2000
5 BELS1999 BELS 1999
Or in base R
, insert a _
where there are none between the uppercase letter and a digit to read it with read.table
into two columns
cbind(df1, read.table(text = sub("([A-Z])(\\d)", "\\1_\\2", df1$id),
header = FALSE, sep = "_", col.names = c("Country", "year")))
-output
id Country year
1 AUS_1999 AUS 1999
2 CAN_1999 CAN 1999
3 AUS_2000 AUS 2000
4 CAN_2000 CAN 2000
5 BELS1999 BELS 1999
df1 <- structure(list(id = c("AUS_1999", "CAN_1999", "AUS_2000", "CAN_2000",
"BELS1999")), row.names = c(NA, -5L), class = "data.frame")