r data-cleaning stringr countries

Replace multiple values using a reference table


I’m cleaning a database. One of the fields is “country”, but the country names in my database do not match the output I need.

I thought of using the str_replace function, but I have over 50 countries that need to be fixed, so it’s not the most efficient way. I have already prepared a CSV file with the original country input and the output I need, for reference.

Here is what I have so far:

library(stringr)
library(dplyr)
library(tidyr)
library(readxl)
database1 <- read_excel("database.xlsx")
# One str_replace() call per country that needs fixing
database1$country <- str_replace(database1$country, "USA", "United States")
database1$country <- str_replace(database1$country, "UK", "United Kingdom")
database1$country <- str_replace(database1$country, "Bolivia", "Bolivia,Plurinational State of")
write.csv(database1, "test.csv", row.names = FALSE, fileEncoding = "UTF-8", na = "")

Solution

  • Note: the values in levels must be unique (no duplicates). Duplicated labels are accepted since R 3.4.0 and merge the corresponding levels. Any country that is not listed in levels becomes NA.

    # database1 <- read_excel("database.xlsx")  ## read database excel book
    old_names <- c("USA", "UGA", "CHL") ## country abbreviations
    new_names <- c("United States", "Uganda", "Chile")  ## country full form
    

    base R

    database1 <- within( database1, country <- factor( country, levels = old_names, labels = new_names ))
    

    data.table

    library('data.table')
    setDT(database1)
    database1[, country := factor(country, levels = old_names, labels = new_names)]
    
    database1
    #          country
    # 1: United States
    # 2:        Uganda
    # 3:         Chile
    # 4: United States
    # 5:        Uganda
    # 6:         Chile
    # 7: United States
    # 8:        Uganda
    # 9:         Chile
    

    Data

    database1 <- data.frame(country = c("USA", "UGA", "CHL", "USA", "UGA", "CHL", "USA", "UGA", "CHL"))
    #    country
    # 1     USA
    # 2     UGA
    # 3     CHL
    # 4     USA
    # 5     UGA
    # 6     CHL
    # 7     USA
    # 8     UGA
    # 9     CHL
    

    EDIT: You can create a single named vector, countries, instead of two separate variables such as old_names and new_names.

    countries <- c("USA", "UGA", "CHL")
    names(countries) <- c("United States", "Uganda", "Chile")
    within( database1, country <- factor( country, levels = countries, labels = names(countries) ))
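
Since the question mentions a prepared CSV reference table, the same idea can be sketched as a lookup with base R's match(). This is a minimal sketch, not part of the original answer. The lookup data frame stands in for the CSV, and its column names (original, output) and the file name in the comment are assumptions. Unlike the factor() approach, countries missing from the reference table are left unchanged instead of becoming NA.

```r
# Hypothetical reference table; in practice it might be loaded with
# read.csv("country_lookup.csv", stringsAsFactors = FALSE) (assumed file name).
lookup <- data.frame(
  original = c("USA", "UK", "Bolivia"),
  output   = c("United States", "United Kingdom", "Bolivia,Plurinational State of"),
  stringsAsFactors = FALSE
)

# Small example data set; "Chile" and NA have no entry in the lookup table.
database1 <- data.frame(
  country = c("USA", "UK", "Bolivia", "Chile", NA),
  stringsAsFactors = FALSE
)

# match() returns, for each country, its row position in the lookup table,
# or NA when the country is not listed there.
idx <- match(database1$country, lookup$original)

# Replace only the values that were found; keep all others as they are.
database1$country <- ifelse(is.na(idx), database1$country, lookup$output[idx])

print(database1)
```

Because match() compares whole strings, a value such as "UK" never touches a longer name that happens to contain those letters, which str_replace() with a plain pattern could do.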