
Remove duplicate names while replacing underscores with spaces in R


I have last names (left) and first names (right) separated by a comma.
Among the last names, I often (but not always) have duplicates separated by an underscore. How can I remove the duplicates and, for compound last names, replace the underscore with a space?
First names never have duplicates but always begin with an underscore, which I would like to replace with a space. Compound first names are also separated by an underscore, which I would like to replace with a space as well.
I have thousands of lines and I'm struggling to find a solution.
Thanks for any help.

Input data:

> dat0
                  name_ko
1           BLA_BLA,_BLIM
2 CLO_CLO,_SPITCH_SPLOTCH
3           BAD_BOY,_GOOD  
4      GOOD_BOY,_BAD_GIRL  

Desired output:

> dat1
              name_ok
1           BLA, BLIM
2 CLO, SPITCH SPLOTCH
3       BAD BOY, GOOD
4  GOOD BOY, BAD GIRL

Data:

name_ko <- c(
  "BLA_BLA,_BLIM",
  "CLO_CLO,_SPITCH_SPLOTCH",
  "BAD_BOY,_GOOD",
  "GOOD_BOY,_BAD_GIRL")
dat0 <- data.frame(name_ko)

Solution

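  • You can try two nested gsub() calls: the inner call collapses a word that is immediately followed by an underscore and a copy of itself, and the outer call turns every remaining underscore into a space.

    name_ok <- gsub("_", " ", gsub("(\\b\\w+)_(\\1)", "\\1", name_ko))
    
    "BLA, BLIM"
    "CLO, SPITCH SPLOTCH"
    "BAD BOY, GOOD"
    "GOOD BOY, BAD GIRL"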
    
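The one-liner above targets a single doubled last name. As far as I can tell, it also has no boundary after the back-reference, so a name whose second part merely starts with the first part could be shortened by mistake. Here is a regex-only sketch that collapses longer adjacent runs and only drops whole parts. It assumes perl = TRUE is acceptable; the helper name collapse_repeats and the ANN_ANNA example are made up for illustration and are not from the question.

```r
# Hypothetical helper (not part of the original answer): collapse runs of
# identical name parts, then turn the remaining underscores into spaces.
collapse_repeats <- function(x) {
  # A "part" is a run of characters other than "_" and ",".
  # (?<![^_,]) and (?![^_,]) require a separator or a string edge on both
  # sides, so a part is only dropped when the whole part is repeated.
  pattern <- "(?<![^_,])([^_,]+)(?:_\\1)+(?![^_,])"
  deduped <- gsub(pattern, "\\1", x, perl = TRUE)
  gsub("_", " ", deduped, fixed = TRUE)
}

collapse_repeats(c("BLA_BLA,_BLIM", "BAD_BAD_BAD_BOY_BOY,_GOOD", "ANN_ANNA,_LEE"))
# "BLA, BLIM"  "BAD BOY, GOOD"  "ANN ANNA, LEE"
```

Note that this only removes adjacent repeats; something like BAD_BOY_BAD_BOY is left untouched by this pattern.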

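To get back to a data frame like the dat1 shown in the question, the cleaned vector can be stored as a column. A minimal sketch, reusing the nested gsub() call from the answer; stringsAsFactors = FALSE is my addition so the columns stay character on older R versions.

```r
# Rebuild the example data from the question.
name_ko <- c(
  "BLA_BLA,_BLIM",
  "CLO_CLO,_SPITCH_SPLOTCH",
  "BAD_BOY,_GOOD",
  "GOOD_BOY,_BAD_GIRL")
dat0 <- data.frame(name_ko, stringsAsFactors = FALSE)

# gsub() is vectorised, so the whole column is cleaned in one pass.
dat1 <- data.frame(
  name_ok = gsub("_", " ", gsub("(\\b\\w+)_(\\1)", "\\1", dat0$name_ko)),
  stringsAsFactors = FALSE
)
dat1
```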
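    To handle triplets and longer repeats, as Margusl and zephryl suggested (thank you), split each entry at the comma, keep only the unique parts of the last name, and paste the pieces back together. Because unique() is used, repeats that are not adjacent (as in BAD_BOY_BAD_BOY) are dropped as well.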

    name_ko <- c(
      "BLA_BLA,_BLIM",
      "CLO_CLO,_SPITCH_SPLOTCH",
      "BAD_BOY,_GOOD",
      "GOOD_BOY,_BAD_GIRL",
      "BAD_BAD_BAD_BOY_BOY,_GOOD",
      "BAD_BOY_BAD_BOY,_GOOD"
    )
    
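    name_ok <- sapply(strsplit(name_ko, ","), function(x) {
      # last name: split on "_" and keep each part once, in order of appearance
      last_names <- unique(unlist(strsplit(trimws(x[1]), "_")))
      # first name: underscores to spaces, then trim the leading space so that
      # only the single space from sep = ", " follows the comma
      first_names <- trimws(gsub("_", " ", x[2]))
      paste(paste(last_names, collapse = " "), first_names, sep = ", ")
    })
    
    "BLA, BLIM"
    "CLO, SPITCH SPLOTCH"
    "BAD BOY, GOOD"
    "GOOD BOY, BAD GIRL"
    "BAD BOY, GOOD"
    "BAD BOY, GOOD"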