rstringdata-cleaning

Remove middle name and initial from a column and save in a separate column in R


I have a column of names; some have middle names or middle initials. I want to remove those middle initials from the fullname column and create a new column next to this column to store these middle names/initials.

Based on my research this post provides some solutions to remove the middle names. How do I move those middle names/initials to a new column?

The following is a 40 row sample of my data using dput() The data has 43 million rows and is approximately 3.8 GB.

data<- structure(list(id = c(439116595, 439317458, 439373574, 439434694, 
439508848, 439632143, 439778306, 439917155, 440009485, 440147556, 
440207880, 441479247, 442115059, 441569787, 438192228, 438215998, 
438307365, 438317476, 438389110, 438409963, 438479736, 438509859, 
438634859, 438662407, 438764944, 438846700, 438884094, 438954147, 
439227370, 439243020, 439248564, 439272667, 439357884, 439403127, 
439446363, 439511276, 439546441, 439586141, 439804213, 439862550, 
439889286), fullname = c("Shawn Chase", "Steven Hofer", "Stephen Paradise", 
"Shengho Yang", "Nelson Carvalho", "RICK GILLICK", "Marie Jhoanne Morilla", 
"Sanjay Kulkarni", "Sam Bunn", "Iran Murphy", "Kathryn Cutler", 
"Diane Sik", "Donna Yee", "Christine Coltrain", "Maher Dakkouri", 
"Ray Perl", "Abid Khalil", "Ian Crombie", "Allen Carr", "Daniel Angeline", 
"Jimmy Tan", "Thierry LAMBERT", "Diene Faye", "Greg Greene", 
"Laura Holsopple", "Roberta Minkus", "Bridget Chenette", "Joshua Polite", 
"John Liberty", "David Smith", "Igor Baratta", "Pierre Schmitz", 
"alejandra salvanes", "Malcolm K Knight", "Xiaoyan Hu", "Joe Pawl", 
"Bryan Armstrong", "Christina Spezio", "Robert Gibson", "Peter Head", 
"Mike Russo"), degree = c("Bachelor", "Bachelor", "MBA", 
"", "", "", "", "Master", "", "MBA", "", "", "Bachelor", "Doctor", 
"Bachelor", "Master", "Master", "", "", "", "", "", "", "Bachelor", 
"Master", "", "Master", "", "Bachelor", "Master", "", "", "", 
"", "Bachelor", "Associate", "", "Bachelor", "", "", "Associate"
)), row.names = 20:60, class = "data.frame")

I have tried the following on a small subset of my data. However, there was an error message:

Warning message: In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing

str_match(data, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]

Update (and solution):

I was able to edit the above approach with the help of @Ben Bolker to the following which worked very fast! This approach removes all the middle words, for example if someone has two middle names or if there are unusual cases. This works well for my case.

 names<- stringr::str_match(data$fullname, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]  
 names<-data.frame(no_mid_name = paste(names[,1], names[,3]),
                mid_name = names[,2])

I then used cbind() to add the resulting columns to my original dataset.


Solution

  • The OP's str_match solution is actually slightly faster than @ThomasIsCoding's (perhaps because it avoids doing the regular expression search twice). The full problem is about a million times larger than the example given here (43 million vs. 41 rows); since an iteration only takes 200 µs, we would expect the whole thing to be done in 200 seconds assuming linear scaling in the number of rows (and that I've done the arithmetic correctly). Therefore, there might be something else going on ... one probably needs to (1) do some experiments to see how these solutions scale with increasing problem size and (2) possibly be more careful with memory management etc., e.g. look for a solution using data.table ...

    thomas <- function(data) with(data,
          data.frame(no_mid_name = sub("^(\\w+).*?(\\w+)$", "\\1 \\2", fullname),
                     mid_name = trimws(sub("^\\w+(.*?)\\w+$", "\\1", fullname))))
    
    OP <- function(data) {
        ss <- stringr::str_match(data$fullname, '^(\\S+)\\s*(.*?)\\s*(\\S+)$')[,-1]
        data.frame(no_mid_name = paste(ss[,1], ss[,3]),
                   mid_name = ss[,2])
    }
    
    bench::mark(thomas(data), OP(data))  |>
     dplyr::select(expression, median, `itr/sec`)
    
    ##   expression   median `itr/sec`
    ##  <bch:expr> <bch:tm>     <dbl>
    ## 1 thomas(dd)    289µs     3351.
    ## 2 OP(dd)        187µs     5266.