r dataframe strsplit

Conditionally split string cells in a data frame in R


I have a dataset of species names in which some of the names originally used are now obsolete. Those cells are recorded as "old_species***retired***use new_species", whereas cells with a current name contain just "new_species".

Here is a sample of the data:

df<- data.frame(species=c("Etheostoma spectabile","Ictalurus furcatus","Micropterus salmoides","Micropterus salmoides","Ictalurus punctatus","Ictalurus punctatus","Ictalurus punctatus","Micropterus salmoides","Etheostoma olmstedi","Noturus insignis","Lepomis auritus","Lepomis auritus","Nocomis leptocephalus","Scartomyzon rupiscartes***retired***use Moxostoma rupiscartes","Lepomis cyanellus","Notropis chlorocephalus","Scartomyzon cervinus***retired***use Moxostoma cervinum","Ictalurus punctatus","Lythrurus ardens","Moxostoma pappillosum","Micropterus salmoides","Micropterus salmoides","Ictalurus punctatus"))

I have tried

sapply(strsplit(df$species, split = '***retired***use', fixed = TRUE), function(x) x[2])

but the cells in which the name is already correct return NA, because they do not contain the split string.

Is there a way to make the split just for the cells actually containing it?


Solution

  • You can replace the old names with the new names using gsub with a backreference:

    gsub(".*\\*\\*\\*retired\\*\\*\\*use\\s(.*)", "\\1", df$species)
    
    # [1] "Etheostoma spectabile"   "Ictalurus furcatus"      "Micropterus salmoides"   "Micropterus salmoides"  
    # [5] "Ictalurus punctatus"     "Ictalurus punctatus"     "Ictalurus punctatus"     "Micropterus salmoides"  
    # [9] "Etheostoma olmstedi"     "Noturus insignis"        "Lepomis auritus"         "Lepomis auritus"        
    # [13] "Nocomis leptocephalus"   "Moxostoma rupiscartes"   "Lepomis cyanellus"       "Notropis chlorocephalus"
    # [17] "Moxostoma cervinum"      "Ictalurus punctatus"     "Lythrurus ardens"        "Moxostoma pappillosum"  
    # [21] "Micropterus salmoides"   "Micropterus salmoides"   "Ictalurus punctatus" 
    

    Explanation:

    .* any character, any number of times, followed by ...

    \\*\\*\\*retired\\*\\*\\*use\\s ... the literal pattern ***retired***use and one whitespace character, followed by ...

    (.*) ... any character, any number of times; this is the capturing group that the backreference \\1 in the replacement argument of gsub refers to. Strings that do not contain the pattern do not match, so gsub returns them unchanged.
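
If you would rather keep strsplit, as in the question, the following is a minimal sketch of two possible variants. It assumes that every cell is a non-empty string and contains the marker at most once. The names species, marker, cleaned_split and cleaned_cond are made up for this illustration and do not come from the original post.

```r
# Small sample with the same structure as the data in the question
species <- c("Etheostoma spectabile",
             "Scartomyzon rupiscartes***retired***use Moxostoma rupiscartes",
             "Scartomyzon cervinus***retired***use Moxostoma cervinum")

# Literal marker; it is always matched with fixed = TRUE, so no escaping is needed
marker <- "***retired***use"

# Variant 1: split every cell and keep the LAST piece instead of the second.
# Cells without the marker produce a single piece, which is kept as it is.
parts <- strsplit(species, split = marker, fixed = TRUE)
cleaned_split <- trimws(vapply(parts, function(x) x[length(x)], character(1)))

# Variant 2: split only the cells that actually contain the marker.
has_marker <- grepl(marker, species, fixed = TRUE)
cleaned_cond <- species
cleaned_cond[has_marker] <- trimws(
  vapply(strsplit(species[has_marker], split = marker, fixed = TRUE),
         function(x) x[2], character(1))
)

print(cleaned_split)
print(cleaned_cond)
```

Both variants use trimws to drop the space that follows the marker. With the data from the question, the result could be written back with something like df$species <- cleaned_cond after replacing species above by as.character(df$species).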