r regex string-substitution

Remove all punctuation AND the values after it at end of string in R


I have an ID variable that comes from 35 different hospitals, so its format varies, and sometimes the same root ID number appears with a secondary line number, e.g. -1, /a, _1 etc.

I want to remove the punctuation, and whatever comes after that punctuation, leaving just the root ID number.

So far I have written an individual line of code for each variation, but I was wondering whether there is a more elegant way, so that when next year's data comes in I don't need to check for new arrangements.

On someone else's question I found a way to remove brackets and all the text within them, but I can't figure out how to adapt it for my purposes:

df$patid<- gsub("\\s*\\([^\\)]+\\)","",df$patid)

I tried these two lines without success:

df$patid<- gsub("\\[:punct:]s*$","", df$patid)
df$patid<- gsub("\\[:alnum:]s*$","", df$patid)

I also tried the clean function, which removed all the punctuation, but kept the numbers/characters after them, so that wasn't it.

Example of my current code (not all possible variations). These do work:

df$patid<- gsub("\\-1$", "", df$patid)
df$patid<- gsub("\\-2$", "", df$patid)
df$patid<- gsub("\\-3$", "", df$patid)
df$patid<- gsub("\\-a$", "", df$patid)
df$patid<- gsub("\\-A$", "", df$patid)
df$patid<- gsub("\\-b$", "", df$patid)
df$patid<- gsub("\\-B$", "", df$patid)
df$patid<- gsub("\\b", "", df$patid)
df$patid<- gsub("\\/dd", "", df$patid)

I'm not tied to gsub and am open to different methods.

Example of ID numbers

patid<- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")

Apologies if this has been answered somewhere already.


Solution

  • A literal regex for what you described would be:

    [[:punct:]][^[:punct:]]*$
    

    This matches the last punctuation character in the string, together with everything that follows it up to the end of the string. Note that an ID whose root itself contains punctuation and has no secondary suffix, such as "MB-13-169454", is cut at its last hyphen as well.

    patid <- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")
    output <- sub("[[:punct:]][^[:punct:]]*$", "", patid)
    output
    
     [1] "MB-13"        "MB-13"        "MB-13-212235" "MB-13-212235" "MB-13"       
     [6] "570548260"    "570548260"    "1458629P"     "1139093D"     "8253015N"    
    [11] "8253015N"     "M255858"      "M255858"      "8494392Q"     "9296741B"    
    [16] "04152341421"  "04152341421"  "04152640475"  "04152821164"  "G140381883"  
    [21] "G140381883"   "G140880774"   "G140880774"
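
Because the pattern above cuts at the last punctuation character, an ID such as "MB-13-169454", which has no secondary line number, is shortened to "MB-13". If the secondary part is always short, one possible variant is to remove a punctuation character only when it is followed by one or two letters or digits at the very end of the string. The sketch below assumes that limit of two characters; the helper name `strip_short_suffix` and its `max_len` argument are made up for illustration, and the limit may need adjusting for other hospitals' formats.

```r
# A subset of the sample IDs from the question.
patid <- c("MB-13-169454", "MB-13-212235.1", "570548260-2", "8253015N/2",
           "04152341421/A", "G140381883_1")

# Pattern from the answer above: cuts at the LAST punctuation character,
# so "MB-13-169454" loses its final block of digits too.
cut_at_last_punct <- sub("[[:punct:]][^[:punct:]]*$", "", patid)

# Hypothetical helper: drop a punctuation character only when it is followed
# by 1 to max_len letters/digits at the end of the string.
# ASSUMPTION: secondary line numbers are never longer than max_len characters.
strip_short_suffix <- function(x, max_len = 2) {
  pattern <- sprintf("[[:punct:]][[:alnum:]]{1,%d}$", max_len)
  sub(pattern, "", x)
}

root_id <- strip_short_suffix(patid)

# Show the original IDs next to both results for comparison.
print(data.frame(patid, cut_at_last_punct, root_id))
```

With this variant "MB-13-169454" is left unchanged, while suffixes such as ".1", "-2", "/A" and "_1" are removed. As for the attempts in the question: in "\\[:punct:]s*$" the backslash escapes the opening bracket, so the pattern looks for the literal text [:punct:]; a POSIX class only works inside a bracket expression, i.e. [[:punct:]].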