I have an ID variable that comes from 35 different hospitals, so its format varies, and sometimes the same root ID number carries a secondary line number, e.g. -1, /a, _1 etc.
I want to remove the punctuation, and whatever comes after that punctuation, leaving just the root ID number.
I have currently managed to write out individual lines of code for each different iteration, but I was wondering if there was a more elegant way so that next year when the data comes in I don't need to check for different arrangements?
On someone else's question I found a way to remove brackets and all the text within them, but I can't figure out how to adapt it for my purposes:
df$patid<- gsub("\\s*\\([^\\)]+\\)","",df$patid)
I tried these two patterns without success:
df$patid<- gsub("\\[:punct:]s*$","", df$patid)
df$patid<- gsub("\\[:alnum:]s*$","", df$patid)
I also tried the clean function, which removed all the punctuation but kept the numbers/characters after it, so that wasn't it.
Example of my current code (not all possible iterations). These do work:
df$patid<- gsub("\\-1$", "", df$patid)
df$patid<- gsub("\\-2$", "", df$patid)
df$patid<- gsub("\\-3$", "", df$patid)
df$patid<- gsub("\\-a$", "", df$patid)
df$patid<- gsub("\\-A$", "", df$patid)
df$patid<- gsub("\\-b$", "", df$patid)
df$patid<- gsub("\\-B$", "", df$patid)
df$patid<- gsub("\\b", "", df$patid)
df$patid<- gsub("\\/dd", "", df$patid)
I am not tied to gsub and am open to different methods.
Example of ID numbers:
patid<- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")
Apologies if this has been answered somewhere already.
A literal regex for what you described would be:
[[:punct:]][^[:punct:]]*$
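This matches the last punctuation character in the string, plus everything that follows it, up to the end of the string. Note that it does not check whether that final segment is a line-number suffix, so an ID with no suffix, such as "MB-13-169454", is cut back to "MB-13" as well.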
patid <- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")
output <- sub("[[:punct:]][^[:punct:]]*$", "", patid)
output
[1] "MB-13" "MB-13" "MB-13-212235" "MB-13-212235" "MB-13"
[6] "570548260" "570548260" "1458629P" "1139093D" "8253015N"
[11] "8253015N" "M255858" "M255858" "8494392Q" "9296741B"
[16] "04152341421" "04152341421" "04152640475" "04152821164" "G140381883"
[21] "G140381883" "G140880774" "G140880774"
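If IDs that contain punctuation but no line-number suffix need to stay intact, one possible variation is to strip only a short trailing suffix. The sketch below assumes that a line-number suffix is always a single punctuation character followed by at most two letters or digits. That limit is a guess based on the sample IDs, and the helper name strip_suffix and its max_len argument are hypothetical, not from the question.

```r
# Shortened sample of the IDs from the question
patid <- c("MB-13-169454", "MB-13-212235.1", "570548260-2",
           "8253015N/3", "04152341421/A", "G140381883_1")

# Hypothetical helper: remove one punctuation character followed by
# 1..max_len alphanumeric characters, but only at the end of the string.
# ASSUMPTION: real suffixes are never longer than max_len characters.
strip_suffix <- function(x, max_len = 2) {
  pattern <- sprintf("[[:punct:]][[:alnum:]]{1,%d}$", max_len)
  sub(pattern, "", x)
}

root <- strip_suffix(patid)
root
# [1] "MB-13-169454" "MB-13-212235" "570548260"    "8253015N"
# [5] "04152341421"  "G140381883"
```

The trade-off is that a root ID which itself ends in a short punctuation-delimited segment (for example a made-up "AB-12") would still be truncated, so it is worth checking each year's new data against the result, e.g. with table(nchar(root)).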