This is my input:
"Once there was a (so-called) rock. it.,was not! in fact, a big rock."
I need it to output an array that looks like this:
["Once", " ", "there", " ", "was", " ", "a", ",", "so", " ", "called", ",", "rock", ".", "it", ".", "was", " ", "not", ".", "in", " ", "fact", ",", "a", " ", "big", " ", "rock"]
The input has to go through a few rules to produce punctuation like this. The rules are:
spaceDelimiters = " -_"
commaDelimiters = ",():;\""
periodDelimiters = ".!?"
Any character in spaceDelimiters should be replaced with a space, any character in commaDelimiters with a comma, and any character in periodDelimiters with a period. When delimiters from different groups appear next to each other, comma has priority over space, and period has priority over comma.
I got to a point where I was able to remove all of the delimiter characters, but I need them to stay in the array as separate elements, and I need the hierarchy to apply, with periods overriding commas and commas overriding spaces.
Maybe my approach is just wrong? This is what I've got:
import re

def split(string, delimiters):
    regex_pattern = '|'.join(map(re.escape, delimiters))
    return re.split(regex_pattern, string)
Which ends up doing everything wrong. It's not even close
Use the re library to split the text on word boundaries, then apply the replacements in order of precedence.
import re

s = "Once there was a (so-called) rock. it.,was not! in fact, a big rock."

# split the string into tokens along word boundaries
# (needs Python 3.7+, where re.split can split on empty matches)
regex = r"\b"
l = re.split(regex, s)

def replaceDelimeters(token: str):
    # each pattern matches a whole token that contains at least one delimiter of its group
    spaceDelimiters = r"[^- _]*[- _]+[^- _]*"
    commaDelimiters = r"[^,():;\"]*[,():;\"]+[^,():;\"]*"
    periodDelimiters = r"[^.!?]*[.!?]+[^.!?]*"
    # substitute in order of precedence: period first, then comma, then space
    token = re.sub(periodDelimiters, ".", token)
    token = re.sub(commaDelimiters, ",", token)
    token = re.sub(spaceDelimiters, " ", token)
    return token

# apply to every non-empty token
result = [replaceDelimeters(token) for token in l if token != ""]
print(result)
This method returns "." as the last entry of the list. I don't know whether that is what you want: your desired output omits it, but your rules imply it. In any case, deleting the last entry when it is a period should be easy enough.
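As an alternative, here is a minimal sketch that splits directly on runs of delimiter characters instead of on word boundaries. It relies on the fact that re.split keeps the separators in its result when the pattern contains a capturing group, which is also why the attempt in the question loses them. It also sidesteps a corner case of the `\b` approach: `_` counts as a word character, so `\b` does not split at it. The names CLASSES, RUN, classify, tokenize and the drop_trailing flag are made up for this illustration, and dropping the trailing delimiter is an assumption based on the desired output shown in the question.

```python
import re

# Delimiter groups from the question, highest precedence first,
# each paired with the character that replaces it.
CLASSES = [(".!?", "."), (",():;\"", ","), (" -_", " ")]

# One character class covering every delimiter. The capturing group makes
# re.split return the delimiter runs alongside the words.
ALL_DELIMS = "".join(chars for chars, _ in CLASSES)
RUN = re.compile("([" + re.escape(ALL_DELIMS) + "]+)")

def classify(token):
    # Return the replacement for the highest-precedence group found in the token.
    for chars, replacement in CLASSES:
        if any(c in chars for c in token):
            return replacement
    return token  # no delimiter characters: this is a word, keep it unchanged

def tokenize(text, drop_trailing=True):
    # Split into alternating words and delimiter runs, skipping empty strings.
    tokens = [classify(t) for t in RUN.split(text) if t]
    # Assumption: a delimiter at the very end is dropped, as in the desired output.
    if drop_trailing and tokens and tokens[-1] in {".", ",", " "}:
        tokens.pop()
    return tokens

s = "Once there was a (so-called) rock. it.,was not! in fact, a big rock."
print(tokenize(s))
```

Calling tokenize(s, drop_trailing=False) keeps the final "." instead, matching the behavior of the code above.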