python arrays regex string

How can I replace every character from three groups of characters with one of three replacement characters, respectively?


This is my input:

"Once there     was a (so-called) rock. it.,was not! in fact, a big rock."

I need it to output an array that looks like this:

["Once", " ", "there", " ", "was", " ", "a", ",", "so", " ", "called", ",", "rock", ".", "it", ".", "was", " ", "not", ".", "in", " ", "fact", ",", "a", " ", "big", " ", "rock"]

The input has to go through some rules to get the punctuation into this form. The rules are:

spaceDelimiters  = " -_" 
commaDelimiters  = ",():;\""
periodDelimiters = ".!?"

Any character in spaceDelimiters should be replaced with a space. The same goes for commaDelimiters (replace with a comma) and periodDelimiters (replace with a period). Comma has priority over space, and period has priority over comma.

I got to the point where I could remove all of the delimiter characters, but I need them to stay in the array as separate pieces. There also needs to be a hierarchy, with periods overriding commas and commas overriding spaces.

Maybe my approach is just wrong? This is what I've got:

import re

def split(string, delimiters):
    regex_pattern = '|'.join(map(re.escape, delimiters))
    return re.split(regex_pattern, string)

This ends up doing everything wrong. It's not even close.
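
For context on why the delimiters vanish: `re.split` discards whatever the pattern matches unless the pattern is wrapped in a capturing group. The sketch below is a minimal illustration of that behavior; `split_keep` is a hypothetical name, not something from the original attempt.

```python
import re

def split_keep(string, delimiters):
    # Hypothetical variant of split(): the surrounding parentheses form a
    # capturing group, so re.split also returns the matched delimiters.
    pattern = "(" + "|".join(map(re.escape, delimiters)) + ")"
    return re.split(pattern, string)

print(split_keep("a (so-called) rock.", " -_,():;\".!?"))
```

Each delimiter character still comes back as its own piece, with empty strings between adjacent delimiters, so runs of delimiters would still have to be merged and ranked by priority.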


Solution

  • Use the re module to split the text on word boundaries, then apply the replacements in order of precedence

    import re
    
    s = "Once there     was a (so-called) rock. it.,was not! in fact, a big rock."
    
    # split the string into tokens along word boundaries
    # (splitting on an empty match such as \b requires Python 3.7+)
    regex = r"\b"
    
    l = re.split(regex, s)
    
    def replaceDelimeters(token: str):
        
        # in each token, identify whether it contains a delimiter
        spaceDelimiters  = r"[^- _]*[- _]+[^- _]*" 
        commaDelimiters  = r"[^,():;\"]*[,():;\"]+[^,():;\"]*"
        periodDelimiters = r"[^.!?]*[.!?]+[^.!?]*"
        
        # substitute the replacement, highest priority first
        token = re.sub(periodDelimiters, ".", token)
        token = re.sub(commaDelimiters, ",", token)
        token = re.sub(spaceDelimiters, " ", token)
        return token
    
    # apply to every non-empty token
    result = [replaceDelimeters(token) for token in l if token != ""]
    print(result)
    

    This method returns "." as the last entry of the list. I don't know whether that is what you want: your expected output omits it, but your rules imply it. Removing the last entry when it is a period should be easy enough in any case.
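
As an alternative, here is a minimal single-pass sketch. It is a different technique from the word-boundary split above: it splits on runs of delimiter characters with a capturing group, then labels each run by the highest-priority group it contains. The names `classify`, `tokenize`, and `keep_trailing` are hypothetical, and dropping a trailing delimiter by default is an assumption made to match the expected output in the question. Unlike `\b`, this treats `_` as a delimiter, since `_` counts as a word character in regular expressions.

```python
import re

# Delimiter groups from the question, lowest to highest priority.
SPACE = " -_"
COMMA = ",():;\""
PERIOD = ".!?"

# One capturing group that matches a run of any delimiter characters.
RUN = re.compile("([" + re.escape(SPACE + COMMA + PERIOD) + "]+)")

def classify(run):
    # The highest-priority group present in the run decides the replacement.
    if any(ch in PERIOD for ch in run):
        return "."
    if any(ch in COMMA for ch in run):
        return ","
    return " "

def tokenize(text, keep_trailing=False):
    # With a capturing group, re.split alternates text and delimiter runs:
    # even indices hold words, odd indices hold delimiter runs.
    parts = RUN.split(text)
    tokens = []
    for i, part in enumerate(parts):
        if i % 2:
            tokens.append(classify(part))
        elif part:  # skip the empty strings produced at either end
            tokens.append(part)
    # An empty final part means the text ended with a delimiter run.
    if not keep_trailing and len(parts) > 1 and parts[-1] == "":
        tokens.pop()
    return tokens

s = "Once there     was a (so-called) rock. it.,was not! in fact, a big rock."
print(tokenize(s))
```

For the sample sentence this yields the list given in the question; passing `keep_trailing=True` keeps the final "." instead.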