I am making a modular function that takes keywords containing special characters (@, &, *, %) and keeps them intact while all other punctuation is deleted from a sentence. I have devised a solution, but it is bulky and probably more complicated than it needs to be. Is there a much simpler way to do this?
In short, my code finds the span of every occurrence of the protected words, then finds the span of every punctuation character. It then loops over the punctuation matches and removes every character that does not fall inside the span of a protected word.
Code:
import re
from string import punctuation
from typing import List

sentence = "I am going to run over to Q&A and ask them a ton of questions about this & that & that & this while surfacing the internet! with my raccoon buddy @ the bar."

# my attempt to remove punctuation
class SentenceHolder:
    sentence = None
    protected_words = ["Q&A"]

    def __init__(self, sentence):
        self.sentence = sentence

    def remove_punctuation(self):
        for punct in punctuation:
            # re.escape so characters such as "*" or "(" are matched literally
            symbol_matches: List[re.Match] = list(re.finditer(re.escape(punct), self.sentence))
            remove_able_matches = self._protected_word_overlap(symbol_matches)
            # work backwards so earlier match positions stay valid
            for match in reversed(remove_able_matches):
                self.sentence = self.sentence[:match.start()] + " " + self.sentence[match.end():]

    def _protected_word_overlap(self, symbol_matches):
        # locate every occurrence of every protected word
        protected_word_locations = []
        for protected_word in self.protected_words:
            protected_word_locations.extend(re.finditer(re.escape(protected_word), self.sentence))
        # collect the punctuation matches that fall inside a protected word
        protected_matches = []
        for protected_word in protected_word_locations:
            protected_word_set = set(range(protected_word.start(), protected_word.end()))
            for symbol_inst in symbol_matches:
                symbol_range = range(symbol_inst.start(), symbol_inst.end())
                if protected_word_set.intersection(symbol_range):
                    protected_matches.append(symbol_inst)
        remove_able_matches = [sm for sm in symbol_matches if sm not in protected_matches]
        return remove_able_matches
Running the code:
my_string = SentenceHolder(sentence)
my_string.remove_punctuation()
Result:
"I am going to run over to Q&A and ask them a ton of questions about this that that this while surfacing the internet with my raccoon buddy the bar"
I tried to use a regex pattern to identify all the locations of the punctuation, but the pattern I use in re.sub does not work the same way in re.match.
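One likely cause, assuming the pattern itself is valid: re.match only tries the pattern at the very start of the string, whereas re.sub scans the whole string. re.search and re.finditer scan the whole string too, so they behave more like re.sub. A minimal illustration, where the sample text is only a placeholder:

```python
import re

text = "this & that"

# re.match anchors at position 0, so it finds nothing here.
match_result = re.match("&", text)

# re.search and re.finditer look through the whole string.
search_span = re.search("&", text).span()
all_spans = [m.span() for m in re.finditer("&", text)]

# re.sub also scans the whole string and replaces every hit.
substituted = re.sub("&", "", text)

print(match_result)   # None
print(search_span)    # (5, 6)
print(all_spans)      # [(5, 6)]
print(repr(substituted))
```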
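Probably not the best, but really simple: swap each protected word for a placeholder, remove the unwanted characters, then swap the protected words back.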
protected = ["Q&A", "stack@exchange"]
protected_dict = {f'protected{i}': p_word for i, p_word in enumerate(protected)}
sentence = "I am going to run over to Q&A stack@exchange and ask them a ton of questions about this & that & that & this while surfacing the internet! with my raccoon buddy @ the bar."
# protect
for k, v in protected_dict.items():
    sentence = sentence.replace(v, k)
# replace stuff
sentence = sentence.replace('&', '')
sentence = sentence.replace('@', '')
# revert back protected words
for k, v in protected_dict.items():
    sentence = sentence.replace(k, v)
print(sentence)
# I am going to run over to Q&A stack@exchange and ask them a ton of questions about this  that  that  this while surfacing the internet! with my raccoon buddy  the bar.
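Another option, as a rough sketch rather than a definitive solution: do it in a single re.sub pass with an alternation that tries the protected words first and any punctuation character second. The function name strip_punctuation and the final whitespace collapsing are my own assumptions, not something from the question.

```python
import re
from string import punctuation


def strip_punctuation(text, protected_words):
    """Remove punctuation from text, leaving protected_words untouched."""
    punct_class = f"[{re.escape(punctuation)}]"
    if protected_words:
        # Longest first, so a longer protected word wins over a shorter prefix.
        words = sorted(protected_words, key=len, reverse=True)
        protected = "|".join(re.escape(w) for w in words)
        pattern = re.compile(f"({protected})|{punct_class}")
        # Group 1 is set only when a protected word matched: keep it as is.
        # Otherwise a punctuation character matched: drop it.
        cleaned = pattern.sub(lambda m: m.group(1) or "", text)
    else:
        cleaned = re.sub(punct_class, "", text)
    # Assumption: runs of whitespace left behind should collapse to one space.
    return " ".join(cleaned.split())


print(strip_punctuation("Q&A this & that! @ bar.", ["Q&A"]))
# Q&A this that bar
```

Because the protected words come first in the alternation, the regex engine consumes a whole protected word before it ever gets to look at the punctuation inside it, so no span bookkeeping is needed.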