I want to split a string on all spaces and punctuation except the apostrophe. Preferably, a single quote should still act as a delimiter unless it is being used as an apostrophe. I also want to keep the delimiters.
Example string:
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern thus far:
splitted = re.split(r"[^'-\w]",words.lower())
I tried throwing the single quote after the ^ character but it is not working.
My desired output is this:
splitted = [hello,my,name,is,joe,.,what's,your's]
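One option is to use lookarounds to split at the desired positions, and to put what you want to keep in a capture group, because re.split returns the text matched by capture groups as part of the result.
After the split, you can remove the empty entries from the resulting list (empty strings, plus the None values that re.split inserts when the capture group did not take part in a match).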
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches:
\s+  Match 1 or more whitespace chars
|  Or
(?<=\s)'  Match ' when preceded by a whitespace char
|  Or
'(?=\s)  Match ' when followed by a whitespace char
|  Or
(?<=\w)([,.!?])  Capture one of , . ! ? in group 1, when preceded by a word character
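The same breakdown can be kept next to the pattern itself. The following is a minimal sketch, not part of the original answer, that assumes you are happy to compile the pattern with re.VERBOSE so that whitespace inside it is ignored and each alternative can carry its own comment:

```python
import re

# Same four alternatives as the one-line pattern, spread out with re.VERBOSE.
pattern = re.compile(r"""
      \s+                # one or more whitespace chars
    | (?<=\s)'           # a ' preceded by a whitespace char
    | '(?=\s)            # a ' followed by a whitespace char
    | (?<=\w)([,.!?])    # one of , . ! ? after a word char, captured in group 1
""", re.VERBOSE)

words = """hello my name is 'joe.' what's your's"""

# Drop the empty strings and None values that the split produces.
result = [s for s in pattern.split(words) if s]
print(result)
```

Behaviour should match the one-line pattern; the verbose form only changes how the pattern is written.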
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
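To see why the filtering step is needed, it can help to look at the list before it is filtered. This sketch reuses the pattern and string from the example; the name raw is only illustrative. Because the pattern contains a capture group, re.split adds that group's value after every match, which is None whenever one of the first three alternatives matched, and adjacent delimiters (such as the space followed by ') leave empty strings between them:

```python
import re

pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""

# Unfiltered split: contains None (group 1 did not participate)
# and '' (nothing between two adjacent delimiters).
raw = re.split(pattern, words)
print(raw)

# Keep only the truthy entries, which removes both None and ''.
result = [s for s in raw if s]
print(result)
```

The `if s` test is enough here because both None and the empty string are falsy, while every real token, including ".", is a non-empty string.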