pythonregexnlptokenize

Python - RegEx for splitting text into sentences (sentence-tokenizing)


I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of the sentence and not at decimals or abbreviations or title of a name or if the sentence has a .com This is attempt at regex that doesn't work.

import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

for stuff in sentences:
        print(stuff)    

Example output of what it should look like

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. 
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

Solution

  • (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
    

    Try this. split your string this.You can also check demo.

    http://regex101.com/r/nG1gU7/27

    Another variant.

    (?<!\w\.\w.)(?<!\b[A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?)\s|\\n
    

    https://regex101.com/r/EVUIUT/1