I need a regex that finds all sentence-ending periods and ignores mid-sentence periods, such as those in abbreviations. Note: I understand that there are many other variations and that it may not be possible to account for all of them, so the focus of the question is: can at least the sample below be solved with a regex?
Suppose I have the text below. The regex rule below matches any period followed by whitespace, but it also matches the periods in p.m. and U.S. How can I ignore a) periods in a word that consists of characters all separated by periods (such as U.S.) and b) a period preceded by only one character (such as J.)? This is in Kotlin.
val text = "At 12.51 p.m. local time, J. Knapp, former U.S. Navy, went out for a walk. Yes he did. And then a Mw6.3 earthquake happened."
val regexRule = "\\.\\s+"
val splitText = text.split(regexRule.toRegex())
val result = splitText.joinToString(separator = ".\n\n")
Current result with just that rule:
At 12.51 p.m.
local time, J.
Knapp, former U.S.
Navy, went out for a walk.
Yes he did.
And then a Mw6.3 earthquake happened.
You can use
val regexRule = "(?<!\\b\\p{L})\\.(?<!\\d.(?=\\d))(?!\\s*\$)\\s*"
See the regex demo.
Details:
- (?<!\b\p{L}) - a negative lookbehind: no single letter preceded by a word boundary is allowed immediately to the left of the current location
- \. - a dot
- (?<!\d.(?=\d)) - the dot should not be between digits
- (?!\s*$) - immediately to the right, there should not be zero or more whitespace characters followed by the end of the string
- \s* - zero or more whitespace characters.
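As a minimal sketch, the rule can be dropped into the code from the question like this. The names sentenceEnd and splitSentences are hypothetical and only used for illustration; the pattern itself is the one given above.

```kotlin
// Pattern from the answer: a dot that is not preceded by a single letter,
// not sitting between digits, and not the final dot of the string,
// followed by any trailing whitespace.
// (sentenceEnd and splitSentences are illustrative names, not from the question.)
val sentenceEnd = Regex("(?<!\\b\\p{L})\\.(?<!\\d.(?=\\d))(?!\\s*\$)\\s*")

// Splits the text at sentence-ending periods; the matched period and the
// whitespace after it are consumed by the split.
fun splitSentences(text: String): List<String> = text.split(sentenceEnd)

fun main() {
    val text = "At 12.51 p.m. local time, J. Knapp, former U.S. Navy, went out for a walk. Yes he did. And then a Mw6.3 earthquake happened."
    // Re-add the period that the split consumed, as in the question's code.
    val result = splitSentences(text).joinToString(separator = ".\n\n")
    println(result)
}
```

With the sample text this should yield three pieces: the first sentence up to "walk", then "Yes he did", then the last sentence with its final period intact, since the (?!\s*$) lookahead keeps that period from matching. Note that the single-letter lookbehind also means a sentence that really does end in a one-letter word or a dotted abbreviation (for example "... at 5 p.m. Then ...") is not split there, which is in line with the caveat in the question.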