I'm working on a side project that applies NLP to clinical data, and I'm using Java's BreakIterator to split text into sentences for further analysis. The problem is that BreakIterator does not detect a sentence boundary when the next sentence starts with a number.
Example:
String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
Expected Output:
1) No acute osseous abnormality.
2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.
Actual Output:
1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.
Code:
import java.text.BreakIterator;
import java.util.Locale;

public class Test {
    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
        Locale locale = Locale.US;
        BreakIterator splitIntoSentences = BreakIterator.getSentenceInstance(locale);
        splitIntoSentences.setText(text);
        // Start offset of the sentence currently being extracted
        int index = 0;
        // next() returns the following boundary, or DONE at the end of the text
        while (splitIntoSentences.next() != BreakIterator.DONE) {
            String sentence = text.substring(index, splitIntoSentences.current());
            System.out.println(sentence);
            index = splitIntoSentences.current();
        }
    }
}
Any help would be appreciated. I searched online for an answer, but to no avail.
Update: instead of using BreakIterator, I'm now using Apache OpenNLP, and it works great!
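For anyone who needs to stay with the JDK only, here is a minimal sketch of one possible workaround, not a definitive fix: let BreakIterator do the first pass, then split each chunk again wherever a sentence terminator is followed by whitespace and a list marker such as "2)". The class name NumberedSentenceSplitter, the split helper, and the regular expression are my own assumptions for illustration. The pattern only covers markers of the form digits plus ")" and may produce false splits on other clinical text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class NumberedSentenceSplitter {
    // Assumption: a new list item looks like "2)" and follows ".", "!" or "?" plus whitespace.
    // The lookbehind/lookahead keep the terminator and the marker; only the whitespace is consumed.
    private static final Pattern BEFORE_LIST_MARKER =
            Pattern.compile("(?<=[.!?])\\s+(?=\\d+\\)\\s)");

    // Hypothetical helper: BreakIterator first, then a regex pass over each chunk.
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String chunk = text.substring(start, end);
            // Second pass: split the chunk in front of any numbered list marker.
            for (String piece : BEFORE_LIST_MARKER.split(chunk)) {
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(trimmed);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space "
                + "narrowing at the L4-5 level. This is another sentence.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

With the sample text from the question, this should print the three lines listed under Expected Output. Because the regex pass runs on top of whatever BreakIterator returns, the result does not depend on whether BreakIterator itself breaks before the number.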