I'm writing code to construct inverted indices from a large number of input files. I've been able to parse the input using:
String[] words = value.toString().replaceAll("[^a-zA-Z ]", " ").toLowerCase().split("\\s+");
I ran into trouble when replacing special characters with the empty string, since that merged some words together, so I replace them with whitespace in the code above. However, that still doesn't give the output I want, because it splits words containing apostrophes into two words.
The input files are varied: some are Shakespeare poetry, others are play scripts, etc. I'm having trouble figuring out how to keep certain apostrophes in my words but not others.
For example:
the input
'twas, 't,[order'd], king's, o', 'Brutus!', ''At
should return
'twas 't order'd king's o' Brutus At
In other words, I want to keep a single leading or trailing apostrophe and any apostrophe inside a word that is followed by a single letter, while removing a pair of single apostrophes wrapped around a word and any double apostrophes before or after a word. Is there a way to do this, or something close to it, using a series of regexes?
str = str.replaceAll(" *, *", " ")
.replaceAll("[^\\w' !]", "")
.replaceAll("'(\\S*)'", "$1");
See live demo.
\w means “any word char” (letters, digits and underscore)
\S means “any non-whitespace char”
If you want to keep more chars, adjust the regex accordingly.
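As a rough check, here is a minimal sketch that runs the three replacements on the sample input from the question. The class name ApostropheCleaner and the methods clean and cleanDroppingBang are hypothetical names chosen for illustration. Note that the character class in the second step keeps "!", so "Brutus!" keeps its exclamation mark; the second method is an assumed variant that removes "!" from the class, which appears to match the output asked for in the question.

```java
class ApostropheCleaner {

    // The three replacements exactly as given above.
    static String clean(String s) {
        return s.replaceAll(" *, *", " ")        // commas (with optional spaces) become one space
                .replaceAll("[^\\w' !]", "")     // drop everything except word chars, ', space and !
                .replaceAll("'(\\S*)'", "$1");   // unwrap 'word' and delete a bare '' pair
    }

    // Assumed variant: same chain, but "!" is no longer in the keep list.
    static String cleanDroppingBang(String s) {
        return s.replaceAll(" *, *", " ")
                .replaceAll("[^\\w' ]", "")
                .replaceAll("'(\\S*)'", "$1");
    }

    public static void main(String[] args) {
        String input = "'twas, 't,[order'd], king's, o', 'Brutus!', ''At";

        System.out.println(clean(input));
        // prints: 'twas 't order'd king's o' Brutus! At

        System.out.println(cleanDroppingBang(input));
        // prints: 'twas 't order'd king's o' Brutus At

        // Splitting on whitespace then yields the individual tokens.
        String[] words = cleanDroppingBang(input).split("\\s+");
        System.out.println(String.join("|", words));
        // prints: 'twas|'t|order'd|king's|o'|Brutus|At
    }
}
```

The last replacement only pairs apostrophes within a single whitespace-delimited token, because \S* cannot cross a space. That is why o' and 'twas keep their apostrophes while 'Brutus' is unwrapped.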