I'm indexing documents which contain normal text, programming code and other non-linguistic fragments. For reasons which aren't particularly relevant, I am trying to tokenise the content into lowercased strings of normal language and single-character symbols.
Thus the input
a few words. Cost*count
should generate the tokens
[a] [few] [words] [.] [cost] [*] [count]
So far, so straightforward. But I want to handle "compound" words too, because the content can include words like order_date, object-oriented and class.method as well.
I'm following the principle that any of [-], [_] and [.] should be treated as a compound word conjunction rather than a symbol IF they are between two word characters, and should be treated as a separate symbol character if they are adjacent to a space, another symbol character, or the beginning or end of a string. I can handle all of this with a PatternTokenizer, like so:
public static final String tokenRgx = "(([A-Za-z0-9]+[-_.])*[A-Za-z0-9]+)|[^A-Za-z0-9\\s]{1}";

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    // Group 0 emits each whole regex match as one token.
    PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRgx), 0);
    TokenStream result = new LowerCaseFilter(src);
    return new TokenStreamComponents(src, result);
}
This successfully distinguishes between full stops at the end of sentences and full stops in compounds, between hyphens introducing negative numbers and hyphens in hyphenated words, and so on. With the above analyzer, the input:
a few words. class.simple_method_name. dd-mm-yyyy.
produces the tokens
[a] [few] [words] [.] [class.simple_method_name] [.] [dd-mm-yyyy] [.]
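For anyone who wants to try the symbol-versus-conjunction rule without setting up Lucene, here is a minimal sketch that applies the same pattern with plain java.util.regex. The class name CompoundTokenDemo and the tokenize helper are made up for illustration. It assumes that PatternTokenizer with group 0 emits the same successive matches as Matcher.find(), which is my understanding but is not checked against any particular Lucene version here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Lucene.
class CompoundTokenDemo {
    // Same pattern as tokenRgx in the question.
    static final Pattern TOKEN = Pattern.compile(
            "(([A-Za-z0-9]+[-_.])*[A-Za-z0-9]+)|[^A-Za-z0-9\\s]{1}");

    // Collects every successive match, lowercased, in input order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }
}
```

Calling CompoundTokenDemo.tokenize("Object-oriented -5 end.") keeps the hyphen inside the compound but emits the leading hyphen and the final full stop as separate symbols.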
We're almost there, but not quite. Finally I want to split the compound terms into their parts RETAINING the trailing hyphen/underscore/stop character in each part. So I think I need to introduce another filter step to my analyzer so that the final set of tokens I end up with is this:
[a] [few] [words] [.] [class.] [simple_] [method_] [name] [.] [dd-] [mm-] [yyyy] [.]
And this is the piece that I am having trouble with. I presume that some kind of PatternCaptureGroupTokenFilter is required here but I haven't been able to find the right set of expressions to get the exact tokens I want emerging from the analyzer.
I know it must be possible, but I seem to have walked into a regular expression wall that blocks me. I need a flash of insight or a hint, if anyone can offer me one.
Thanks, T
Edit: Thanks to @rici for pointing me towards the solution.
The pattern which works (including support for decimal numbers) is:
String tokenRegex = "-?[0-9]+\\.[0-9]+|[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]";
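A quick way to sanity-check this pattern outside Lucene is to run it through java.util.regex directly. The sketch below does that; the class name DecimalAwareTokenDemo and its tokenize method are illustrative names only, and lowercasing stands in for the LowerCaseFilter step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Lucene.
class DecimalAwareTokenDemo {
    // Decimal numbers are tried first, then word parts with an optional
    // trailing joiner, then single symbols.
    static final Pattern TOKEN = Pattern.compile(
            "-?[0-9]+\\.[0-9]+|[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]");

    // Collects every successive match, lowercased, in input order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }
}
```

Because the decimal alternative comes first, an input such as "cost -3.14159" should yield [cost] [-3.14159], while compounds are still split into parts that keep their trailing joiner.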
It seems to me that it would be easier to do the whole thing in one scan, using a regex like:
[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]
That uses a zero-width forward (lookahead) assertion so that [-._] is only added to the preceding word if it is immediately followed by a letter or digit. (Because (?=…) is an assertion, it doesn't include the following character in the match.)
Note that this won't handle decimal numbers correctly: -3.14159 will come out as three tokens, [-] [3.] [14159], rather than a single number token. Whether that matters depends on your precise needs.
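To see both effects of the lookahead pattern, here is a small sketch using plain java.util.regex rather than Lucene. The class name LookaheadTokenDemo and the tokenize helper are invented for this example, and the pattern is written as a Java string literal, so the backslash in \s is doubled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Lucene.
class LookaheadTokenDemo {
    // A word part may absorb one trailing joiner, but only when the
    // lookahead sees a letter or digit right after it.
    static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]");

    // Collects every successive match in input order (no lowercasing here).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With this helper, "class.simple_method_name." splits into [class.] [simple_] [method_] [name] [.], and "-3.14159" splits into [-] [3.] [14159], which is the decimal limitation mentioned above.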