I am writing regular expressions for Unicode text in Java. However, for the particular script I am using, Devanagari (U+0900 to U+097F), there is a problem with word boundaries: \b matches next to dependent vowel signs (such as U+093E to U+094C), because they are treated like space characters, that is, as non-word characters.
Example: suppose I have the string "कमल कमाल कम्हल कम्हाल". Note that 'मा' in the second word is formed by combining म and ा (which is treated like a space character). The same happens in the last word. As a result, the regular expression \b\w\b matches the 'ल' in 'कमाल', which is not correct for the language.
I hope the example helps.
Can I write a regular expression that behaves like \b except that it doesn't treat certain characters as boundaries? Any feedback would be appreciated.
You should be able to accomplish what you want with the following regex operators:
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
(?<=X) X, via zero-width positive lookbehind
(?<!X) X, via zero-width negative lookbehind
(The above is quoted from the Java 6 Pattern API docs.)
Use (?<![foo])(?=[foo]) in place of \b before a word, and (?<=[foo])(?![foo]) in place of \b after a word, where "[foo]" is your set of "word characters".