
A regular expression for \b


I am writing regular expressions for Unicode text in Java. However, for the script I am working with, Devanagari (U+0900 to U+097F), there is a problem with word boundaries: \b matches next to dependent vowel signs (such as U+093E to U+094C), because they are treated like space characters rather than word characters.

Example: suppose I have the string "कमल कमाल कम्हल कम्हाल". Note that 'मा' in the second word is formed by combining म and ा (which is recognized as a space character). The same happens in the last word. As a result, the regular expression \b\w\b matches the 'ल' in 'कमाल', which is not correct for the language.

I hope the example helps.

Can I write a regular expression that behaves like \b except that it doesn't match at certain characters? Any feedback would be appreciated.


Solution

  • You should be able to accomplish what you want with the following regex operators:

    (?=X)   X, via zero-width positive lookahead
    (?!X)   X, via zero-width negative lookahead
    (?<=X)  X, via zero-width positive lookbehind
    (?<!X)  X, via zero-width negative lookbehind
    

    (The above is quoted from the Java 6 Pattern API docs.)

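    Use (?<![foo])(?=[foo]) in place of \b before a word, and (?<=[foo])(?![foo]) in place of \b after a word, where "[foo]" is your set of "word characters". For Devanagari, that set needs to include the dependent vowel signs; the simplest choice is probably the whole block, [\u0900-\u097F].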
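
As a rough illustration of the lookaround approach, here is a minimal Java sketch. It assumes the whole Devanagari block (U+0900 to U+097F) counts as "word characters", which is a simplification: the block also contains the danda punctuation marks (U+0964, U+0965) and digits, so you may want to narrow the class. The names DevanagariBoundaries, WORD, START, END and findAll are made up for this example and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DevanagariBoundaries {
    // Assumption: every code point in the Devanagari block is a word character.
    static final String WORD = "[\\u0900-\\u097F]";

    // Replacement for \b before a word: no word char behind, word char ahead.
    static final String START = "(?<!" + WORD + ")(?=" + WORD + ")";

    // Replacement for \b after a word: word char behind, no word char ahead.
    static final String END = "(?<=" + WORD + ")(?!" + WORD + ")";

    // Collects every match of the given regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The sample string from the question, written with escapes so the
        // source file does not depend on the compiler's default encoding.
        String text = "\u0915\u092E\u0932 \u0915\u092E\u093E\u0932 "
                    + "\u0915\u092E\u094D\u0939\u0932 "
                    + "\u0915\u092E\u094D\u0939\u093E\u0932";

        // Whole words: each of the four space-separated words is one match.
        System.out.println(findAll(START + WORD + "+" + END, text));

        // Equivalent of \b\w\b: single-character words only. The sample has
        // none, so the list is empty and the final letter of the second
        // word is no longer matched on its own.
        System.out.println(findAll(START + WORD + END, text));
    }
}
```

Because the vowel sign U+093E is inside the character class, the position between it and the following consonant is no longer seen as a boundary, which is the behaviour the question asks for.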