java regex match

Hard time figuring out the correct regex to match uppercase words


I have a simple requirement. We use the Hibernate validation engine to decide whether a constraint is true or false.

A text should validate as true if every word starts with an uppercase character. There are some difficulties:

- Words can also start like this: 8-Test or even 8Test or even (Test) or even -Test or anything comparable.
- Usually the words are comma-separated (or use a different separator): Test, Test, Test

Remember, I only want to make sure that every word in the string starts with an uppercase character. Judging by my attempts, I am probably overcomplicating things.

Here are some samples. All of these are expected to match (true):

- Hydroxyisohexyl 3-Cyclohexene Carboxaldehyde, Benzyl
- Test, Test, Test
- CI 15510, Methylchloroisothiazolinone, Disodium EDTA
- N/A
- NA
None of these are expected to match (false):
- hydroxyisohexyl 3-Cyclohexene Carboxaldehyde, Benzyl
- Test, test, Test
- CI 15510, Methylchloroisothiazolinone, Disodium eDTA
- na
- n/a
My attempts went in these directions:

final String oldregex = "([\\W]*\\b[A-Z\\d]\\w+\\b[\\W]*)+";
final String regex = "([A-Z][\\d\\w]+( [A-Z][-\\d\\w]+)*, )*[A-Z][-\\d\\w]+( [A-Z][-\\d\\w]+)*\\.";

With the "oldregex" option I actually ran into a seemingly infinite computation for some texts.

Use this to test the regex: http://gskinner.com/RegExr/ (without the double backslashes, of course)

Thanks for helping!!!


Solution

  • Regex

    See it in action:

    ^(?:[^A-Za-z]*[A-Z][^\s,]*)*[^A-Za-z]*$
    

    Explanation

    ^                # start of the string
    (?:              # this group matches a "word", don't capture the group
      [^A-Za-z]*     # skip any non-alphabetic characters at the start of the word
      [A-Z]          # require an uppercase letter as the first letter
      [^\s,]*        # match anything but word separators (\s and ,) after the first letter
    )*               # the whole line consists of such "words"
    [^A-Za-z]*       # skip any non-alphabetic characters at the end of the string
    $                # end of the string
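
To try the pattern explained above from Java, a minimal sketch could look like the following. The class name `UppercaseWordsCheck` and the method `allWordsUppercase` are hypothetical names chosen for illustration; only the regex itself comes from this answer, with the backslash doubled for a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper; class and method names are illustrative only.
class UppercaseWordsCheck {
    // The regex from the answer; "\\s" is the Java-literal form of \s.
    static final Pattern WORDS =
            Pattern.compile("^(?:[^A-Za-z]*[A-Z][^\\s,]*)*[^A-Za-z]*$");

    // Returns true when the whole text matches the pattern.
    static boolean allWordsUppercase(String text) {
        return WORDS.matcher(text).matches();
    }

    public static void main(String[] args) {
        // Samples taken from the question.
        String[] samples = {
            "Hydroxyisohexyl 3-Cyclohexene Carboxaldehyde, Benzyl",
            "Test, Test, Test",
            "N/A",
            "Test, test, Test",
            "CI 15510, Methylchloroisothiazolinone, Disodium eDTA",
            "n/a"
        };
        for (String s : samples) {
            System.out.println(allWordsUppercase(s) + "  <-  " + s);
        }
    }
}
```

Because `matches()` already requires the whole input to match, the `^` and `$` anchors are redundant here, but keeping them lets the same pattern string be used with `find()` or in an online tester.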
    

    Note: You can modify the regex if your word separator characters are something other than whitespace and comma. For example, change [^\s,] to [^,:-] or whatever you use.
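
If the separators vary between fields, one way to follow the note is to build the pattern from the separator characters. This is only a sketch: `SeparatorAwareCheck` and `forSeparators` are hypothetical names, and the argument is assumed to be a valid character-class body (already escaped where needed), such as `\\s,` or `,:-`.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not part of the answer.
class SeparatorAwareCheck {
    // separatorClassBody is inserted inside [^...], so it must be a valid
    // character-class body, e.g. "\\s," (whitespace, comma) or ",:-".
    static Pattern forSeparators(String separatorClassBody) {
        return Pattern.compile(
                "^(?:[^A-Za-z]*[A-Z][^" + separatorClassBody + "]*)*[^A-Za-z]*$");
    }

    static boolean allWordsUppercase(String text, String separatorClassBody) {
        return forSeparators(separatorClassBody).matcher(text).matches();
    }
}
```

For example, `SeparatorAwareCheck.allWordsUppercase("Alpha:Beta-Gamma", ",:-")` treats `:` and `-` as separators. With that separator set, whitespace no longer splits words, so "Alpha beta" counts as a single word that starts with an uppercase letter.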