java regex latin

Detect non-Latin characters with regex Pattern in Java


I think "Latin characters" is the right term for what I mean, but I'm not entirely sure what the correct classification is. I'm trying to use a regex Pattern to test whether a string contains non-Latin characters. I'm expecting the following results:

"abcDE 123";  // Yes, this should match
"!@#$%^&*";   // Yes, this should match
"aaàààäää";   // Yes, this should match
"ベビードラ";   // No, this shouldn't match
"😀😃😄😆";  // No, this shouldn't match

My understanding is that the built-in \p{IsLatin} class simply detects whether any of the characters are Latin. I want to detect whether any characters are not Latin.

Pattern LatinPattern = Pattern.compile("\\p{IsLatin}");
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
    System.out.println("is NON latin");
    return;
}
System.out.println("is latin");

Solution

  • TL;DR: Use regex ^[\p{Print}\p{IsLatin}]*$


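    You want a regex that matches if the string consists only of:

      • Latin-script letters (including accented ones such as à and ä)
      • printable ASCII characters: digits, punctuation, and the space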

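    The easiest way is to combine \p{IsLatin} with \p{Print}, where Pattern defines \p{Print} as:

      • \p{Print}: a printable character, [\p{Graph}\x20]
      • \p{Graph}: a visible character, [\p{Alnum}\p{Punct}]
      • \p{Alnum}: an alphanumeric character, [\p{Alpha}\p{Digit}]
      • \p{Alpha}: an alphabetic character, [\p{Lower}\p{Upper}], i.e. [a-zA-Z]
      • \p{Digit}: a decimal digit, [0-9]
      • \p{Punct}: punctuation, one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~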

    This makes \p{Print} the same as [\p{ASCII}&&\P{Cntrl}], i.e. the ASCII characters that are not control characters (\x20 through \x7E).

    The \p{Alpha} part overlaps with \p{IsLatin}, but that's fine, since the character class eliminates duplicates.

    So the regex is: ^[\p{Print}\p{IsLatin}]*$
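If you want to convince yourself that \p{Print} really is the same set as [\p{ASCII}&&\P{Cntrl}], a brute-force comparison over every code point is one way to check it. This is only a sketch: the class name PrintEquivalenceCheck and the method countMismatches are made up for illustration, and it assumes the default Pattern flags (in particular, no UNICODE_CHARACTER_CLASS, which would change what \p{Print} means).

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class PrintEquivalenceCheck {

    // Counts the code points on which the two character classes disagree.
    static int countMismatches() {
        Pattern print = Pattern.compile("\\p{Print}");
        Pattern asciiNonControl = Pattern.compile("[\\p{ASCII}&&\\P{Cntrl}]");
        int mismatches = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Build a one-code-point string (two chars for supplementary code points).
            String s = new String(Character.toChars(cp));
            boolean inPrint = print.matcher(s).matches();
            boolean inAsciiNonControl = asciiNonControl.matcher(s).matches();
            if (inPrint != inAsciiNonControl) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

With default flags both classes should cover exactly \x20 through \x7E, so the count is expected to be zero.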

    Test

    Pattern latinPattern = Pattern.compile("^[\\p{Print}\\p{IsLatin}]*$");
    
    String[] inputs = { "abcDE 123", "!@#$%^&*", "aaàààäää", "ベビードラ", "😀😃😄😆" };
    for (String input : inputs) {
        System.out.print("\"" + input + "\": ");
        Matcher matcher = latinPattern.matcher(input);
        if (! matcher.find()) {
            System.out.println("is NON latin");
        } else {
            System.out.println("is latin");
        }
    }
    

    Output

    "abcDE 123": is latin
    "!@#$%^&*": is latin
    "aaàààäää": is latin
    "ベビードラ": is NON latin
    "😀😃😄😆": is NON latin
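A possible variation, offered as a sketch rather than as part of the answer above: wrap the check in a small helper that uses matches() instead of find(), and add a negated character class for the "does it contain any non-Latin character" direction the question asks about. The names LatinCheck, isLatinOnly and containsNonLatin are invented here. One reason to prefer matches(): with default flags, $ also matches just before a trailing line terminator, so find() with ^...$ would accept "abc\n" even though \n is not in the class. The sample strings are written with \u escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class LatinCheck {

    // Same character class as the answer. matches() requires the whole
    // input to match, so no ^ and $ anchors are needed.
    private static final Pattern LATIN_ONLY =
            Pattern.compile("[\\p{Print}\\p{IsLatin}]*");

    // Negated class: one character that is neither printable ASCII
    // nor a Latin-script character.
    private static final Pattern NON_LATIN =
            Pattern.compile("[^\\p{Print}\\p{IsLatin}]");

    static boolean isLatinOnly(String s) {
        return LATIN_ONLY.matcher(s).matches();
    }

    static boolean containsNonLatin(String s) {
        return NON_LATIN.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "abcDE 123",
            "!@#$%^&*",
            "aa\u00E0\u00E0\u00E0\u00E4\u00E4\u00E4",      // a-grave and a-umlaut sample
            "\u30D9\u30D3\u30FC\u30C9\u30E9",              // Katakana sample from the question
            "\uD83D\uDE00\uD83D\uDE03\uD83D\uDE04\uD83D\uDE06" // emoji as surrogate pairs
        };
        for (String input : inputs) {
            System.out.println(isLatinOnly(input) ? "is latin" : "is NON latin");
        }
    }
}
```

The two methods are complements of each other, so in practice you would only need one of them; containsNonLatin may read more naturally if what you care about is finding the offending character.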