java regex unicode turkish

How can I use Java regex with Turkish characters (UTF-8)?


I'm trying to do some regex matching in Java, but I run into trouble when I search Turkish text. For example:

Search Text = "Ahmet Yıldırım" or "Esin AYDEMİR" 

// The search term comes from the local part of an e-mail address (e.g. yildirim@example.com), and I'm trying to find it in the name.
Regex strings = "yildirim" or "aydemir"

The searched text changes dynamically, so how can I solve this with a Java regex pattern? Alternatively, how do I convert Turkish characters (e.g. AYDEMİR -> AYDEMIR or Yıldırım -> Yildirim)?

Sorry about my grammar mistakes!


Solution

  • Use the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags together:

    Pattern p = Pattern.compile("yildirim", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    

    Demo on ideone

    By default, Pattern.CASE_INSENSITIVE matches case-insensitively only for characters in the US-ASCII character set. Pattern.UNICODE_CASE modifies that behavior so that matching is case-insensitive for all Unicode characters.

    Note that Unicode case-insensitive matching in Java regex is done in a culture-insensitive manner. Therefore ı, i, I, and İ are all treated as the same character, which is why the pattern "yildirim" matches "Yıldırım" and "aydemir" matches "AYDEMİR".

    Depending on your use case, you may want to use the Pattern.LITERAL flag to disable all metacharacters in the pattern, or escape only the literal parts of the pattern with Pattern.quote().
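To tie the points above together, here is a minimal sketch that applies both flags to the two names from the question. The class name TurkishRegexDemo and the helper containsIgnoreCase are hypothetical names made up for this illustration, and the search term is wrapped in Pattern.quote() on the assumption that it comes from user data (such as an e-mail address) and should not be interpreted as regex syntax. The Turkish letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class TurkishRegexDemo {
    // Hypothetical helper (not part of any library): returns true if `term`
    // occurs anywhere in `text`, ignoring case for all Unicode characters.
    static boolean containsIgnoreCase(String text, String term) {
        // Pattern.quote() makes the term a literal, so characters such as '.'
        // in a dynamic search term are not treated as metacharacters.
        Pattern p = Pattern.compile(
                Pattern.quote(term),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // \u0131 is the dotless lowercase i, \u0130 is the dotted capital I.
        String name1 = "Ahmet Y\u0131ld\u0131r\u0131m"; // "Ahmet Yıldırım"
        String name2 = "Esin AYDEM\u0130R";             // "Esin AYDEMİR"

        System.out.println(containsIgnoreCase(name1, "yildirim")); // true
        System.out.println(containsIgnoreCase(name2, "aydemir"));  // true
    }
}
```

Because the matching is culture-insensitive, this approach also treats a plain "I" and a Turkish "İ" as equal, which is convenient here but may be too permissive if the distinction matters in your data.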