java arabic

How can I remove Arabic punctuation from a String in Java?


I am working on an Arabic dictionary, and I get sentences like this from my database:

String original = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";

I can't process the sentence without first removing the accents (diacritics) and punctuation.

I tried using:

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public static String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
} 

but it didn't work.


Solution

  • Why not match on the Unicode general categories for punctuation (P) and non-spacing marks (Mn)?

    I'm not sure what result you expect, since you didn't post it (and I can't read Arabic :) ), but try this code:

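    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String input = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
    // \p{P} matches any punctuation, \p{Mn} matches non-spacing marks (the diacritics)
    Pattern p = Pattern.compile("[\\p{P}\\p{Mn}]");
    Matcher m = p.matcher(input);
    // List every match first, to see what the pattern catches
    while (m.find()) {
        System.out.println("found: " + m.group());
    }
    // Rewind the matcher, then replace each match with a space
    m.reset();
    System.out.println("Replaced: " + m.replaceAll(" "));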
    

    Output:

    found: '
    found: َ
    found: َ
    found: َ
    found: ُ
    found: ً
    found: :
    found: َ
    found: َ
    found: َ
    found: َ
    found: َ
    found: ّ
    found: َ
    found: َ
    found: .
    found: '
    Replaced:  أ ب ن  ف لان ا  ع اب ه ور م اه بخ ل  ة س وء  
    

    I suppose it's not your desired final result, but I hope it's something you can work with.

    Also, this is a gold mine of information on the Unicode categories. I believe most are applicable in a Java Pattern.
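
Building on the same pattern, here is a minimal sketch of the removal itself. It assumes the goal is bare letters separated by single spaces, which the question does not actually state. The class name `ArabicCleaner` and the method `strip` are hypothetical names chosen for illustration. Matches are replaced with an empty string rather than a space, so letters that carried diacritics are not split apart, and runs of whitespace are then collapsed.

```java
import java.util.regex.Pattern;

public class ArabicCleaner {
    // \p{P}: any punctuation; \p{Mn}: non-spacing marks (fatha, damma, shadda, tanwin, ...)
    private static final Pattern MARKS_AND_PUNCT = Pattern.compile("[\\p{P}\\p{Mn}]");
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Hypothetical helper: drops punctuation and diacritics, then normalizes spacing.
    public static String strip(String text) {
        String stripped = MARKS_AND_PUNCT.matcher(text).replaceAll("");
        return WHITESPACE_RUN.matcher(stripped).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String original = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
        System.out.println(strip(original));
    }
}
```

The attempt in the question most likely failed because `\p{InCombiningDiacriticalMarks}` names the Unicode block U+0300 to U+036F, while the Arabic diacritics live in the Arabic block (for example, fatha is U+064E). One caveat: this sketch deliberately skips the `Normalizer` step. Under NFD, a letter such as أ decomposes into ا plus a combining hamza (U+0654), which is itself a non-spacing mark and would be stripped too.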