javaregexstringemoji

Remove ✅, 🔥, ✈ , ♛ and other such emojis/images/signs from Java strings


I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English -- some of them are in other non-Latin languages, for example:

▓ railway??
→ Cats and dogs
I'm on 🔥
Apples ⚛ 
✅ Vi sign
♛ I'm the king ♛ 
Corée ♦ du Nord ☁  (French)
 gjør at både ◄╗ (Norwegian)
Star me ★
Star ⭐ once more
早上好 ♛ (Chinese)
Καλημέρα ✂ (Greek)
another ✓ sign ✓
добрай раніцы ✪ (Belarus)
◄ शुभ प्रभात ◄ (Hindi)
✪ ✰ ❈ ❧ Let's get together ★. We shall meet at 12/10/2018 10:00 AM at Tony's.❉

...and many more of these.

I would like to get rid of all these signs/images and to keep only the letters (and punctuation) in the different languages.

I tried to clean the signs using the EmojiParser library:

String withoutEmojis = EmojiParser.removeAllEmojis(input);

The problem is that EmojiParser is not able to remove the majority of the signs. The ♦ sign is the only one I found till now that it removed. Other signs such as ✪ ❉ ★ ✰ ❈ ❧ ✂ ❋ ⓡ ✿ ♛ 🔥 are not removed.

Is there a way to remove all these signs from the input strings and keeping only the letters and punctuation in the different languages?


Solution

  • Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.

    String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
    String emotionless = aString.replaceAll(characterFilter,"");
    

    So:

    Example:

    String str = "hello world _# 皆さん、こんにちは! 私はジョンと申します。🔥";
    System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]",""));
    // Output:
    //   "hello world _# 皆さん、こんにちは! 私はジョンと申します。"
    

    If you need more information, check out the Java documentation for regexes.