I have $text = "Hello πππ 💜 ππ» 🦦üäö$"
I want to remove just the emoji from the text using XQuery. How can I do that?
Expected result: "Hello üäö$"
I tried to use:
replace($text, '\p{IsEmoticons}+', '')
but it didn't work; it only removed the smileys.
Result now: "Hello 💜 ππ» 🦦üäö$" Expected result: "Hello üäö$"
Thanks in advance :)
I outlined the approach in my answer to the original question, which I updated based on your comment asking how to strip out 💜.
Quoting from that expanded answer:
The "Emoticons" block doesn't contain all characters commonly associated with "emoji." For example, 💜 (Purple Heart, U+1F49C), according to a site like https://www.compart.com/en/unicode/U+1F49C that lets you look up Unicode character information, is from:
Miscellaneous Symbols and Pictographs, U+1F300 - U+1F5FF
This block is not available in XPath or XQuery processors: it is listed neither in the XML Schema 1.0 spec linked above nor in Unicode block names for use in XSD regular expressions, a list of blocks that XPath and XQuery processors conforming to XML Schema 1.1 are required to support.
For characters from blocks not available in XPath or XQuery, you can manually construct character classes. For example, given the purple heart character above, we can match it as follows:
replace("Purple 💜 heart", "[🌀-🗿]", "")
This returns the expected result:
Purple  heart
This approach can be applied to ππ», 🦦, or any other character: look up the character's block and add that block's range to the character class.
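As a rough sketch of how that might look, the query below combines the ranges of three blocks into one character class, written with character references so the ranges are readable without emoji font support. The variable name $emoji-class and the sample string are illustrative and not taken from the original question. The ranges are the block boundaries of Miscellaneous Symbols and Pictographs (U+1F300 - U+1F5FF), Emoticons (U+1F600 - U+1F64F), and Supplemental Symbols and Pictographs (U+1F900 - U+1F9FF, which contains the otter, U+1F9A6). Other emoji may live in blocks not covered here, so treat the class as a starting point.

```xquery
xquery version "3.1";

(: Sketch only: a hand-built character class covering three emoji-bearing blocks.
   $emoji-class is a hypothetical name; extend the ranges as needed. :)
let $emoji-class :=
    "[&#x1F300;-&#x1F5FF;"   (: Miscellaneous Symbols and Pictographs :)
    || "&#x1F600;-&#x1F64F;" (: Emoticons :)
    || "&#x1F900;-&#x1F9FF;]" (: Supplemental Symbols and Pictographs :)
(: Sample input: a purple heart (U+1F49C) and an otter (U+1F9A6) :)
let $text := "Purple &#x1F49C; heart, otter &#x1F9A6;"
return
    replace($text, $emoji-class, "")
```

Both symbols fall inside the listed ranges, so the query should strip them and leave the surrounding text and spaces untouched.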
Alternatively, rather than locating the blocks of characters you want to strip out, you could identify the blocks of characters you want to preserve. For example, given the example string in the original post, perhaps the goal is to preserve only those characters in the "Basic Latin" block. To do so, we can match characters NOT in this block via the \P Category Escape:
xquery version "3.1";
let $text := "Hello πππ 💜 ππ» 🦦üäö$"
return
replace($text, "\P{IsBasicLatin}", "")
This query returns:
Hello $
Notice that this has stripped out the characters with diacritics, which perhaps isn't desired. These characters belong to the Latin-1 Supplement block. To preserve characters from both the Basic Latin and Latin-1 Supplement blocks, we'd need to adjust the query as follows:
xquery version "3.1";
let $text := "Hello πππ 💜 ππ» 🦦üäö$"
return
replace($text, "[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]", "")
... which returns:
Hello üäö$
This now preserves the characters with diacritics.
To be precise about the characters you preserve or remove, you need to consult the Unicode blocks and charts.