
How can I remove emoji from text using XQuery?


I have $text = "Hello 😀😃😄 💜 🙏🏻 🦦üÀâ$"

I want to remove only the emoji from the text using XQuery. How can I do that?

Expected result: "Hello üÀâ$"

I tried:

replace($text, '\p{IsEmoticons}+', '')

but it didn't work: it only removed the smileys.

Result now: "Hello 💜 🙏🏻 🦦üÀâ$"

Thanks in advance :)


Solution

  • I outlined the approach in my answer to the original question, which I updated based on your comment asking how to strip out 💜.

    Quoting from that expanded answer:

    The "Emoticons" block doesn't contain all characters commonly associated with "emoji." For example, according to a Unicode lookup site such as https://www.compart.com/en/unicode/U+1F49C, 💜 (Purple Heart, U+1F49C) is from:

    Miscellaneous Symbols and Pictographs, U+1F300 - U+1F5FF

    This block is not available in XPath or XQuery processors, since it is listed neither in the XML Schema 1.0 spec linked above nor in Unicode block names for use in XSD regular expressions, a list of blocks that XPath and XQuery processors conforming to XML Schema 1.1 are required to support.

    For characters from blocks not available in XPath or XQuery, you can manually construct character classes. For example, given the purple heart character above, we can match it as follows:

    replace("Purple 💜 heart", "[🌀-🗿]", "")
    

    This returns the expected result:

    Purple  heart
    

    This approach can be applied to 🙏🏻, 🦦, or any other character:

    1. Locate the character's Unicode block.
    2. Craft your regular expression with the block name (if available in XPath) or a character class covering the block's code point range.
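
    As a minimal sketch of those two steps applied to the string from the question: the three ranges below are the Unicode block ranges for the emoji that appear in that sample only, not a complete definition of "emoji", and the call to normalize-space() assumes you also want the leftover runs of spaces collapsed (it trims leading and trailing whitespace too). The variable name $emoji is just illustrative.

    ```xquery
    xquery version "3.1";

    (: Blocks covering the emoji in the sample string, as code point ranges:
       U+1F300-U+1F5FF  Miscellaneous Symbols and Pictographs (purple heart, skin tone modifier)
       U+1F600-U+1F64F  Emoticons (smileys, folded hands)
       U+1F900-U+1F9FF  Supplemental Symbols and Pictographs (otter) :)
    let $emoji := "[&#x1F300;-&#x1F5FF;&#x1F600;-&#x1F64F;&#x1F900;-&#x1F9FF;]"
    let $text := "Hello 😀😃😄 💜 🙏🏻 🦦üÀâ$"
    return
        (: Assumption: collapsing whitespace is wanted; drop normalize-space() if not. :)
        normalize-space(replace($text, $emoji, ""))
    ```

    With the sample input this should return "Hello üÀâ$". Other emoji (flags, dingbats, transport symbols, and so on) live in further blocks, so extend the character class as needed.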

    Alternatively, rather than locating the blocks of characters you want to strip out, you could identify the blocks of characters you want to preserve. For example, given the example string in the original post, perhaps the goal is to preserve only those characters in the "Basic Latin" block. To do so, we can match characters NOT in this block via the \P Category Escape:

    xquery version "3.1";
    
    let $text := "Hello 😀😃😄 💜 🙏🏻 🦦üÀâ$"
    return
        replace($text, "\P{IsBasicLatin}", "")
    

    This query returns:

    Hello    $
    

    Notice that this has also stripped out the characters with diacritics, which perhaps isn't desired. These characters belong to the Latin-1 Supplement block. To preserve characters from both the Basic Latin and Latin-1 Supplement blocks, we'd need to adjust the query as follows:

    xquery version "3.1";
    
    let $text := "Hello 😀😃😄 💜 🙏🏻 🦦üÀâ$"
    return
        replace($text, "[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]", "")
    

    ... which returns:

    Hello    üÀâ$
    

    This now preserves the characters with diacritics.

    To be precise about the characters you preserve or remove, you need to consult the Unicode blocks and charts.
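
    To look a character up in the charts you first need its code point. One possible way to get it from within XQuery is sketched below; local:hex is a hypothetical helper written for this example (it is not a built-in function), and the two sample characters are taken from the question.

    ```xquery
    xquery version "3.1";

    (: Hypothetical helper: renders a non-negative integer as uppercase hexadecimal. :)
    declare function local:hex($n as xs:integer) as xs:string {
        let $digit := substring("0123456789ABCDEF", ($n mod 16) + 1, 1)
        return
            if ($n lt 16) then $digit
            else local:hex($n idiv 16) || $digit
    };

    (: List each character in the string alongside its code point. :)
    for $cp in string-to-codepoints("💜🦦")
    return codepoints-to-string($cp) || " U+" || local:hex($cp)
    ```

    This should yield "💜 U+1F49C" and "🦦 U+1F9A6", which you can then match against the block ranges in the Unicode charts.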