php unicode emoji

How to remove invisible characters while removing emojis in Laravel/PHP?


We are using Laravel and a package called maatwebsite/excel to export data as XLS files for our clients.

In a recent issue, the XLS download was broken: most of the data was disappearing. After debugging closely, we found that one of the data points had this value in it: "England 🏴󠁧󠁢󠁥󠁮󠁧󠁿"

We already have a piece of code that strips out any emojis and replaces them with ??, so that we don't end up exporting broken XLS documents.

But in this case we found something weird. Even when our replaceEmojis method ran on this string, it left invisible characters behind. This is our replaceEmojis code:

if (!function_exists('replaceEmojis')) {
    function replaceEmojis($string, $replaceWith = '??') {
        // Define a regular expression pattern to match emojis
        $pattern = '/[\x{1F600}-\x{1F64F}\x{1F300}-\x{1F5FF}\x{1F680}-\x{1F6FF}\x{1F700}-\x{1F77F}\x{1F780}-\x{1F7FF}\x{1F800}-\x{1F8FF}\x{1F900}-\x{1F9FF}\x{1FA00}-\x{1FA6F}\x{1FA70}-\x{1FAFF}\x{1FAB0}-\x{1FABF}\x{1FAC0}-\x{1FAFF}\x{1FAD0}-\x{1FAD9}\x{1FAD0}-\x{1FAD9}\x{1F300}-\x{1F5FF}\x{1F004}-\x{1F0CF}\x{1F170}-\x{1F251}\x{200D}]+/u';
    
        // Use preg_replace_callback to replace emojis with "??"
        $result = preg_replace_callback($pattern, function ($match) use ($replaceWith) {
            return $replaceWith;
        }, $string);
    
        return $result;
    }
}

After running it on that string, we get this back:

England ??󠁧󠁢󠁥󠁮󠁧󠁿

But when we paste this into https://www.soscisurvey.de/tools/view-chars.php and view the result, this is what we see:

[image: character viewer output for the string above]
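
The leftover characters can also be listed directly in PHP. This is only a minimal sketch: `codePoints` is a hypothetical helper that is not part of the code above, and it assumes the mbstring extension (which Laravel already requires). The six characters after the ?? are written as explicit escapes, on the assumption that they are the tag characters of the England flag sequence.

```php
<?php
// Hypothetical helper, not part of the original code: list the code points of a UTF-8 string.
function codePoints(string $s): array {
    // Split on every character boundary; the /u flag makes PCRE work on code points, not bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

    return array_map(function (string $c): string {
        return sprintf('U+%04X', mb_ord($c, 'UTF-8'));
    }, $chars);
}

// The output shown above, with the leftover invisible characters written as explicit escapes.
$leftover = "England ??\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}";

// Prints one U+XXXX entry per code point, separated by spaces.
echo implode(' ', codePoints($leftover)), PHP_EOL;
```

The last six entries of the list are the invisible code points that follow the ??.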

We even tried a different method we found here: https://stackoverflow.com/a/68155491/2730064

But we still face a similar problem. As soon as we remove that 🏴󠁧󠁢󠁥󠁮󠁧󠁿 from the data, the XLS export works fine, so we are sure this is what causes the issue. Any ideas on how to fix this? How can we remove those invisible characters from the string, so that it doesn't happen with other emojis in the future?


Solution

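  • The problem is that you are thinking in code points and not in glyphs (grapheme clusters, in Unicode terms). A glyph can be composed of several code points. For example, the England flag in your string is U+1F3F4 (waving black flag) followed by six invisible tag characters (U+E0067, U+E0062, U+E0065, U+E006E, U+E0067, U+E007F). Your pattern only matches U+1F3F4, so the tag characters are left behind.

    Fortunately, PCRE has an escape sequence that matches a whole glyph: \X

    So you can rewrite your pattern like this: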

    function replaceEmojis(string $string, string $replaceWith = '??'): string {
        $pattern = '~(?xx) # ignore spaces even inside character classes
          (?:
            (?= [ \x{200D}
                  \x{1F004}-\x{1F0CF} \x{1F170}-\x{1F251}
                  \x{1F300}-\x{1F64F} \x{1F680}-\x{1FAFF} ]
            )  # find the position with a lookahead
            \X # match the glyph
          )+ ~u';
    
        return preg_replace($pattern, $replaceWith, $string);
    }
    

    For each position that matches your character class, \X will consume the code points until the end of the glyph.

    (Note that I used the character class from your question and only merged ranges where possible. I do not claim that this character class is the right one to remove every emoji in existence.)
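
As a quick check of the behaviour described above, here is a hedged usage sketch. The function is repeated from the answer so the snippet runs on its own, and the flag is built from explicit escapes so the invisible code points are visible in the source. It assumes PHP 7.3 or later (a PCRE2 build that understands `(?xx)` and treats the tag characters as part of the flag's cluster); the variable names are illustrative only.

```php
<?php
// Same function as in the answer, repeated so the snippet is self-contained.
function replaceEmojis(string $string, string $replaceWith = '??'): string {
    $pattern = '~(?xx) # ignore spaces even inside character classes
      (?:
        (?= [ \x{200D}
              \x{1F004}-\x{1F0CF} \x{1F170}-\x{1F251}
              \x{1F300}-\x{1F64F} \x{1F680}-\x{1FAFF} ]
        )  # find the position with a lookahead
        \X # match the glyph
      )+ ~u';

    return preg_replace($pattern, $replaceWith, $string);
}

// England flag: U+1F3F4 followed by six tag characters.
$flag  = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}";
$input = "England " . $flag;

// Count how \X and a plain dot see the flag: clusters versus code points.
$clusters   = preg_match_all('~\X~u', $flag);
$codePoints = preg_match_all('~.~su', $flag);

$clean = replaceEmojis($input);

var_dump($clusters, $codePoints, $clean);
```

With such a PCRE2 build, the flag counts as a single cluster made of seven code points, and the cleaned string ends right after the ?? with nothing invisible left behind.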