regex security

Why does this text bypass the RegEx scanning?


A scammer is mass-messaging our users with messages like this:

𝕀 π•’π•ž 𝕧𝕖𝕣π•ͺ π•šπ•Ÿπ•₯𝕖𝕣𝕖𝕀π•₯𝕖𝕕. (𝟞𝟟𝟠) - π•₯𝕖𝕩π•₯ π•žπ•– π•Ÿπ• π•¨. 𝕀 π•¨π•’π•Ÿπ•₯ π•₯𝕠 π•‘π•šπ•”π•œ 𝕦𝕑 π”Έπ•Šπ”Έβ„™. ℍ𝕖𝕝𝕝𝕠 π•šπ•€ π•₯π•™π•šπ•€ 𝕀π•₯π•šπ•π• π•’π•§π•’π•šπ•π•’π•“π•π•–, π•†π•œ π”½π•šπ•£π•€π•₯ 𝕀 π•¨π•’π•Ÿπ•₯ π•₯𝕠 𝕦𝕀𝕖 𝕒 𝟞 π••π•šπ•˜π•šπ•₯ 𝕔𝕠𝕕𝕖 π•—π•£π• π•ž π”Ύπ• π• π•˜π•π•– π•π• π•šπ•”π•– π•₯𝕠 π•žπ•’π•œπ•– 𝕀𝕦𝕣𝕖 π•šπ•— π•ͺ𝕠𝕦 π•’π•Ÿπ•• π•₯𝕙𝕖 𝕑𝕠𝕀π•₯ 𝕒𝕣𝕖 𝕣𝕖𝕒𝕝

My questions:

  1. Why does the message text look like that? Is that a font?

  2. How does this kind of text bypass our manual RegEx text scanning? We scan every message to catch anything suspicious.


Solution

  • The characters they used, 𝔸 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A, U+1D538) through 𝕫 (MATHEMATICAL DOUBLE-STRUCK SMALL Z, U+1D56B), are distinct Unicode characters, not ordinary letters rendered in a different font.

    If your regular expressions were not written with such characters in mind, they will not catch them. In many engines a simple /\w/ only covers ASCII letters, digits and the underscore unless Unicode matching is enabled explicitly (often with a flag such as /u at the end of the expression). The details vary by engine: in JavaScript, for example, \w stays ASCII-only even with the u flag, and you would need a Unicode property class such as \p{L} instead.

    Similarly, /A/ will not match "𝔸", simply because they are different characters (U+0041 versus U+1D538) and a literal pattern only matches that exact character.

    So that you do not have to think about every possible way to write a similar-looking character, normalize the text before running your regular expressions on it. A compatibility normalization form (NFKC or NFKD) folds characters like these to their plain equivalents, so 𝔸 becomes A; the canonical forms NFC and NFD leave them unchanged. This gives you a much more consistent representation and makes it far easier to write expressions that match a wider range of texts.
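
As a rough illustration of normalizing before matching, here is a minimal Python sketch using the standard library's unicodedata.normalize. The SUSPICIOUS pattern and the helper names normalize_message and is_suspicious are made up for this example and are not taken from the asker's system; a real filter would use its own rules.

```python
import re
import unicodedata

# Hypothetical filter rule, for illustration only: flag "text me" requests.
SUSPICIOUS = re.compile(r"text me", re.IGNORECASE)


def normalize_message(text: str) -> str:
    """Fold compatibility characters (such as double-struck letters) to plain forms."""
    return unicodedata.normalize("NFKC", text)


def is_suspicious(text: str) -> bool:
    """Run the pattern on the normalized text rather than on the raw input."""
    return SUSPICIOUS.search(normalize_message(text)) is not None


# "text me" written with mathematical double-struck letters
# (escapes are used so the exact code points are visible).
raw = "\U0001D565\U0001D556\U0001D569\U0001D565 \U0001D55E\U0001D556"

print(SUSPICIOUS.search(raw) is not None)  # False: the raw text slips past the pattern
print(normalize_message(raw))              # text me
print(is_suspicious(raw))                  # True: caught after NFKC normalization
```

Note that NFKC only folds characters that Unicode defines as compatibility variants. Look-alike letters from other scripts (a Cyrillic "а" in place of a Latin "a", for example) survive normalization, so this is one layer of defence rather than a complete one.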