regex spam-prevention

Understanding SpamAssassin HK_RANDOM regex


SpamAssassin has several rules that attempt to detect "random looking" values. For example:

/^(?!(?:mail|bounce)[_.-]|[^@]*(?:[+=^~\#]|mcgr|kpmg|nlpbr|ndqv|lcgc|cplpr|-mailer@)|[^@]{26}|.*?@.{0,20}\bcmp-info\.com$)[^@]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})/mi

I understand that the first part of the regex prevents certain cases from matching:

(?!(?:mail|bounce)[_.-]|[^@]*(?:[+=^~\#]|mcgr|kpmg|nlpbr|ndqv|lcgc|cplpr|-mailer@)|[^@]{26}|.*?@.{0,20}\bcmp-info\.com$)

However, I am not able to understand how the second part detects "randomness". Any help would be greatly appreciated!

/[^@]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})/mi

Solution

  • It will match strings containing 5 consecutive consonants (h and s are not in the class, presumably because they are frequent in ordinary consonant clusters; y is treated as a vowel):

    [bcdfgjklmnpqrtvwxz]{5}
    

    or 5 consecutive vowels (y included):

    [aeiouy]{5}
    

    or the same letter or pair of letters followed by 3 more copies of itself (so it appears 4 times in a row):

    ([a-z]{1,2})(?:\1){3}
    

    Here are a few examples of strings it will match:

    somethingmkfkgkmsomething
    aiaioe
    totototo
    aaaa
    

    It obviously can't detect randomness. What it can do is identify patterns that rarely occur in meaningful strings and flag them as random-looking.

    It is also possible that these patterns were built "from experience", after analysing a number of emails crafted by spammers, and that they reflect the algorithms behind the spammers' tools or the process used to create these emails (e.g. some degree of keyboard mashing?).

    The bottom line is that you can't detect randomness in a single piece of data. What you can do, however, is try to detect purpose, and if you don't find any, assume that to the best of your knowledge the data is random. SpamAssassin assumes a few rules about human communication (which may fit some languages better than others: as written, it will flag some French imperfect-tense forms such as "échouaient", which contains the five-vowel run "ouaie"), and if the content doesn't match them it reports it as "random".
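To make the three alternatives concrete, here is a minimal sketch in Python (the rule itself is a Perl regex, but this pattern uses only syntax that Python's `re` module also accepts). The names `RANDOM_LOOKING`, `FULL_RULE`, `trigger` and `looks_random` are made up for this illustration, and the sketch assumes the rule is applied to an address-like string such as a sender address, which the `@` handling in the pattern suggests.

```python
import re

# Second half of the rule as quoted in the question. The leading [^@]* is
# dropped here because search() already scans the whole string.
RANDOM_LOOKING = re.compile(
    r"[bcdfgjklmnpqrtvwxz]{5}"   # 5 consonants in a row (no h, s or y)
    r"|[aeiouy]{5}"              # 5 vowels in a row (y counts as a vowel)
    r"|([a-z]{1,2})(?:\1){3}",   # 1 or 2 letters followed by 3 more copies
    re.IGNORECASE,
)

# The complete rule from the question, negative lookahead included.
FULL_RULE = re.compile(
    r"^(?!(?:mail|bounce)[_.-]"
    r"|[^@]*(?:[+=^~\#]|mcgr|kpmg|nlpbr|ndqv|lcgc|cplpr|-mailer@)"
    r"|[^@]{26}"
    r"|.*?@.{0,20}\bcmp-info\.com$)"
    r"[^@]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})",
    re.MULTILINE | re.IGNORECASE,
)


def trigger(text):
    """Return the first substring that looks 'random', or None."""
    match = RANDOM_LOOKING.search(text)
    return match.group(0) if match else None


def looks_random(address):
    """True if the full rule (exclusions included) matches the address."""
    return FULL_RULE.search(address) is not None


if __name__ == "__main__":
    samples = ["somethingmkfkgkmsomething", "aiaioe", "totototo", "aaaa",
               "échouaient", "strengths", "hello"]
    for sample in samples:
        print(sample, "->", trigger(sample))

    addresses = ["xkcdqz@example.com", "bounce-xkcdqz@example.com",
                 "xkcdqz+tag@example.com", "john.smith@example.com"]
    for address in addresses:
        print(address, "->", looks_random(address))
```

Note that "strengths" is not flagged precisely because h and s are missing from the consonant class, and that the `bounce-` and `+tag` addresses are rejected by the lookahead even though their local part contains the same consonant run as the first address. Because the match is anchored with `^` and every step before the alternation is `[^@]*`, only the part before the first `@` is ever tested for "randomness".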