regex, perl, spam, language-detection

How can I detect Russian spam posts with Perl?


I have an English-language forum site written in Perl that is continually bombarded with spam in Russian. Is there a way, using Perl and a regex, to detect Russian text so I can block it?


Solution

  • You can use the following character class to detect Cyrillic characters (the script used for Russian). Note that Perl writes code points as \x{...}; in Perl, \u is not a code point escape (it uppercases the next character):

    [\x{0400}-\x{04FF}]+

    If you really just want Russian characters, take a look at the Unicode chart for the Cyrillic block (U+0400 to U+04FF), which gives the exact range of the basic Russian alphabet: [\x{0410}-\x{044F}]. You would also need to add the Cyrillic letters outside that range that Russian uses, such as Ё (U+0401) and ё (U+0451), which are listed in the same chart.
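As a minimal sketch of how this could be wired into a Perl forum script, assuming posts arrive as UTF-8 bytes: the input is decoded first (a code point class cannot match undecoded bytes), and the Unicode script property \p{Cyrillic} is used in place of hard-coded ranges. The helper name looks_cyrillic and the 0.3 threshold are hypothetical choices for illustration, not part of the original answer. The threshold is there so that an English post quoting a single Russian word is not blocked outright.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true when the share of Cyrillic characters among
# all letters in $text exceeds $threshold (assumed default: 0.3).
# $text must already be a decoded character string, not raw bytes.
sub looks_cyrillic {
    my ($text, $threshold) = @_;
    $threshold //= 0.3;

    # Count all letters, then count Cyrillic-script characters.
    my $letters = () = $text =~ /\p{L}/g;
    return 0 unless $letters;
    my $cyrillic = () = $text =~ /\p{Cyrillic}/g;

    return ($cyrillic / $letters) > $threshold ? 1 : 0;
}

# Form data normally arrives as bytes, so decode before matching.
# These bytes are a short Russian greeting encoded as UTF-8.
my $raw_bytes = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
my $post      = decode('UTF-8', $raw_bytes);

print looks_cyrillic($post) ? "block\n" : "allow\n";
```

To block any post containing Cyrillic at all, a plain `$post =~ /\p{Cyrillic}/` test would do. Either way this detects the Cyrillic script rather than the Russian language, so Ukrainian or Bulgarian text would be flagged as well.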