php unicode utf-8 preg-replace spam-prevention

Enforce English only on PHP form submission


I would like the contact form on my website to only accept text submitted in English. I've been dealing with a lot of spam recently, in multiple languages, that slips right past the CAPTCHA. There is simply no reason for anyone to submit this form in a language other than English, since the site is a personal hobby rather than a business.

I've been looking through this documentation and was hopeful that something like preg_match('/[\p{Latin}]/u', $input) might work, but I'm not bilingual and don't understand all the nuances of character encoding. While this helps filter out something like Russian, it still lets languages written in Latin script, like Vietnamese, slip through.

Ideally I would like it to accept:

And I would like it to reject:

I'm thinking of simply stripping all potentially valid characters as follows:

$input = 'testing for English only!';

// reference: https://en.wikipedia.org/wiki/List_of_Unicode_characters
// allowed punctuation
$basic_latin = '`~!@#$%^&*()-_=+[{]}\\|;:\'",<.>/?';
$input = str_replace(str_split($basic_latin), '', $input);

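// allowed symbols and accents
// These are multibyte in UTF-8, so split with mb_str_split() (PHP 7.4+);
// str_split() would cut them into single bytes and strip parts of other characters.
$latin1_supplement = '¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿É×é÷';
$input = str_replace(mb_str_split($latin1_supplement), '', $input);
$unicode_symbols = '–—―‗‘’‚‛“”„†‡•…‰′″‹›‼‾⁄⁊';
$input = str_replace(mb_str_split($unicode_symbols), '', $input);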

// remove all spaces including tabs and end lines
$input = preg_replace('/\s+/', '', $input);

// check that remaining characters are alpha-numeric
if (strlen($input) > 0 && ctype_alnum($input)) {
    echo 'this is English';
} else {
    echo 'no bueno señor';
}

However, I'm afraid there might be some perfectly common and valid exceptions that I'm unwittingly leaving out. I'm hoping that someone might be able to offer a more elegant solution or approach?


Solution

  • There are no native PHP features for language recognition. There's an abandoned PEAR package and some classes floating around online (I haven't tested them). If an external API is acceptable, Google's Translation API Basic can detect the language, with 500K free characters per month.

    There is however a very simple solution to all this. We don't really need to know what language it is. All we need to know is whether it's reasonably valid English. And not Swahili or Klingon or Russian or Gibberish. Now, there is a convenient PHP extension for this: PSpell.

    Here's a sample function you might use:

    /**
     *  Spell Check Stats.
     *  Returns an array with OK, FAIL spell check counts and their ratio.
     *  Use the ratio to filter out undesirable (non-English/garbled) content.
     *  
     *  @updated 2022-12-29 00:00:29 +07:00
     *  @author @cmswares
     *  @ref https://stackoverflow.com/q/74910421/4630325
     *
     *  @param string   $text
     *  
     *  @return array
     */
    
    function spell_check_stats(string $text): array
    {
        $stats = [
            'ratio' => null,
            'ok' => 0,
            'fail' => 0
        ];
        
        // Split into words
        $words = preg_split('~[^\w\']+~', $text, -1, PREG_SPLIT_NO_EMPTY);
        
        // New PSpell dictionary handle for English:
        $pspeller = pspell_new("en");
        
        // Check spelling and build stats
        foreach($words as $word) {
            if(pspell_check($pspeller, $word)) {
                $stats['ok']++;
            } else {
                $stats['fail']++;
            }
        }
        
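        // Calculate ratio of OK to FAIL (higher means more valid English)
        $stats['ratio'] = match(true) {
            // No failures: avoid division by zero and treat it as ok / 1,
            // so fully valid text gets the highest ratio rather than 0.
            $stats['fail'] === 0 => $stats['ok'],
            // Otherwise plain ok / fail (0 when nothing passed the check).
            default => $stats['ok'] / $stats['fail'],
        };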
    
        return $stats;
    }
    

    Source at BitBucket. Function usage:

    $stats = spell_check_stats('This starts in English, esto no se quiere, tätä ei haluta.');
    // ratio: 0.7142857142857143, ok: 5, fail: 7
    

    Then simply decide the threshold at which a submission is rejected. For example, if 20 words in 100 fail, that is an 80:20 split of OK to FAIL, i.e. a ratio of 4. The higher the ratio, the more (properly spelled) English the text contains.
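
    As a rough illustration of applying such a threshold, here is a minimal sketch. The helper name passes_english_threshold and the default of 4.0 are assumptions taken from the 80:20 example, not part of the original function. It compares the "ok" and "fail" counts directly, so a submission with no failures at all is handled without dividing by zero.

    ```php
    <?php
    /**
     * Hypothetical helper (an assumption, not part of the original answer).
     * Takes the array returned by spell_check_stats() and applies a cut-off.
     */
    function passes_english_threshold(array $stats, float $minRatio = 4.0): bool
    {
        // Nothing passed the spell check (this also covers empty input): reject.
        if ($stats['ok'] === 0) {
            return false;
        }
        // Same test as ok / fail >= minRatio, rearranged so fail = 0 is safe.
        return $stats['ok'] >= $minRatio * $stats['fail'];
    }

    var_dump(passes_english_threshold(['ok' => 80, 'fail' => 20])); // bool(true)
    var_dump(passes_english_threshold(['ok' => 5, 'fail' => 7]));   // bool(false)
    ```

    The default is only a starting point; pass a different $minRatio once you have looked at real submissions.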

    The "ok" and "fail" counts are also returned in case you need to calibrate separately for very short strings. Run some tests on existing valid and spam content to see what sorts of figures you get, and then tune your rejection threshold accordingly.
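
    One way to run that kind of calibration is sketched below. The helper summarise_samples and the 'valid' / 'spam' labels are hypothetical; the checker is passed in as a callable, so it can be spell_check_stats or any function returning "ok" and "fail" counts.

    ```php
    <?php
    /**
     * Hypothetical calibration helper (an assumption, not from the original answer).
     * $samples maps a label such as 'valid' or 'spam' to a list of texts.
     * $checker is any callable returning ['ok' => int, 'fail' => int, ...].
     */
    function summarise_samples(array $samples, callable $checker): array
    {
        $rows = [];
        foreach ($samples as $label => $texts) {
            foreach ($texts as $text) {
                $stats = $checker($text);
                $rows[] = [
                    'label' => $label,
                    // Total word count, so very short strings can be inspected separately.
                    'words' => $stats['ok'] + $stats['fail'],
                    'ok'    => $stats['ok'],
                    'fail'  => $stats['fail'],
                ];
            }
        }
        return $rows;
    }
    ```

    Calling it with your own message lists and 'spell_check_stats' as the checker gives one row per message, which you can sort or print to see where valid and spam content separate.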


    The PSpell extension for PHP may not be installed by default on your server. On CentOS / RedHat, run yum install php-pspell aspell-en to install both the PHP module (which pulls in Aspell as a dependency) and an English dictionary. For other platforms, install the equivalent packages with your package manager.
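
    Because the extension may be missing, it can help to check for it before calling any pspell_* function. A minimal sketch, where the function name pspell_available and the variable $spellFilterEnabled are assumptions for illustration:

    ```php
    <?php
    // Hypothetical guard; the name is an assumption, not part of the original answer.
    function pspell_available(): bool
    {
        // extension_loaded() reports whether the module is active in this PHP build;
        // function_exists() double-checks that the function we need is callable.
        return extension_loaded('pspell') && function_exists('pspell_check');
    }

    // Only enable the spell-check filter when the extension is really there.
    $spellFilterEnabled = pspell_available();
    ```

    When the check fails, the form can fall back to whatever filtering it had before instead of triggering a fatal error on an undefined function.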

    For Windows and modern PHP, I can't find the extension DLL or a maintained Aspell port. Please share if you've found a solution; I would like to have this on my dev machine too.