php arrays string similarity soundex

How can I generate all the wrong variants of words consisting of more than two letters?


I have an array containing many millions of words. I need to build an associative array of wrong variants of all these words, with each wrong variant as the key and the correct word as the value. A wrong variant must not coincide with any correct word in the array, and the wrong variants must not coincide with each other. I need these generated variants in order to correct misspelled Cyrillic words (neither Russian nor English). As an example, take the words "apple" and "lost". The array of correct words from which the incorrect variants are created:

<?php
$correct_words = array(
   "apple",
   "lost",
   "lot",
   "microsoft"
); 
?>

I want the result to look like this:

<?php
$incorrect_variant_words = array(
    "aple"=>"apple",
    "lst"=>"lost",
    "lt"=>"lot",
    "microsot"=>"microsoft",
    "microsft"=>"microsoft",
    "microoft"=>"microsoft",
    "micrsoft"=>"microsoft",
    "micosoft"=>"microsoft",
    "mirosoft"=>"microsoft",
    "mcrosoft"=>"microsoft"
);
?>
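
A minimal sketch of how such a mapping could be generated, assuming (from the example above) that a wrong variant is the word with one interior letter removed. The function name `build_deletion_variants` is hypothetical, and keeping the first mapping when two correct words yield the same variant is an assumption, not a stated requirement. Note that this rule also yields "appe" for "apple", which the hand-written example omits. Letters are split with `preg_split('//u', ...)` so that Cyrillic letters in UTF-8 are treated as whole letters rather than bytes.

```php
<?php
// Hypothetical helper: maps each wrong variant to its correct word.
// Assumed rule: a variant is the word with exactly one interior letter removed.
function build_deletion_variants(array $correct_words): array
{
    // Flip the list so that "is this a correct word?" is a fast key lookup.
    $is_correct = array_flip($correct_words);
    $variants = [];

    foreach ($correct_words as $word) {
        // Split into UTF-8 characters (assumes valid UTF-8 input).
        $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
        $n = count($chars);

        // Keep the first and last letters; drop each interior letter in turn.
        for ($i = 1; $i < $n - 1; $i++) {
            $copy = $chars;
            unset($copy[$i]);
            $variant = implode('', $copy);

            // Skip variants that are correct words or were already produced.
            if (isset($is_correct[$variant]) || isset($variants[$variant])) {
                continue;
            }
            $variants[$variant] = $word;
        }
    }
    return $variants;
}

$incorrect_variant_words = build_deletion_variants(
    array("apple", "lost", "lot", "microsoft")
);
```

For the four sample words this produces eleven variants. "lot" is not emitted as a variant of "lost" because it is itself a correct word.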

My goal is to correct the incorrect words. Is there an existing solution for this task, or can you advise me on an approach? Google Translate, for example, implements this kind of correction. How can I solve this without the Pspell PHP extension? Please help me with this difficult task. Here is also an array of correct words of the kind I need to use:

<?php

$array = array(

  "миёнаҳои",
  "луғатҳои",
  "онандроҷ",
  "ганҷинаи",
  "ҷамъиятӣ",
  "иҷтимоии",
  "муҳаммад",
  "рӯзмарра",
  "ҳамзабон",
  "забонҳои",
  "ҳамчунин",
  "фарҳанге",
  "феҳристи",
  "зардуштӣ",
  "таркибҳо",
  "ибораҳои",
  "калимаҳо",
  "фарҳанги",
  "тобишҳои",
  "намунаҳо",
  "нусхаҳои",
  "фирдавсӣ",
  "ҳуруфоти",
  "мутобиқи",
  "тақрибан",
  "алоҳидаи",
  "тоисломӣ",
  "паҳлавик",
  "классикӣ",
  "мӯътабар",
  "қадамҳои",
  "баргаҳои"

);

?>

Thank you in advance


Solution

  • Iterate over the array of correct words, compare each one to the input value with similar_text, and return the word with the highest match percentage. Basic concept:

    $correct_words = array(
       "apple",
       "lost",
       "lot",
       "microsoft"
    );
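    $input = 'lst';
    $match = 0;
    $result = null; // stays null if no word matches at all
    foreach ($correct_words as $correct) {
        // $percent receives the similarity as a percentage
        similar_text($correct, $input, $percent);
        if ($percent > $match) {
            $result = $correct;
            $match = $percent;
        }
    }
    echo $result;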
    

    The output is: lost

    Edit: here is the result for the words from your query:

    $correct_words = array(
       "тоҷик",
       "тоҷикӣ",
       "тоҷики"
    );
    $input = array("тоҷикӣ", "тоҷики", "точик", "точикӣ", "точики", "тоики", "тоикӣ", "тоҷӣкӣ", "тҷикӣ", "тчики", "тҷӣкӣ", "тчик");
    foreach ($input as $in) {
        // Reset the best match for every input word
        $match = 0;
        $result = null;
        foreach ($correct_words as $correct) {
            similar_text($correct, $in, $percent);
            if ($percent > $match) {
                $result = $correct;
                $match = $percent;
            }
        }
        echo "$in is corrected to $result\r\n";
    }
    

    Result is:

    тоҷикӣ is corrected to тоҷикӣ
    тоҷики is corrected to тоҷики
    точик is corrected to тоҷик
    точикӣ is corrected to тоҷикӣ
    точики is corrected to тоҷики
    тоики is corrected to тоҷики
    тоикӣ is corrected to тоҷикӣ
    тоҷӣкӣ is corrected to тоҷикӣ
    тҷикӣ is corrected to тоҷикӣ
    тчики is corrected to тоҷики
    тҷӣкӣ is corrected to тоҷикӣ
    тчик is corrected to тоҷик
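
One caveat with the approach above: similar_text (like the built-in levenshtein) compares bytes, and each Cyrillic letter takes two bytes in UTF-8, so the scores are computed over bytes rather than letters. As an alternative, here is a minimal sketch of a character-based Levenshtein distance. This swaps in a different similarity measure from the one used above. `mb_levenshtein` and `closest_word` are hypothetical helper names, and the input is assumed to be valid UTF-8 with precomposed letters.

```php
<?php
// Levenshtein distance counted in UTF-8 characters instead of bytes.
function mb_levenshtein(string $a, string $b): int
{
    $a = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);

    // $prev[$j] holds the distance between the processed prefix of $a
    // and the first $j characters of $b.
    $prev = range(0, count($b));
    foreach ($a as $i => $ca) {
        $curr = [$i + 1];
        foreach ($b as $j => $cb) {
            $curr[] = min(
                $prev[$j + 1] + 1,                 // delete a letter from $a
                $curr[$j] + 1,                     // insert a letter into $a
                $prev[$j] + ($ca === $cb ? 0 : 1)  // substitute (or match)
            );
        }
        $prev = $curr;
    }
    return $prev[count($b)];
}

// Returns the correct word with the smallest distance to $input,
// or null when the list is empty. On ties the earliest word wins.
function closest_word(string $input, array $correct_words): ?string
{
    $best = null;
    $best_distance = PHP_INT_MAX;
    foreach ($correct_words as $correct) {
        $distance = mb_levenshtein($input, $correct);
        if ($distance < $best_distance) {
            $best = $correct;
            $best_distance = $distance;
        }
    }
    return $best;
}

echo closest_word('точик', array('тоҷик', 'тоҷикӣ', 'тоҷики')), "\n";
```

Here the misspelling differs from "тоҷик" by a single letter, so that word is chosen. Scanning the whole list for every input is linear in the number of correct words, which may be too slow for millions of entries without some pre-filtering.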