php arrays filtering cpu-word blacklist

Generate an array of two-word strings which do not include blacklisted words


I have an array of single words, for example:

[
    'This',
    'is',
    'a',
    'test',
    'array',
]

Next, I create another array with overlapping 2-word pairs which looks like this:

[
    'This is',
    'is a',
    'a test',
    'test array',
]
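
For reference, here is a minimal sketch of how such overlapping pairs might be built. The helper name make_pairs() is hypothetical and only used for illustration:

```php
<?php
// make_pairs() is a hypothetical helper name, not an existing PHP function.
function make_pairs(array $words): array
{
    $pairs = [];
    // Stop one element early: the last word has no right-hand neighbour.
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pairs[] = $words[$i] . ' ' . $words[$i + 1];
    }
    return $pairs;
}

$words = ['This', 'is', 'a', 'test', 'array'];
$pairs = make_pairs($words);
// $pairs is ['This is', 'is a', 'a test', 'test array']
```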

I also have an array of "common words"; any pair that contains one of those words should be excluded from the result array.

Let's say the common words are is and a for this example. Right now, I check the single-word array for common words first, using if(in_array($word, $common_words)) continue; to skip any word that is in the $common_words array.

But this would result in this array:

[
    'This test',
    'test array',
]

But this is not how I want it to happen. It should be like this:

[
    'test array',
]

Because this is the only pair whose two words were next to each other in the original array, before the common words were taken out. (Are you still with me?)

The problem is that the in_array() check no longer works once the array holds two-word strings. So I did some research and stumbled upon the array_filter() function. I think this is what I need, but I'm at a total loss as to how to apply it to my code.


Solution

  • Your guess is correct. You can use array_filter():

    $array = ['this is', 'array array', 'an array', 'test array'];
    $stop  = ['is', 'test'];
    
    $array = array_filter($array, function($x) use ($stop)
    {
       return !preg_match('/('.join(')|(', $stop).')/', $x);
    });
    

    That is, array_filter() drops every item that matches a pattern built from the stop words.

    This works because each item is tested against a single regex: from $stop we get the pattern (is)|(test).
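
    One caveat with this pattern: it matches substrings too, so a stop word such as a would also match inside array. A minimal sketch that matches whole words only, using the data from the question, wraps the alternation in \b word-boundary anchors. The variable names and the case-insensitive i flag are assumptions added for illustration:

    ```php
    <?php
    $pairs = ['This is', 'is a', 'a test', 'test array'];
    $stop  = ['is', 'a'];

    // Assumes the stop words contain no regex metacharacters.
    // \b allows a match only where a whole word starts and ends.
    $pattern = '/\b(?:' . implode('|', $stop) . ')\b/i';

    // array_filter() preserves keys, so array_values() reindexes the result.
    $result = array_values(array_filter($pairs, function ($x) use ($pattern) {
        return !preg_match($pattern, $x);
    }));
    // $result is ['test array']
    ```

    Here 'test array' survives because neither is nor a occurs in it as a standalone word.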

    It is a good idea to build the regex once, outside the callback, so that it is not rebuilt on every array_filter() iteration, like:

    $array   = ['this is', 'array array', 'an array', 'test array'];
    $stop    = ['is', 'test'];
    // Build the pattern once, outside the callback.
    $pattern = '/('.join(')|(', $stop).')/';
    
    $array = array_filter($array, function($x) use ($pattern)
    {
       return !preg_match($pattern, $x);
    });
    

    Important note #1: if your stop words may contain characters that have a special meaning in a regex, you need to escape them with preg_quote(), like:

    // Escape each stop word (including the "/" delimiter) and wrap it in a group.
    $pattern = '/'.join('|', array_map(function($x)
    {
       return '('.preg_quote($x, '/').')';
    }, $stop)).'/';
    
    $array = array_filter($array, function($x) use ($pattern)
    {
       return !preg_match($pattern, $x);
    });
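
    To see why the escaping matters, here is a small sketch with a made-up stop word, a.m., whose dots would otherwise act as regex wildcards. The sample data and variable names are hypothetical:

    ```php
    <?php
    $items = ['9 a.m.', 'give alms'];
    $stop  = ['a.m.'];

    // Unescaped: each "." matches any character, so "alms" matches as well.
    $raw = '/(' . implode(')|(', $stop) . ')/';

    // Escaped: preg_quote() turns "." into "\.", so only the literal text matches.
    $quoted = '/' . implode('|', array_map(function ($w) {
        return '(' . preg_quote($w, '/') . ')';
    }, $stop)) . '/';

    // Helper closure: return the items that do NOT match the given pattern.
    $keep = function ($pattern) use ($items) {
        return array_values(array_filter($items, function ($x) use ($pattern) {
            return !preg_match($pattern, $x);
        }));
    };

    $withRaw    = $keep($raw);     // []: both items are filtered out
    $withQuoted = $keep($quoted);  // ['give alms']
    ```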
    

    Important note #2: if your array of stop words is very long, compiling the regex may fail because the pattern is too large. There are some tricks to work around that, but if this is your case, you are better off using strpos() instead:

    $array = array_filter($array, function($x) use ($stop)
    {
       // Reject the item as soon as any stop word occurs in it.
       foreach($stop as $word)
       {
          if(false!==strpos($x, $word))
          {
             return false;
          }
       }
       // No stop word found: keep the item.
       return true;
    });
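
    Like the regex versions, strpos() matches substrings, so a would also be found inside array. If only whole words should count, one possible variation, sketched here with the data from the question, splits each pair on the space and looks each word up in a set built with array_flip(). The lowercasing is an added assumption, not something the question asks for:

    ```php
    <?php
    $pairs = ['This is', 'is a', 'a test', 'test array'];
    $stop  = ['is', 'a'];

    // Flip the list so each stop word becomes a key; isset() is then a hash lookup.
    $stopSet = array_flip(array_map('strtolower', $stop));

    $result = array_values(array_filter($pairs, function ($pair) use ($stopSet) {
        foreach (explode(' ', strtolower($pair)) as $word) {
            if (isset($stopSet[$word])) {
                return false; // the pair contains a whole stop word
            }
        }
        return true; // neither word is a stop word
    }));
    // $result is ['test array']
    ```

    The set lookup avoids building one large regex, so it should also stay workable when the stop-word list grows long.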