php, arrays, whitelist, preg-split, multibyte-characters

Get values which contain only whitelisted characters from a comma-delimited string


I have an array (converted from a string) that contains words with non-standard letters (letters not used in English, like ć, ä, ü). I don't want to replace those characters; I want to remove every word that contains them.

from [Adam-Smith, Christine, Müller, Roger, Hauptstraße, X Æ A-12]
to   [Adam-Smith, Christine, Roger]

This is what I have so far:

<?php 
    $tags = "Adam-Smith, Christine, Müller, Roger, Hauptstraße, X Æ A-12";

    $tags_array = preg_split("/\,/", $tags); 

    $tags_array = array_filter($tags_array, function($value){
       return strstr($value, "a") === false;
    });

    foreach($tags_array as $tag) {
        echo "<p>".$tag."</p>";
    }
?> 

I don't know how to remove words that contain anything other than letters and digits [a-z, A-Z, 0-9] and the characters [(), "", -, +, &, %, @, #]. Right now the code removes every word that contains an "a". How can I achieve this?


Solution

    $raw = 'Adam-Smith, Christine, Müller, Roger, Hauptstraße, X Æ A-12, johnny@knoxville, some(person), thing+asdf, Jude "The Law" Law, discord#124123, 100% A real person, shouldntadd.com';
    
    $regex = '/[^A-Za-z0-9\s\-\(\)\"\+\&\%\@\#]/';
    
    $tags = array_map('trim', explode(',', $raw));
    
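    // array_filter() preserves the original keys, so reindex with array_values().
    $tags = array_values(array_filter($tags, function ($tag) use ($regex) {
        // Keep the tag only if it contains no character outside the whitelist.
        return !preg_match($regex, $tag);
    }));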
    
    var_dump($tags);
    

    Yields:

    array(9) {
        [0]=>
        string(10) "Adam-Smith"
        [1]=>
        string(9) "Christine"
        [2]=>
        string(5) "Roger"
        [3]=>
        string(16) "johnny@knoxville"
        [4]=>
        string(12) "some(person)"
        [5]=>
        string(10) "thing+asdf"
        [6]=>
        string(18) "Jude "The Law" Law"
        [7]=>
        string(14) "discord#124123"
        [8]=>
        string(18) "100% A real person"
    }
    

    If you also want to allow a full stop (for example, when the tags may be email addresses or domain names), add \. inside the character class, just before the closing ].
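    As a rough sketch of that variation: the snippet below wraps the same steps in a helper and uses a character class that also permits a full stop. The function name filter_tags and the shortened sample string are made up for illustration and are not part of the answer above.

```php
<?php
// Hypothetical helper (name chosen for this example only): returns the tags
// that consist solely of whitelisted characters.
function filter_tags(string $raw, string $regex): array
{
    // Split on commas and strip the surrounding whitespace from each tag.
    $tags = array_map('trim', explode(',', $raw));

    // $regex is a negated character class, so preg_match() returns 1
    // as soon as a tag contains a character outside the whitelist.
    $kept = array_filter($tags, function ($tag) use ($regex) {
        return !preg_match($regex, $tag);
    });

    // array_filter() keeps the original keys; reindex from 0.
    return array_values($kept);
}

// Shortened sample input, assumed for this example.
$raw = 'Adam-Smith, Müller, johnny@knoxville, shouldadd.com';

// Same whitelist as above, with \. added before the closing ].
$regex = '/[^A-Za-z0-9\s\-\(\)\"\+\&\%\@\#\.]/';

$result = filter_tags($raw, $regex);
print_r($result);
```

    With the full stop allowed, "shouldadd.com" should pass the filter, while "Müller" is still rejected because ü is outside the whitelist.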