php regex sanitization garbage

PHP regular expression to filter out junk


So I have an interesting problem: I have a string, and for the most part I know what to expect:

http://www.someurl.com/st=????????

Here each ? stands for an upper-case letter or a digit. The problem is that the string has garbage mixed in: it is broken up into 5 or 6 pieces, and in between there's lots of junk, including unprintable characters, foreign characters, and plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ

Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.

The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab characters going forward as many as I need to fill up the 8 character quota. Is there a regex way to do this or will I need to roll up my sleeves and go nested-loop style?
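A minimal sketch of that fallback heuristic in PHP, to pin down what is being described. The function name `extractCode` and the sample inputs are made up for illustration. It assumes the `st=` marker itself survives intact and that the garbage between pieces contains no upper-case letters or digits, which cannot actually be guaranteed here.

```php
<?php
// Hypothetical helper: recover the $length-character code from a damaged string.
function extractCode(string $input, int $length = 8): ?string
{
    // Step 1: take the run of upper-case letters/digits at the very end.
    // The D modifier makes $ match only at the true end of the string.
    preg_match('/[A-Z0-9]*$/D', $input, $m);
    $tail = $m[0];
    if (strlen($tail) >= $length) {
        return substr($tail, -$length);
    }

    // Step 2: too short, so find "st=" and fill the quota going forward.
    $pos = strpos($input, 'st=');
    if ($pos === false) {
        return null; // the marker itself was damaged
    }
    $start  = $pos + 3;
    $middle = substr($input, $start, strlen($input) - strlen($tail) - $start);

    // Drop everything that cannot be part of the code, keep only what is needed.
    $valid = preg_replace('/[^A-Z0-9]/', '', $middle);
    $head  = substr($valid, 0, $length - strlen($tail));

    $code = $head . $tail;
    return strlen($code) === $length ? $code : null;
}

// Made-up inputs: \xFE and \xEE stand in for junk bytes.
var_dump(extractCode("http://www.someurl.com/st=ABCD1234"));         // clean tail
var_dump(extractCode("http://www.someurl.com/st=AB12\xFE\xEECD34")); // split code
```

If the junk happens to contain upper-case letters or digits, step 2 will pick them up as part of the code, so this only narrows the failure cases rather than removing them.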

update:

To clear up some confusion, I get an input string that's like this:

[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????

The garbage turns up at unpredictable locations in the string (although the very end is never garbage) and has unpredictable length; at least, I haven't been able to find a pattern in either. Usually the ?s are all together, which is why I just grab the last 8 chars, but sometimes they aren't, which results in missing data and garbage in the result.


Solution

  • $var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case
    

    $clean = join(
        array_filter(
            str_split($var, 1),
            function ($char) {
                return (
                    array_key_exists(
                        $char,
                        array_flip(array_merge(
                            range('A','Z'),
                            range('a','z'),
                            range((string)'0',(string)'9'),
                            array(':','.','/','-','_')
                        ))
                    )
                );
            }
        )
    );
    

    Hah, that was a joke. Here's a regex for you:

    $clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
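As a quick check, here is a sketch of that regex applied to the test case above, plus a variant for URLs like the one in the question. The wrapper name `stripJunk` and the second input are made up for illustration. The variant adds `=` to the allowed set, since the question's URLs contain `st=` and the original character class would strip it.

```php
<?php
// The answer's regex, applied to the answer's own test case.
$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html';
$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/', '', $var);
echo $clean, "\n"; // prints: http://www.example-website.com/joy_here.html

// Hypothetical variant that also keeps '=' so that "st=" survives.
function stripJunk(string $s): string
{
    return preg_replace('/[^A-Za-z0-9:.\/_=-]/', '', $s);
}

// Made-up input: \xFE, \xEE and \xDD stand in for junk bytes.
echo stripJunk("http:/\xFE/www.some\xEEurl.com/st=\xDDAB12CD34"), "\n";
// prints: http://www.someurl.com/st=AB12CD34
```

Either way, this only removes characters outside the allowed set. Junk made of ordinary letters, digits or allowed punctuation passes straight through, so it doesn't cover the "plain old normal characters" case from the question.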