Tags: php, html, html-entities, htmlspecialchars, strip-tags

Replace all but certain HTML tags with htmlspecialchars() in PHP?


I would like to process my user input so that only certain HTML tags are allowed, and every other tag, as well as any special character that is not part of a tag, is replaced by its HTML entity. For example, if I only wanted to allow the <b> and <a> tags, then

allow_only("This is <b>bold</b> and this is <i>italic</i>.
            Moreover 2<3 and <a href='google.com'>this is a link</a>.","<b><a>");

should produce

This is <b>bold</b> and this is &lt;i&gt;italic&lt;/i&gt;.
Moreover 2&lt;3 and <a href='google.com'>this is a link</a>.

How can I do this in PHP? I am aware of strip_tags(), which removes the unwanted tags completely, and of htmlspecialchars(), which replaces all tags with their HTML entities, but I know of no function that escapes only specific tags.

And if there is no 'common' way to do this, how should I in general go about processing user input that can contain valid regular HTML, but can also contain < signs and potentially dangerous HTML code?


Solution

  • Apply htmlspecialchars() and then, for a given array of tags, replace the encoded tags with the literal tags again

    function allow_only($str, $allowed){
        $str = htmlspecialchars($str);
        foreach( $allowed as $a ){
            $str = str_replace("&lt;".$a."&gt;", "<".$a.">", $str);
            $str = str_replace("&lt;/".$a."&gt;", "</".$a.">", $str);
        }
        return $str;
    }
    echo allow_only("This is <b>bold</b> and this is <i>italic</i>.", array("b"));
    

    That works for simple tags, returning "This is <b>bold</b> and this is &lt;i&gt;italic&lt;/i&gt;."

    As it was pointed out, that doesn't work for tags with attributes, but this does:

    function fix_attributes($match){
        // TODO: study $match[2] in depth and avoid banned attributes
        // eg: those that begin with on, or href that begins with javascript:
        // to avoid some potential hacks
        return "<".$match[1].str_replace('&quot;','"',$match[2]).">";
    }
    function allow_only($str, $allowed){
        $str = htmlspecialchars($str);
        foreach( $allowed as $a ){
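            // The callback name must be a string; \b keeps "b" from also matching <body>, <button>, etc.
            $str = preg_replace_callback("/&lt;(".preg_quote($a, "/").")\b([\s\/\.\w=&;:#]*?)&gt;/", 'fix_attributes', $str);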
            $str = str_replace("&lt;/".$a."&gt;", "</".$a.">", $str);
        }
        return $str;
    }
    echo allow_only('This is <b>bold</b> and <a href="http://www.#links">this</a> is <i>italic</i>.', array("b","a"));
    

    This handles more complex tags with certain attributes. Only the characters listed between [] may appear in the attributes. Unfortunately &quot; must be allowed within attributes or it won't work, and with it all other entities are allowed too; however, only &quot; inside attributes is decoded.
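
    The TODO in fix_attributes() could be filled in roughly as follows. This is only a sketch, not a vetted sanitizer: filter_attributes() and allow_only_filtered() are hypothetical names, the per-tag attribute whitelist and the accepted href schemes are assumptions, and ENT_COMPAT is passed explicitly so that only double-quoted attributes are recognised regardless of PHP version. Anything that is not a whitelisted name="value" pair is dropped.

    ```php
    <?php
    // Rebuild an opening tag from the tag name and the still-encoded attribute
    // text captured by the regex, e.g. ' href=&quot;http://example.com/&quot;'.
    // $allowedAttrs maps tag name => list of permitted attribute names (assumed shape).
    function filter_attributes($tag, $encoded, array $allowedAttrs) {
        $allowed = $allowedAttrs[$tag] ?? array();
        $out = '';
        // Only name=&quot;value&quot; pairs are recognised; everything else is dropped.
        preg_match_all('/([\w-]+)\s*=\s*&quot;(.*?)&quot;/s', $encoded, $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $name  = strtolower($pair[1]);
            $value = $pair[2];
            if (!in_array($name, $allowed, true)) {
                continue; // not whitelisted: covers onclick, onload, style, ...
            }
            // Accept http(s), mailto, or a value without any colon (relative URL).
            if ($name === 'href' && !preg_match('#^(?:https?://|mailto:|[^:]*$)#i', $value)) {
                continue; // rejects javascript: and other schemes
            }
            // $value is still entity-encoded, so it is safe inside double quotes.
            $out .= ' ' . $name . '="' . $value . '"';
        }
        return '<' . $tag . $out . '>';
    }

    function allow_only_filtered($str, array $allowedAttrs) {
        $str = htmlspecialchars($str, ENT_COMPAT, 'UTF-8');
        foreach (array_keys($allowedAttrs) as $tag) {
            $str = preg_replace_callback(
                // \b keeps "b" from also matching <body>, <button>, ...
                '/&lt;(' . preg_quote($tag, '/') . ')\b([\s\/\.\w=&;:#-]*?)&gt;/',
                function ($m) use ($allowedAttrs) {
                    return filter_attributes($m[1], $m[2], $allowedAttrs);
                },
                $str
            );
            $str = str_replace('&lt;/' . $tag . '&gt;', '</' . $tag . '>', $str);
        }
        return $str;
    }

    // Example call: keep <b> without attributes and <a> with href only.
    echo allow_only_filtered(
        '<a href="http://example.com/" onclick="x">t</a> <i>z</i>',
        array('a' => array('href'), 'b' => array())
    );
    ```

    A whitelist is used here instead of a list of banned attributes because it fails closed: an attribute nobody thought of is dropped rather than let through.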

    As was suggested, a much better (safer, cleaner) way to solve problems like this is to use a library such as HTML Purifier: http://htmlpurifier.org/demo.php
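
    With HTML Purifier, a configuration along these lines should come close to the behaviour asked for in the question. It is a sketch under assumptions: the library must already be installed and autoloadable (for example through Composer's ezyang/htmlpurifier package), purify_allow_only() is a hypothetical wrapper name, and the whitelist 'b,a[href]' is just the example from the question.

    ```php
    <?php
    // Assumes HTML Purifier has already been loaded, e.g. with
    // require_once 'vendor/autoload.php'; in a Composer project.

    // Hypothetical wrapper, not part of the library.
    function purify_allow_only($dirty) {
        $config = HTMLPurifier_Config::createDefault();
        // Whitelist: <b> without attributes, <a> with href only.
        $config->set('HTML.Allowed', 'b,a[href]');
        // Write disallowed tags back as escaped text instead of silently dropping them.
        $config->set('Core.EscapeInvalidTags', true);
        $purifier = new HTMLPurifier($config);
        return $purifier->purify($dirty);
    }
    ```

    Calling purify_allow_only($userInput) then returns markup that is safe to echo. Unlike the regex approach, the library parses the input and validates attribute values itself.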