php html strip-tags

PHP prevent strip_tags from removing broken tags


I have the same situation as this guy.

Basically, strip_tags removes tags, including what the documentation calls broken tags. Is there another way of doing this that doesn't remove a < and all the text after it when the < isn't part of an HTML tag?

I'm currently doing this:

$description = "&lt;p&gt;I am currently &lt;30 years old.&lt;/p&gt;";
$body = strip_tags(html_entity_decode($description, ENT_QUOTES, "UTF-8"), "<strong><em><u>");
echo $body;

But the code above will break something like:

<p>I am currently <30 years old.</p>

Into:

I am currently

Here's an eval.in so you can see what I mean.


Solution

  • The HTML you have as input is invalid: the < in <30 is neither part of a tag nor escaped as &lt;, so that needs fixing first. You could replace every unclosed < with &lt;, and then call html_entity_decode after strip_tags instead of before it:

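    $description = "<p>I am currently <30 years old.</p>";
    // Escape every "<" that reaches another "<" or the end of the string
    // before any ">", i.e. a "<" that does not open a complete tag.
    $description = preg_replace('/<([^>]*?(?=<|$))/', '&lt;$1', $description);
    // Strip the real tags first, then turn &lt; back into a literal "<".
    $body = html_entity_decode(strip_tags($description, "<strong><em><u>"),
                               ENT_NOQUOTES, "UTF-8");
    echo $body;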
    

    See it on paiza.io

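    Alternatively, you could use a DOM parser, which in some cases could give better results, but you'll still need to apply the fix first. Note that textContent returns plain text only, so this variant also drops the <strong>, <em> and <u> tags that the strip_tags call keeps: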

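    $description = "<p>I am currently <30 years old.</p>";
    // Same fix as above: escape any "<" that does not open a complete tag.
    $description = preg_replace('/<([^>]*?(?=<|$))/', '&lt;$1', $description);
    $doc = new DOMDocument();
    $doc->loadHTML($description);
    // textContent concatenates all text nodes and decodes &lt; back to "<".
    $body = $doc->documentElement->textContent;
    echo $body;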
    

    See it on paiza.io
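
If you need this in more than one place, the regex approach above could be wrapped in a small helper. The following is only a sketch: the function names escape_stray_lt and strip_tags_keep_text are made up for illustration and are not part of PHP or of the original answer, and the default allowed-tag list is simply the one from the question.

```php
<?php
// Hypothetical helper: escapes any "<" that reaches another "<" or the end
// of the string before a ">" shows up, so strip_tags() treats it as text.
function escape_stray_lt(string $html): string
{
    return preg_replace('/<([^>]*?(?=<|$))/', '&lt;$1', $html);
}

// Hypothetical wrapper: fix stray "<", strip tags, then decode entities.
function strip_tags_keep_text(string $html, string $allowed = '<strong><em><u>'): string
{
    $stripped = strip_tags(escape_stray_lt($html), $allowed);
    // ENT_NOQUOTES leaves &quot; and &#039; encoded, as in the answer above.
    return html_entity_decode($stripped, ENT_NOQUOTES, 'UTF-8');
}

echo strip_tags_keep_text('<p>I am currently <30 years old.</p>'), "\n";
// I am currently <30 years old.
echo strip_tags_keep_text('<p>Use <strong>bold</strong> if a <b</p>'), "\n";
// Use <strong>bold</strong> if a <b
```

One limitation to be aware of: the pattern only catches a < that is never followed by a > before the next < or the end of the string. Text such as 1 <2 and 3> 2 is left untouched by the pattern, so strip_tags may still treat <2 and 3> as a tag.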