php dom html-parsing domdocument php-parser

How can I remove all tags except an allowed list from HTML parsed by PHP?


I am parsing HTML in PHP. Since I have no control over the original content, I want to strip it of styling and unnecessary tags while still keeping the content and a short list of tags, namely:

p, img, iframe (and maybe a couple of others)

I know I can remove a given tag (see the code I am using for this below), but I don't necessarily know which tags there could be, and I don't want to maintain a huge list of possibilities. I would like to strip everything except my allowed list.

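// Remove an element but keep its children by moving them up one level.
function DOMRemove(DOMNode $from) {
    // A while loop (instead of do/while) also handles elements with no children.
    while ($from->firstChild !== null) {
        $from->parentNode->insertBefore($from->firstChild, $from);
    }

    $from->parentNode->removeChild($from);
}

$dom = new DOMDocument;
$dom->loadHTML($html);

$nodes = $dom->getElementsByTagName('span');

// The node list is live, so copy it to an array before removing nodes.
foreach (iterator_to_array($nodes) as $node) {
    DOMRemove($node);
}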

Solution

  • As cpattersonv1 mentioned above, you can simply use strip_tags() for the job.

    <?php
    
    // strip all other tags except mentioned (p, img, iframe)
    $html_result = strip_tags($html, '<p><img><iframe>');
    
    ?>
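
One caveat: strip_tags() leaves the attributes of the allowed tags untouched, so inline style and event-handler attributes on a kept p or img survive. If the styling has to go as well, a DOM-based whitelist is one option. The following is only a rough sketch, not a hardened sanitizer: the function name stripToAllowedTags, the list of allowed attributes, and the choice to drop script and style elements together with their content are all assumptions made for illustration.

```php
<?php

// Hypothetical helper (not part of the original answer): keeps only whitelisted
// elements and attributes; any other element is unwrapped so its content stays.
function stripToAllowedTags(
    string $html,
    array $allowedTags = ['p', 'img', 'iframe'],
    array $allowedAttrs = ['src', 'alt', 'width', 'height'], // assumed list
    array $dropWithContent = ['script', 'style']             // assumed list
): string {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on messy HTML
    // The meta tag is a common way to make loadHTML() read the input as UTF-8.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // getElementsByTagName() returns a live list, so copy it before mutating.
    foreach (iterator_to_array($body->getElementsByTagName('*'), false) as $el) {
        if ($el->parentNode === null) {
            continue;
        }
        $name = strtolower($el->nodeName);

        if (in_array($name, $dropWithContent, true)) {
            // Remove the element and everything inside it.
            $el->parentNode->removeChild($el);
        } elseif (!in_array($name, $allowedTags, true)) {
            // Unwrap: move the children up one level, then drop the empty element.
            while ($el->firstChild !== null) {
                $el->parentNode->insertBefore($el->firstChild, $el);
            }
            $el->parentNode->removeChild($el);
        } else {
            // Allowed element: strip every attribute that is not whitelisted.
            foreach (iterator_to_array($el->attributes, false) as $attr) {
                if (!in_array(strtolower($attr->nodeName), $allowedAttrs, true)) {
                    $el->removeAttribute($attr->nodeName);
                }
            }
        }
    }

    // Serialize only the children of <body>, not the implied html/body wrappers.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Calling stripToAllowedTags($html) with the defaults should keep p, img and iframe elements while dropping wrappers such as div or span and attributes such as style or onclick. If the content is untrusted and will be shown to other users, a dedicated sanitizing library is probably a safer choice than either approach.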