PHP: parse HTML tables and make them valid HTML5 tables


I'm looking for the best way to clean up old HTML tables with PHP so that they are valid HTML5 tables. It's mostly a matter of stripping disallowed attributes. In addition, I'd also like to strip inline styles from these tables. It would be great if both could be accomplished in one go.

I have mostly been researching regular expressions, but after reading that they are not recommended for this kind of task, I am looking for something else that would help.


Solution

  • A quick example of how you could use DOMDocument to strip attributes. You could extend it to add attributes as well, but that is another matter.

    $strhtml="
    <table width='100%' cellpadding='10px' cellspacing='5px' border='2px'>
        <tr>
            <td align='left' valign='top'>banana</td>
        </tr>
    </table>";
    
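    /* presentational attributes that are obsolete in HTML5, plus inline styles */
    $remove=array('width','cellpadding','cellspacing','border','align','valign','style');
    
    /* parse the markup; loadHTML() wraps it in a full html/body document */
    $dom=new DOMDocument;
    $dom->loadHTML( $strhtml );
    
    /* visit every element and drop any attribute named in $remove */
    $elements=$dom->getElementsByTagName('*');
    foreach( $elements as $node ){
        foreach( $remove as $attrib ){
            if( $node->hasAttribute( $attrib ) ){
                $node->removeAttribute( $attrib );
            }
        }
    }
    
    /* debug output: the whole document, including doctype, html and body */
    echo '<textarea cols=100 rows=10>',$dom->saveHTML(),'</textarea>';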
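
If you want attributes and inline styles gone in one pass, and only the table markup back rather than a whole document, the same idea can be wrapped in a function. This is a rough sketch, not a definitive implementation: `cleanTableHtml()` and the contents of `$remove` are names and values made up for illustration, the attribute list is not exhaustive, and character-encoding handling is left out.

```php
<?php
// Hypothetical helper (not part of the example above): strips the given
// attributes from every <table> and its descendants, then returns only
// the cleaned fragment instead of a full HTML document.
function cleanTableHtml(string $html, array $remove): string
{
    if (trim($html) === '') {
        return '';
    }

    // Collect parser warnings internally instead of emitting them;
    // old markup is often malformed.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select each table plus every element nested inside it.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//table/descendant-or-self::*') as $node) {
        foreach ($remove as $attrib) {
            if ($node->hasAttribute($attrib)) {
                $node->removeAttribute($attrib);
            }
        }
    }

    // loadHTML() wraps the input in html/body elements; serialize only
    // the children of <body> to get the fragment back.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Assumed, non-exhaustive list; 'style' takes care of inline styles.
$remove = ['style', 'width', 'height', 'bgcolor', 'cellpadding',
           'cellspacing', 'border', 'align', 'valign'];

$legacy = "<table width='100%' style='color:red' cellpadding='10'>"
        . "<tr><td align='left' style='padding:2px' class='keep'>banana</td></tr></table>";

echo cleanTableHtml($legacy, $remove);
```

Attributes that are not in `$remove`, such as `class`, are left alone, and elements outside a table are not touched because the XPath query only matches tables and their descendants.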