Tags: php, parsing, dom, simplexml, simple-html-dom

Cleaning HTML code from another website using PHP


I want to get some data from this website, but as you can see in their HTML there is some odd markup, such as <TABLE BORDER=0 CELLSPACING=1 CELLPADDING=3 WIDTH=100%> with unquoted attribute values, among other things. As a result I get errors when I try to parse the table with SimpleXmlElement, which I have been using for a while and which works perfectly on some websites. I'm doing something like:

$html = file_get_html('https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera');
$table = $html->find('table', 4);

$xml = new SimpleXmlElement($table);

I get a bunch of errors, so is there a way of cleaning the code before sending it to SimpleXmlElement, or perhaps of using another kind of DOM class? What do you guys recommend?


Solution

  • The problem with your HTML is that the attribute values are not wrapped in quotes: unquoted attributes are allowed in HTML, but not in XML.

    If you don't care about the attributes, you can keep using Simple HTML DOM and strip them; otherwise you have to switch to a different HTML parser.
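
    To see the difference concretely, here is a minimal sketch that feeds the same table to SimpleXMLElement with and without quoted attributes. The two fragments and the parsesAsXml() helper are made up for illustration; they are not taken from the site.

    ```php
    <?php
    // Hypothetical fragments: same table, attribute values unquoted vs. quoted.
    $unquoted = '<TABLE BORDER=0 CELLSPACING=1><TR><TD>1</TD></TR></TABLE>';
    $quoted   = '<TABLE BORDER="0" CELLSPACING="1"><TR><TD>1</TD></TR></TABLE>';

    // Collect libxml errors internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);

    // Hypothetical helper: true if the markup parses as well-formed XML.
    function parsesAsXml(string $markup): bool
    {
        try {
            new SimpleXMLElement($markup); // throws Exception on malformed XML
            $ok = true;
        } catch (Exception $e) {
            $ok = false;
        }
        libxml_clear_errors();
        return $ok;
    }

    var_dump(parsesAsXml($unquoted)); // bool(false)
    var_dump(parsesAsXml($quoted));   // bool(true)
    ```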

    Cleaning attributes with Simple HTML DOM:

    Start by creating a function that clears all attributes of a node:

    function clearAttributes( $node )
    {
        // Setting an attribute to null drops it from the node's serialized output.
        foreach( $node->getAllAttributes() as $key => $val )
        {
            $node->$key = null;
        }
    }
    

    Then apply the function to your <table>, <tr> and <td> nodes:

    clearAttributes( $table );
    
    foreach( $table->find('tr') as $tr )
    {
        clearAttributes( $tr );
    
        foreach( $tr->find( 'td' ) as $td )
        {
            clearAttributes( $td );
        }
    
    }
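
    If you don't have Simple HTML DOM at hand, a similar attribute-stripping pass can be sketched with the built-in DOMDocument instead. This is a different technique from the one above, shown under assumptions: the fragment and the stripAttributes() helper are hypothetical.

    ```php
    <?php
    // Hypothetical fragment with unquoted attributes, similar to the site's markup.
    $html = '<TABLE BORDER=0 WIDTH=100%><TR BGCOLOR=#505050><TD CLASS=x>Rat</TD><TD>3</TD></TR></TABLE>';

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Hypothetical helper: remove every attribute from one element.
    function stripAttributes(DOMElement $element): void
    {
        // Copy the names first; the attribute map is live and changes while removing.
        $names = [];
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            $element->removeAttribute($name);
        }
    }

    $table = $dom->getElementsByTagName('table')->item(0);
    stripAttributes($table);
    foreach ($table->getElementsByTagName('*') as $descendant) {
        stripAttributes($descendant); // covers the nested <tr> and <td> elements
    }

    echo $dom->saveXML($table), PHP_EOL;
    ```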
    

    Last but not least: the site's HTML contains a lot of encoded characters. If you don't want to see a lot of <td>1&#xA0;</td><td>0&#xA0;</td> in your XML, you have to prepend a UTF-8 XML declaration to the string before importing it into a SimpleXml object:

    $xml = '<?xml version="1.0" encoding="utf-8" ?>'.html_entity_decode( $table );
    $xml = new SimpleXmlElement( $xml );
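
    The effect of the declaration can be checked on a small string. This is only a sketch: the fragment is hypothetical, and the flags and charset passed to html_entity_decode() are my own explicit choices. Keep in mind that html_entity_decode() also turns &amp; into a bare &, which XML rejects, so it only suits markup without such entities.

    ```php
    <?php
    // Hypothetical fragment standing in for the cleaned table markup.
    $fragment = '<table><tr><td>1&nbsp;</td><td>0&#160;</td></tr></table>';

    // Decode entities into real UTF-8 characters.
    $decoded = html_entity_decode($fragment, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Same markup, serialized back without and with the UTF-8 declaration.
    $without = (new SimpleXMLElement($decoded))->asXML();
    $with    = (new SimpleXMLElement('<?xml version="1.0" encoding="utf-8" ?>' . $decoded))->asXML();

    var_dump(strpos($without, '&#xA0;') !== false); // bool(true): escaped on output
    var_dump(strpos($with, '&#xA0;') !== false);    // bool(false): kept as a UTF-8 character
    ```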
    

    Preserving attributes with DOMDocument:

    The built-in DOMDocument class is more powerful and less memory-hungry than Simple HTML DOM. In this case it also normalizes the original HTML for you, quoting the attribute values when it serializes a node. Despite appearances, it is simple to use.

    First, create a DOMDocument object, enable libxml_use_internal_errors (to suppress the many warnings raised by malformed HTML), and load your URL:

    $dom = new DOMDocument();
    libxml_use_internal_errors( true );   // collect parse warnings instead of printing them
    $dom->loadHTMLfile( 'https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera' );
    $dom->formatOutput = true;
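
    With internal errors enabled, the parser messages are buffered rather than lost, so you can still inspect them. A minimal sketch, using a hypothetical inline string in place of the remote page:

    ```php
    <?php
    // Hypothetical markup with unquoted attributes and a made-up, non-HTML tag.
    $html = '<TABLE BORDER=0 WIDTH=100%><TR><TD>Rat</TD><TD>3</TD></TR></TABLE><foo>';

    $previous = libxml_use_internal_errors(true); // buffer errors, no PHP warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Print whatever the parser complained about (possibly nothing).
    foreach (libxml_get_errors() as $error) {
        echo 'line ', $error->line, ': ', trim($error->message), PHP_EOL;
    }

    libxml_clear_errors();                  // free the error buffer
    libxml_use_internal_errors($previous);  // restore the previous setting

    echo $dom->getElementsByTagName('td')->length, PHP_EOL;
    ```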
    

    Then retrieve the desired <table>:

    $table = $dom->getElementsByTagName( 'table' )->item(4);
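
    Note that item() is zero-based and returns null when the index is out of range, so it is worth checking the result if the page layout might change. A small sketch with a hypothetical two-table document:

    ```php
    <?php
    // Hypothetical document containing only two tables.
    $html = '<table><tr><td>first</td></tr></table><table><tr><td>second</td></tr></table>';

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $tables  = $dom->getElementsByTagName('table');
    $second  = $tables->item(1); // index 1 is the second table
    $missing = $tables->item(4); // there is no fifth table here

    var_dump($tables->length);            // int(2)
    var_dump(trim($second->textContent)); // string(6) "second"
    var_dump($missing === null);          // bool(true)
    ```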
    

    And, as in the Simple HTML DOM example, you have to prepend a UTF-8 declaration to avoid weird characters:

    $xml = '<?xml version="1.0" encoding="utf-8" ?>'.$dom->saveHTML( $table );
    $xml = new SimpleXmlElement( $xml );
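
    One caveat: saveHTML() serializes by HTML rules, so a void element such as <br> is left unclosed and SimpleXmlElement would reject the string. If the table you target contains such elements, saveXML() or simplexml_import_dom() may be a safer route. A sketch under that assumption, with a hypothetical fragment:

    ```php
    <?php
    // Hypothetical fragment; note the void <BR> element inside a cell.
    $html = '<TABLE BORDER=0 WIDTH=100%><TR><TD>Rat<BR>King</TD><TD>3</TD></TR></TABLE>';

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $table = $dom->getElementsByTagName('table')->item(0);

    $asHtml = $dom->saveHTML($table); // HTML rules: <br> stays unclosed
    $asXml  = $dom->saveXML($table);  // XML rules: <br/> is self-closed

    // Importing the node directly skips the string round trip altogether.
    $xml = simplexml_import_dom($table);

    var_dump(strpos($asHtml, '<br>') !== false); // bool(true)
    var_dump(strpos($asXml, '<br/>') !== false); // bool(true)
    var_dump((string) $xml['width']);            // string(4) "100%"
    ```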
    

    As you can see, the DOMDocument syntax for retrieving a node as HTML differs from Simple HTML DOM's: you always call the method on the main document object and pass the node to print as an argument:

    echo $dom->saveHTML();          // print entire HTML document
    echo $dom->saveHTML( $node );   // print node $node
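
    As a quick check of the two call forms, the sketch below (with a hypothetical one-cell table) compares what each returns: the whole document comes wrapped in the <html> and <body> elements the parser added, while the node form returns only that node's markup.

    ```php
    <?php
    // Hypothetical one-cell table.
    $dom = new DOMDocument();
    $dom->loadHTML('<table><tr><td>Rat</td></tr></table>');

    $node  = $dom->getElementsByTagName('td')->item(0);
    $whole = $dom->saveHTML();        // entire document
    $part  = $dom->saveHTML($node);   // only the <td> node

    var_dump(strpos($whole, '<body>') !== false); // bool(true)
    var_dump(strpos($part, '<body>') !== false);  // bool(false)
    var_dump(strpos($part, '<td>Rat</td>') !== false); // bool(true)
    ```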
    

    Edit: removing &nbsp; with DOMDocument:

    To remove unwanted &#160; from the HTML, you can download the HTML first and run str_replace on it before loading it.

    Change this line:

    $dom->loadHTMLfile( 'https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera' );
    

    to this:

    $data = file_get_contents( 'https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera' );
    $data = str_replace( '&#160;', '', $data );
    $dom->loadHTML( $data );
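
    The same replacement can be tried offline on a small string. In this sketch the inline $data is a hypothetical stand-in for what file_get_contents() would return, and handling &nbsp; as well is my assumption, in case the page spells the character that way too:

    ```php
    <?php
    // Hypothetical stand-in for the downloaded page source.
    $data = '<table><tr><td>1&#160;</td><td>0&nbsp;</td></tr></table>';

    // Assumption: the page may write the non-breaking space either way.
    $data = str_replace(['&#160;', '&nbsp;'], '', $data);

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($data);
    libxml_clear_errors();

    $cells = [];
    foreach ($dom->getElementsByTagName('td') as $td) {
        $cells[] = $td->textContent;
    }
    var_dump($cells === ['1', '0']); // bool(true)
    ```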