I'm scraping this page:
http://kat.ph/search/example/?field=seeders&sorder=desc
Like this:
...
curl_setopt( $curl, CURLOPT_URL, $url );
$header = array (
'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding:gzip,deflate,sdch',
'Accept-Language:en-US,en;q=0.8',
'Cache-Control:max-age=0',
'Connection:keep-alive',
'Host:kat.ph',
'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
);
curl_setopt( $curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19');
curl_setopt( $curl, CURLOPT_HTTPHEADER, $header );
curl_setopt( $curl, CURLOPT_REFERER, 'http://kat.ph' );
curl_setopt( $curl, CURLOPT_ENCODING, 'gzip,deflate,sdch' );
curl_setopt( $curl, CURLOPT_AUTOREFERER, true );
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt( $curl, CURLOPT_TIMEOUT, 10 );
$html = curl_exec( $curl );
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
@$dom->loadHTML( $html );
(I had to mimic a browser for this to work, hence the cURL headers.)
But I still get DOMNodes of type #text that consist of nothing but whitespace characters. Any ideas why this is happening and how to avoid it?
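For illustration, this is roughly how the problem shows up when walking the parsed tree (the 'tr' tag name is just an example, not necessarily what I actually query):

// Walk the rows and inspect their children after loadHTML()
foreach ($dom->getElementsByTagName('tr') as $row) {
    foreach ($row->childNodes as $node) {
        // Despite preserveWhiteSpace = FALSE, whitespace-only #text
        // nodes still show up between the element children.
        if ($node->nodeType === XML_TEXT_NODE && trim($node->nodeValue) === '') {
            var_dump($node->nodeName); // "#text"
        }
    }
}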
It looks like the preserveWhiteSpace
property simply sets the libxml2 XML_PARSE_NOBLANKS
flag, which is not always reliable, as this thread suggests. Specifically, when parsing without a DTD, as in this case, the parser keeps whitespace-only text nodes under some circumstances (mainly when they are siblings of non-text elements).
The thread may be a bit dated, but the behavior still exists as described.
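Since the flag can't be relied on here, one option is a manual cleanup pass after loading. A minimal sketch (untested, assuming $dom is the document from your loadHTML() call):

// Strip whitespace-only text nodes that the parser kept anyway
$xpath = new DOMXPath($dom);
// Select text nodes whose content is nothing but whitespace
$blanks = $xpath->query('//text()[normalize-space(.) = ""]');
// Copy the node list to an array first so removals don't interfere with iteration
foreach (iterator_to_array($blanks) as $node) {
    $node->parentNode->removeChild($node);
}

Alternatively, just skip such nodes while iterating (check nodeType and trim(nodeValue)) instead of removing them from the tree.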