php utf-8 domxpath

Problem with encoding after using DOMXpath


I scrape a web page (using curl) and try to retrieve its JSON-LD content.

So first I get the content of the page:

  $handle = curl_init();
  curl_setopt($handle, CURLOPT_URL, $url);
  curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);

  $page = curl_exec($handle);
  curl_close($handle);

and it works ok.

I checked the content of $page in a hex editor and the page is correctly encoded as UTF-8. For example, the characters "ół" are encoded as "C3 B3 C5 82", which is correct.

The problem starts when I query for the JSON-LD scripts:

  $dom = new DOMDocument();
  @$dom->loadHTML($page);
  $xpath = new DOMXpath($dom);
  $jsonScripts = $xpath->query( '//script[@type="application/ld+json"]' );

and then

      foreach ($jsonScripts as $jScript)
      {
          $json = $jScript->nodeValue;      // text content of the <script> element
          $data = json_decode($json, true); // decode to an associative array
      }

suddenly the same characters, now in $json, are encoded as "C3 83 C2 B3 C3 85 C2 82".

What just happened?
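
For reference, that byte sequence is what you get when the UTF-8 bytes are read as ISO-8859-1 and then re-encoded as UTF-8. A minimal sketch that reproduces it, assuming the mbstring extension is available (the variable names are only illustrative):

```php
<?php
// UTF-8 bytes for the two characters from the question: C3 B3 C5 82
$original = "\xC3\xB3\xC5\x82";

// Treat every byte as one ISO-8859-1 character and re-encode it as UTF-8.
// This mimics a parser that wrongly assumes the input is Latin-1.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo strtoupper(bin2hex($original)), "\n";      // C3B3C582
echo strtoupper(bin2hex($doubleEncoded)), "\n"; // C383C2B3C385C282
```

The second line of output matches the bytes seen after the DOMXpath query, which suggests the encoding is lost in loadHTML() rather than in curl.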


Solution

  • SOLVED

    The problem was in the scraped page. The character set was defined as

    <meta charset=UTF-8>
    

    not

    <meta charset="UTF-8">
    

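    Because of this, the encoding was apparently not detected and the UTF-8 bytes were read as ISO-8859-1. The workaround was to prepend an XML encoding declaration to the input: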

      @$dom->loadHTML('<?xml encoding="utf-8" ?>'.$page);
    

    Thank you @ChrisHaas!
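
Putting the pieces together, a minimal end-to-end sketch of the workaround might look like the following. The sample $page string and the "name" key are hypothetical and stand in for the real curl response; libxml_use_internal_errors() is used here in place of the @ operator, which is a swapped-in choice and not part of the original code.

```php
<?php
// Hypothetical sample page; in the real script this is the string
// returned by curl_exec(). The JSON contains the UTF-8 bytes C3 B3 C5 82.
$page = '<html><head><meta charset=UTF-8></head><body>'
      . '<script type="application/ld+json">{"name":"' . "\xC3\xB3\xC5\x82" . '"}</script>'
      . '</body></html>';

$dom = new DOMDocument();

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);

// The prepended declaration tells the parser to read the input as UTF-8,
// regardless of how the page itself declares its charset.
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $page);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$names = [];
foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
    $data = json_decode($script->nodeValue, true);
    if (is_array($data) && isset($data['name'])) {
        $names[] = $data['name'];
    }
}

// Hex dump of the first extracted value, for comparison with the source bytes.
echo bin2hex($names[0] ?? ''), "\n";
```

With the declaration in place, the extracted value should contain the original bytes C3 B3 C5 82 instead of the double-encoded sequence.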