I scrape a web page (using cURL) and try to retrieve its JSON-LD content.
First I fetch the content of the page:
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, $url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
$page = curl_exec($handle);
curl_close($handle);
and that works fine.
I checked the contents of $page in a hex editor and the page is correctly encoded as UTF-8. For example, the characters "ół" are encoded as "C3 B3 C5 82", which is correct.
The problem starts when I query for ld-json scripts:
$dom = new DOMDocument();
@$dom->loadHTML($page);
$xpath = new DOMXpath($dom);
$jsonScripts = $xpath->query( '//script[@type="application/ld+json"]' );
and then
foreach ($jsonScripts as $jScript)
{
    // Text content of the <script> element, i.e. the raw JSON-LD string
    $json = $jScript->nodeValue;
    $data = json_decode($json, true);
}
Suddenly the same characters are encoded as "C3 83 C2 B3 C3 85 C2 82".
What just happened?
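For reference, those eight bytes are consistent with double encoding: each byte of the original UTF-8 sequence is read as a single ISO-8859-1 character and then written out again as UTF-8. A minimal sketch that reproduces the exact byte sequence, assuming the mbstring extension is available (the variable names here are illustrative, not from my scraper):

```php
<?php
// UTF-8 bytes for "ół", as seen in the hex editor: C3 B3 C5 82.
$utf8 = "\xC3\xB3\xC5\x82";

// Treat every byte as one ISO-8859-1 character and re-encode it as UTF-8.
// This mimics a parser that guesses the wrong input charset.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo strtoupper(bin2hex($utf8)), "\n";          // C3B3C582
echo strtoupper(bin2hex($doubleEncoded)), "\n"; // C383C2B3C385C282
```

So the page itself was fine; the damage happens somewhere between loadHTML() and nodeValue.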
SOLVED
The problem was in the scraped page. The character set was defined as
<meta charset=UTF-8>
not
<meta charset="UTF-8">
The workaround was to change the code to:
@$dom->loadHTML('<?xml encoding="utf-8" ?>'.$page);
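Putting it together, here is a rough end-to-end sketch of the workaround. The inline $page string is a made-up stand-in for the scraped page, and the $results array and the json_last_error() check are my additions for illustration; behaviour may differ between libxml versions, so treat it as a starting point rather than a guaranteed fix.

```php
<?php
// Hypothetical sample page standing in for the scraped $page.
// "\xC3\xB3\xC5\x82" is "ół" in UTF-8.
$page = '<html><head><meta charset=UTF-8>'
      . '<script type="application/ld+json">{"name":"'
      . "\xC3\xB3\xC5\x82"
      . '"}</script></head><body></body></html>';

$dom = new DOMDocument();
// Prefix from the workaround: hints to the parser that the input is UTF-8.
@$dom->loadHTML('<?xml encoding="utf-8" ?>' . $page);

$xpath = new DOMXPath($dom);
$jsonScripts = $xpath->query('//script[@type="application/ld+json"]');

$results = [];
foreach ($jsonScripts as $jScript) {
    $data = json_decode($jScript->nodeValue, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        continue; // skip script blocks that do not contain valid JSON
    }
    $results[] = $data;
}

// Dump the bytes of the decoded value to compare with the hex editor view.
echo bin2hex($results[0]['name']), "\n";
```

Comparing the printed hex with the original "C3 B3 C5 82" is a quick way to confirm the characters survived the round trip.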
Thank you @ChrisHaas!