Tags: php, goutte, domcrawler

DomCrawler filterXPath not always giving full URL


For my project, I'm using DomCrawler to parse pages and extract images.

Code:

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Exception\RequestException;

$goutteClient = new Client();
$guzzleClient = new GuzzleClient(array(
    'timeout' => 15,
));

$goutteClient->setClient($guzzleClient);

try {
    $crawler = $goutteClient->request('GET', $url);
    $crawlerError = false;
} catch (RequestException $e) {
    $crawlerError = true;
}

if ($crawlerError == false) {

    //find open graph image
    try {
        $file = $crawler->filterXPath("//meta[@property='og:image']")->attr('content');
    } catch (\InvalidArgumentException $e) {
        $file = null;
    }

    //if that fails, find the biggest image in the DOM      
    if (!$file) {
        $images = $crawler
            ->filterXPath('//img')
            ->extract(array('src'));

        $files = [];
        foreach ($images as $image) {

            $attributes = getimagesize($image);
            // stopping here since this is where I'm getting my error

The relevant part is at the bottom. This works some of the time, but occasionally I get an error. For example, if $url is https://www.google.com, it spits out the following error:

ErrorException (E_WARNING) getimagesize(/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png): failed to open stream: No such file or directory

If I dd($image); in this situation, $image equals "/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png".

However, if I try with a website that doesn't give me an error, like https://www.harvard.edu, dd($image); returns "https://www.harvard.edu/sites/default/files/feature_item_media/Kremer900x600.jpg"

In other words, I'm not getting the full URL. How can I rectify this?
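For reference, running the two values through parse_url() shows what is different about them, and why getimagesize() fails on the first one:

var_dump(parse_url('/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png'));
// only a 'path' key, no 'scheme' or 'host', so getimagesize() treats it as a local file path

var_dump(parse_url('https://www.harvard.edu/sites/default/files/feature_item_media/Kremer900x600.jpg'));
// includes 'scheme' => 'https' and 'host' => 'www.harvard.edu', so getimagesize() can fetch it over HTTP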


Solution

  • Prepend relative links with the page's scheme and host. You can use parse_url() on $url to extract the scheme and host, and use the same function on $image to detect whether a scheme/host is already set (see the sketch below).
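A minimal sketch of that idea, using a hypothetical helper resolveImageUrl() (this is not part of Goutte or DomCrawler, just an illustration of the parse_url approach):

function resolveImageUrl(string $image, string $pageUrl): string
{
    $imageParts = parse_url($image);

    // Already absolute: the URL has a scheme such as "https".
    if (isset($imageParts['scheme'])) {
        return $image;
    }

    $pageParts = parse_url($pageUrl);

    // Protocol-relative URL, e.g. "//cdn.example.com/logo.png".
    if (isset($imageParts['host'])) {
        return $pageParts['scheme'] . ':' . $image;
    }

    $base = $pageParts['scheme'] . '://' . $pageParts['host'];

    // Root-relative path, e.g. "/images/branding/...png" (the Google case).
    if (strpos($image, '/') === 0) {
        return $base . $image;
    }

    // Otherwise treat it as relative to the directory of the current page.
    $dir = isset($pageParts['path']) ? rtrim(dirname($pageParts['path']), '/') : '';
    return $base . $dir . '/' . $image;
}

Then call it inside the loop before getimagesize():

$attributes = getimagesize(resolveImageUrl($image, $url));

If guzzlehttp/psr7 is available in your project (Goutte pulls it in), GuzzleHttp\Psr7\UriResolver::resolve() performs this kind of base/relative resolution more robustly than a hand-rolled helper.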