For my project, I'm using DomCrawler (through Goutte) to parse pages and extract images.
Code:
$goutteClient = new Client();
$guzzleClient = new GuzzleClient(array(
    'timeout' => 15,
));
$goutteClient->setClient($guzzleClient);
try {
    $crawler = $goutteClient->request('GET', $url);
    $crawlerError = false;
} catch (RequestException $e) {
    $crawlerError = true;
}
if ($crawlerError == false) {
    // find the Open Graph image
    try {
        $file = $crawler->filterXPath("//meta[@property='og:image']")->attr('content');
    } catch (\InvalidArgumentException $e) {
        $file = null;
    }
    // if that fails, find the biggest image in the DOM
    if (!$file) {
        $images = $crawler
            ->filterXPath('//img')
            ->extract(array('src'));
        $files = [];
        foreach ($images as $image) {
            $attributes = getimagesize($image);
            // stopping here since this is where I'm getting my error
The relevant part is at the bottom. It works some of the time, but occasionally I get an error. For example, if $url is https://www.google.com, it produces the following error:
ErrorException (E_WARNING) getimagesize(/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png): failed to open stream: No such file or directory
If I dd($image); in this situation, $image equals "/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png".
However, if I try a website that doesn't give me an error, like https://www.harvard.edu, dd($image); returns "https://www.harvard.edu/sites/default/files/feature_item_media/Kremer900x600.jpg".
In other words, I'm not getting the full URL. How can I rectify this?
Prepend the scheme and host to the relative links. You can use parse_url on $url to extract the scheme and host, and use the same function on $image to detect whether a scheme/host is already set.
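As a rough illustration of that approach, here is a minimal sketch. The function name absoluteImageUrl is hypothetical (it is not part of DomCrawler or Goutte), and the handling of protocol-relative and document-relative paths goes slightly beyond what is described above. It does not normalize "../" segments or honor a <base> tag, so treat it as a starting point rather than a complete URL resolver.

```php
<?php
// Hypothetical helper: turn an <img> src into an absolute URL using the
// scheme and host of the page it was found on.
function absoluteImageUrl(string $image, string $pageUrl): string
{
    // A src that already carries its own scheme is left untouched.
    $imageParts = parse_url($image);
    if (isset($imageParts['scheme'])) {
        return $image;
    }

    $pageParts = parse_url($pageUrl);
    // Assumption: fall back to http if the page URL has no scheme.
    $scheme = $pageParts['scheme'] ?? 'http';

    // Protocol-relative src (//cdn.example.com/a.png): only the scheme is missing.
    if (strpos($image, '//') === 0) {
        return $scheme . ':' . $image;
    }

    $base = $scheme . '://' . ($pageParts['host'] ?? '');
    if (isset($pageParts['port'])) {
        $base .= ':' . $pageParts['port'];
    }

    // Root-relative src (/images/a.png): prepend scheme and host.
    if (strpos($image, '/') === 0) {
        return $base . $image;
    }

    // Document-relative src (img/a.png): resolve against the page's directory.
    $path = $pageParts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $base . $dir . $image;
}
```

Inside the loop from the question, the call would then look like getimagesize(absoluteImageUrl($image, $url)). Note that getimagesize can only read remote URLs when allow_url_fopen is enabled in the PHP configuration.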