I am using "fabpot/goutte": "^4.0".
I am trying to collect each release from the site, together with its date, into an array.
Please find my runnable example:
<?php
require_once '../vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
use Goutte\Client;
try {
    $resArr = array();
    $tempArr = array();
    $url = "https://www.steelcitycollectibles.com/product-release-calendar";
    // get page
    $client = new Client();
    $content = $client->request('GET', $url)->html();
    $crawler = new Crawler($content, null, null);
    $table = $crawler->filter('#schedule'); //->first()->closest('table');
    $index = 0;
    $resArr = array();
    $table->filter('div')
        ->each(function (Crawler $tr) use (&$index, &$resArr) {
            if ($tr->filter('.schedule-date')->count() > 0) {
                $releaseDate = $tr->filter('.schedule-date')->text();
            }
            if ($tr->filter('div > div.eight.columns > a')->count() > 0) {
                $releaseStr = $tr->filter('div > div.eight.columns > a')->text();
                array_push($resArr, [$releaseDate, $releaseStr]);
            }
        });
    var_dump($resArr);
} catch (Exception $e) {}
However, I do not get the correct date for each item: some entries come back with a null date.
For the null values I would like to add the correct date, in this case 12/20/21.
Assuming you want to apply the most recently seen date to each element of the array, set a default before the loop and update it inside the callback. The variable has to be captured by reference as well, because a closure otherwise works on its own copy and the assignment is lost as soon as the callback returns.
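To see the effect in isolation, here is a minimal sketch that needs no crawler at all. The $rows list and the isDateRow() helper are made up for illustration; only the difference between `use ($lastDate)` and `use (&$lastDate)` matters.

```php
<?php
// Hypothetical flat list: a date row followed by the products released on that date.
$rows = ['12/22/21', 'Case', 'Box', '12/24/21', 'Hobby Box'];

// Hypothetical helper: true for strings shaped like mm/dd/yy.
function isDateRow(string $row): bool
{
    return preg_match('#^\d{2}/\d{2}/\d{2}$#', $row) === 1;
}

// Captured by value: every call starts again from the copy taken when the closure was created.
$lastDate = null;
$byValue = [];
$copyCapture = function (string $row) use ($lastDate, &$byValue) {
    if (isDateRow($row)) {
        $lastDate = $row; // changes only this call's copy
        return;
    }
    $byValue[] = [$lastDate, $row];
};
array_map($copyCapture, $rows);

// Captured by reference: the assignment is still there on the next call.
$lastDate = null;
$byReference = [];
$refCapture = function (string $row) use (&$lastDate, &$byReference) {
    if (isDateRow($row)) {
        $lastDate = $row;
        return;
    }
    $byReference[] = [$lastDate, $row];
};
array_map($refCapture, $rows);

echo json_encode(['byValue' => $byValue, 'byReference' => $byReference]);
```

In the by-value run every date comes out as null, which is the symptom from the question; in the by-reference run each product carries the date that preceded it. Applied to the scraper, the same by-reference capture looks like this: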
<?php
require_once '../vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
use Goutte\Client;
try {
$resArr = [];
$content = <<< HTML
<div id="schedule" class="schedule nine columns">
<div class="schedule-date">12/22/21</div>
<div class="schedule-list clear">
<div class="eight columns">
<a href="xxx" class="schedule-product-title ">2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 2-Box Case</a>
</div>
<div class="schedule-notify three columns">
<release-schedule-notify type="'release'"/>
</div>
</div>
<div class="schedule-list clear">
<div class="eight columns">
<a href="xxx" class="schedule-product-title ">2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 Box</a>
</div>
<div class="schedule-notify three columns">
<release-schedule-notify type="'release'"/>
</div>
</div>
<div class="schedule-date">12/24/21</div>
<div class="schedule-list clear">
<div class="eight columns">
<a href="xxx">2021 Panini Flawless Baseball Hobby 2-Box Case</a>
</div>
<div class="schedule-notify three columns">
<release-schedule-notify type="'release'"/>
</div>
</div>
<div class="schedule-list clear">
<div class="eight columns">
<a href="xxx">2021 Panini Flawless Baseball Hobby Box</a>
</div>
<div class="schedule-notify three columns">
<release-schedule-notify type="'release'"/>
</div>
</div>
</div>
HTML;
    $crawler = new Crawler($content, null, null);
    $table = $crawler->filter('#schedule');
    // use today's date as a default, in case the first date is missing
    $releaseDate = (new DateTime())->format("m/d/y");
    $table->filter('div')
        ->each(function (Crawler $tr) use (&$resArr, &$releaseDate) {
            if ($tr->filter('.schedule-date')->count() > 0) {
                // update the date if it exists, otherwise continue with the old one
                $releaseDate = $tr->filter('.schedule-date')->text();
            }
            if ($tr->filter('div > div.eight.columns > a')->count() > 0) {
                $releaseStr = $tr->filter('div > div.eight.columns > a')->text();
                $resArr[] = [$releaseDate, $releaseStr];
            }
        });
} catch (Exception $e) {}
echo json_encode($resArr, JSON_PRETTY_PRINT);
Output:
[
[
"12\/22\/21",
"2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 2-Box Case"
],
[
"12\/22\/21",
"2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 2-Box Case"
],
[
"12\/22\/21",
"2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 Box"
],
[
"12\/24\/21",
"2021 Panini Flawless Baseball Hobby 2-Box Case"
],
[
"12\/24\/21",
"2021 Panini Flawless Baseball Hobby Box"
]
]
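Note that the first product appears twice in the output above. As far as I can tell this is because filter() compiles the CSS selector to an XPath that starts at descendant-or-self, so filter('div') also yields the #schedule container itself, and for that node both inner filters match (the first date and the first link). One way to avoid it is to visit only the direct children of #schedule, for example with $table->children()->each(...) instead of $table->filter('div')->each(...). The sketch below shows the same idea with PHP's built-in DOM extension rather than Goutte, purely so that it runs without Composer; the function name collectReleases() and the product names are invented for illustration.

```php
<?php
// Hypothetical helper: pairs each product title with the most recent date row above it.
function collectReleases(string $html): array
{
    // Collect parser warnings (e.g. for custom tags) internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $releases = [];
    $currentDate = null; // stays null until the first date row is seen

    // Only the direct <div> children of #schedule, in document order.
    foreach ($xpath->query('//*[@id="schedule"]/div') as $row) {
        $classes = preg_split('/\s+/', trim($row->getAttribute('class')));
        if (in_array('schedule-date', $classes, true)) {
            $currentDate = trim($row->textContent);
            continue;
        }
        // Product rows keep their link inside a child div whose class list contains "eight".
        $link = $xpath->query(
            './div[contains(concat(" ", normalize-space(@class), " "), " eight ")]/a',
            $row
        )->item(0);
        if ($link !== null) {
            $releases[] = [$currentDate, trim($link->textContent)];
        }
    }

    return $releases;
}

// Made-up sample markup with the same shape as the schedule page.
$html = <<<HTML
<div id="schedule">
  <div class="schedule-date">12/22/21</div>
  <div class="schedule-list clear"><div class="eight columns"><a href="xxx">Product A</a></div></div>
  <div class="schedule-list clear"><div class="eight columns"><a href="xxx">Product B</a></div></div>
  <div class="schedule-date">12/24/21</div>
  <div class="schedule-list clear"><div class="eight columns"><a href="xxx">Product C</a></div></div>
</div>
HTML;

echo json_encode(collectReleases($html), JSON_PRETTY_PRINT);
```

With this sample the result has exactly three rows, one per product, because the container is never visited as a row of its own.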
As a side note, the Goutte documentation says the request() method returns a Crawler object, so pulling out the HTML and creating a Crawler object manually is unnecessary. Change your code to this:
// get page
$crawler = (new Client)->request('GET', $url);