php goutte domcrawler

Goutte - Get list with date on top and title below


I am using "fabpot/goutte": "^4.0".

I am trying to scrape each release from the site, together with its date, into an array.

Please find my runnable example:

<?php

require_once '../vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;
use Goutte\Client;

try {

    $resArr = array();
    $tempArr = array();

    $url = "https://www.steelcitycollectibles.com/product-release-calendar";

    // get page
    $client = new Client();
    $content = $client->request('GET', $url)->html();
    $crawler = new Crawler($content, null, null);

    $table = $crawler->filter('#schedule'); //->first()->closest('table');

    $index = 0;
    $resArr = array();
    $table->filter('div')
        ->each(function (Crawler $tr) use (&$index, &$resArr) {

            if ($tr->filter('.schedule-date')->count() > 0) {
                $releaseDate = $tr->filter('.schedule-date')->text();
            }

            if ($tr->filter('div > div.eight.columns > a')->count() > 0) {
                $releaseStr = $tr->filter('div > div.eight.columns > a')->text();
                array_push($resArr, [$releaseDate, $releaseStr]);
            }

        });

    var_dump($resArr);
} catch (Exception $e) {}

However, I do not get the correct date for each item: some entries come back with null instead of a date.

For the null values I would like to fill in the correct date, in this case 12/20/21.


Solution

  • Assuming you want to apply the most recently seen date to each element of the array, set a default before the loop and update it inside the callback whenever a new date appears. The date variable has to be passed by reference as well, because a variable assigned inside the anonymous function does not keep its value from one call to the next.

    <?php
    
    require_once '../vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    use Goutte\Client;
    
    try {
    
        $resArr = [];
    
        $content = <<< HTML
    <div id="schedule" class="schedule nine columns">
        <div class="schedule-date">12/22/21</div>
        <div class="schedule-list clear">
            <div class="eight columns">
                <a href="xxx" class="schedule-product-title ">2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 2-Box Case</a>
            </div>
            <div class="schedule-notify three columns">
                <release-schedule-notify type="'release'"/>
            </div>
        </div>
        <div class="schedule-list clear">
            <div class="eight columns">
                <a href="xxx" class="schedule-product-title ">2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 Box</a>
            </div>
            <div class="schedule-notify three columns">
                <release-schedule-notify type="'release'"/>
            </div>
        </div>
        <div class="schedule-date">12/24/21</div>
        <div class="schedule-list clear">
            <div class="eight columns">
                <a href="xxx">2021 Panini Flawless Baseball Hobby 2-Box Case</a>
            </div>
            <div class="schedule-notify three columns">
                <release-schedule-notify type="'release'"/>
            </div>
        </div>
        <div class="schedule-list clear">
            <div class="eight columns">
                <a href="xxx">2021 Panini Flawless Baseball Hobby Box</a>
            </div>
            <div class="schedule-notify three columns">
                <release-schedule-notify type="'release'"/>
            </div>
        </div>
    </div>
    HTML;
    
        $crawler = new Crawler($content, null, null);
    
        $table = $crawler->filter('#schedule');
    
        // use today's date as a default, in case first one is missing
        $releaseDate = (new DateTime())->format("m/d/y");
        $table->filter('div')
            ->each(function (Crawler $tr) use (&$resArr, &$releaseDate) {
                if ($tr->filter('.schedule-date')->count() > 0) {
                    // update the date if it exists, otherwise continue with the old one
                    $releaseDate = $tr->filter('.schedule-date')->text();
                }
                if ($tr->filter('div > div.eight.columns > a')->count() > 0) {
                    $releaseStr = $tr->filter('div > div.eight.columns > a')->text();
                    $resArr[] = [$releaseDate, $releaseStr];
                }
            });
    } catch (Exception $e) {}
    
    echo json_encode($resArr, JSON_PRETTY_PRINT);
    

    Output:

    [
        [
            "12\/22\/21",
            "2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 2-Box Case"
        ],
        [
            "12\/22\/21",
            "2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 2-Box Case"
        ],
        [
            "12\/22\/21",
            "2022 Gold Rush Autographed Full-Size Speed Flex Helmet Edition Series 1 Box"
        ],
        [
            "12\/24\/21",
            "2021 Panini Flawless Baseball Hobby 2-Box Case"
        ],
        [
            "12\/24\/21",
            "2021 Panini Flawless Baseball Hobby Box"
        ]
    ]
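
    To illustrate why $releaseDate has to be captured by reference, here is a minimal sketch that runs without Goutte. The $rows array and the two closures are hypothetical stand-ins for the crawled elements and the each() callback; capturing by value is used here to mimic a variable that does not persist between calls.

    ```php
    <?php
    // Hypothetical rows standing in for the crawled <div> elements:
    // a 'date' row sets the current date, an 'item' row should inherit it.
    $rows = [
        ['date', '12/22/21'],
        ['item', 'Product A'],
        ['item', 'Product B'],
    ];

    // By value: each call starts again from the captured copy (null),
    // so the date assigned on the 'date' row is gone on the next call.
    $byValue = [];
    $date = null;
    $collectByValue = function (array $row) use ($date, &$byValue) {
        if ($row[0] === 'date') {
            $date = $row[1];
        } else {
            $byValue[] = [$date, $row[1]];
        }
    };

    // By reference: the assignment is written to the outer variable
    // and is still there when the next row is processed.
    $byRef = [];
    $lastDate = null;
    $collectByRef = function (array $row) use (&$lastDate, &$byRef) {
        if ($row[0] === 'date') {
            $lastDate = $row[1];
        } else {
            $byRef[] = [$lastDate, $row[1]];
        }
    };

    foreach ($rows as $row) {
        $collectByValue($row);
        $collectByRef($row);
    }

    var_dump($byValue[0][0]); // NULL
    var_dump($byRef[0][0]);   // string(8) "12/22/21"
    ```

    The by-value version reproduces the null dates from the question, while the by-reference version carries the last seen date forward.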
    

    As a side note, the documentation for Goutte says the request() method returns a Crawler object. You're needlessly pulling out the HTML and creating a Crawler object manually. Change your code to this:

    // get page
    $crawler = (new Client)->request('GET', $url);
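
    One more observation on the output above: the first pair is listed twice. That appears to happen because filter('div') also matches the #schedule wrapper itself, which contains both a date and a link. A possible way around it is to visit only the direct children of #schedule. The sketch below does that with PHP's bundled DOM extension (DOMDocument, DOMXPath) rather than DomCrawler; the markup and product names are a shortened, made-up sample, so treat it as an illustration under those assumptions and not as a drop-in replacement.

    ```php
    <?php
    // Shortened, hypothetical sample shaped like the page's #schedule block.
    $html = <<<'HTML'
    <div id="schedule">
        <div class="schedule-date">12/22/21</div>
        <div class="schedule-list clear">
            <div class="eight columns"><a href="xxx">Product A</a></div>
        </div>
        <div class="schedule-list clear">
            <div class="eight columns"><a href="xxx">Product B</a></div>
        </div>
        <div class="schedule-date">12/24/21</div>
        <div class="schedule-list clear">
            <div class="eight columns"><a href="xxx">Product C</a></div>
        </div>
    </div>
    HTML;

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $releases = [];
    $releaseDate = null;

    // Select only the direct <div> children of #schedule,
    // so the wrapper element itself is never visited.
    foreach ($xpath->query('//div[@id="schedule"]/div') as $row) {
        $classes = preg_split('/\s+/', trim($row->getAttribute('class')));

        if (in_array('schedule-date', $classes, true)) {
            // A date row: remember it for the product rows that follow.
            $releaseDate = trim($row->textContent);
            continue;
        }

        // A product row: take the first link inside it, if there is one.
        $link = $xpath->query('.//a', $row)->item(0);
        if ($link !== null) {
            $releases[] = [$releaseDate, trim($link->textContent)];
        }
    }

    echo json_encode($releases, JSON_PRETTY_PRINT), PHP_EOL;
    ```

    Because this is a plain foreach loop, $releaseDate lives in the enclosing scope and no by-reference capture is needed.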