php web-scraping data-extraction domparser image-optimization

How do I extract image paths and their recommended new dimensions for automated image optimisation?


I am writing a PHP script to scrape the image URLs and their recommended dimensions from the GTmetrix report at https://gtmetrix.com/reports/example.com/a_unique_code.

After extracting the image path and the suggested new height and width, I will programmatically optimize my images.

The following is the relevant portion of the HTML returned from that URL:

<tr class="rules-details" style="display: none">
    <td colspan="4">
        <a href="/serve-scaled-images.html" class="rule-help btn hover-tooltip" data-tooltip-interactive data-tooltip-max-width="450" title="&lt;h4&gt;Serve scaled images&lt;/h4&gt;&lt;p&gt;Serving appropriately-sized images can save many bytes of data and improve the performance of your webpage, especially on low-powered (eg. mobile) devices.&lt;/p&gt;&lt;p class=&quot;rule-help-tooltip-more&quot;&gt;&lt;a href=&quot;/serve-scaled-images.html&quot;&gt;Read more&lt;/a&gt;&lt;/p&gt;"><i class="sprite-question"></i><span class="resp-hidden">What's this mean?</span></a>
        <div>
            <p>The following images are resized in HTML or CSS. Serving scaled images could save 1.3MiB (45% reduction).
                <ul>
                    <li><a href="https://www.example.com/Pictures/thumbs/0029.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0029.jpg</a> is resized in HTML or CSS from 300x623 to 123x200. Serving a scaled image could save 51.3KiB (86% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0133.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0133.jpg</a> is resized in HTML or CSS from 300x578 to 135x200. Serving a scaled image could save 44.0KiB (84% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0075.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0075.jpg</a> is resized in HTML or CSS from 300x390 to 176x200. Serving a scaled image could save 43.2KiB (69% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0057.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0057.jpg</a> is resized in HTML or CSS from 300x436 to 174x200. Serving a scaled image could save 35.0KiB (73% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 31.4KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.9KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
                    <li><a href="https://www.example.com/Pictures/thumbs/0093.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0093.jpg</a> is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).</li>
                </ul>
            </p>
        </div>
    </td>
</tr>

After advice from John Conde to use a DOM parser, here is my coding attempt:

$html = file_get_contents('https://gtmetrix.com/reports/example.com/a_unique_code');
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXpath($document);
$stack = array();

$expression = './/tr[contains(concat(" ", normalize-space(@class), " "), " rules-details ")]';
foreach ($xpath->evaluate($expression) as $tr) 
{
    array_push($stack, $tr->nodeValue);
}
$i=0;
foreach ($stack as $string) 
{
    $search_string = $string;
    $find = 'reduction';
    $pos = strpos($search_string, $find);
    if($pos===false){}
    else
    {
        $string = str_replace("What's this mean?","",$string);
        $string = trim(preg_replace("/\s+/", " ", $string));
        $string_array = explode(').', $string);
        for($i=0;$i<sizeof($string_array);$i++)
        {
            $search_string = $string_array[$i];
            $find = 'The following images are resized in HTML or CSS.';
            $pos = strpos($search_string, $find);
            if($pos===false){}
            else
            {
                unset($string_array[$i]);
            }

            $find = "Optimize the following images to reduce their size by";
            $pos = strpos($search_string, $find);
            if($pos===false){}
            else
            {
                $current_index = $string_array[$i];
                $array_size = sizeof($string_array);

                for($j=$current_index;$j<$array_size;$j++)
                {
                    unset($string_array[$i]);
                }
            }

            echo '<pre>'.$string_array[$i];
        }
    }
}

The question is: given the following string, how do I extract the URL and the second (resized) image dimensions?

example.com/Pictures/thumbs/0093.jpg is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).

I need the URL (example.com/Pictures/thumbs/0093.jpg) and the new dimensions (138x200).

I will be optimizing this prototype script, but this is how I am implementing John Conde's answer:

<?php

// #########################################
// AUTOMATED IMAGE OPTIMIZATION
// #########################################

class Image
{
    public $image_url;
    public $image_name;
    public $image_path;
    public $image_full_path;
    public $original_size;
    public $new_size;
}

$debugging = true;

if($debugging === true){echo '<ul class="Results" style="display:block; height:auto;">';}

try
{

    $HTML = file_get_contents('https://gtmetrix.com/reports/www.example.com/a_unique_code');// Get Webpage
    switch($HTML)
    {

        case false:
            if($debugging === true)
            {
                $error = error_get_last();
                echo '<li class="Error_Msg" style="display:block; height:auto;">';
                echo '<span><b>## FATAL ERROR - PROGRAM ABORTED ##</b></span>';
                echo '<span><b>Message:</b> Could not retrieve the HTML document</span>';
                echo '</li>';
                error_clear_last();
                exit;
            }
            break;

        default:// START OF WRAPPER

            $DOMdoc = new DOMDocument();// Object to store an HTML document
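            libxml_use_internal_errors(true);// Collect HTML parse warnings internally instead of emitting them
            $DOMdoc->loadHTML($HTML);// Parse the HTML
            libxml_clear_errors();// Discard the collected parse warnings
            $racks = (new DOMXPath($DOMdoc))->query('//tr/td/div//ul/li');// Select every <li> in the report's rule lists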
            $images_info_array = array();// Array for storing image details objects
            $document_root = $_SERVER['DOCUMENT_ROOT'];// Define the document root

            foreach($racks as $rack)// Traverse over the HTML structure
            {
                // Pattern for "<URL> is resized in HTML or CSS from WxH to WxH." that captures both sizes
                $expression = '~https?://[^",]+ is resized in HTML or CSS from (\d+x\d+) to (\d+x\d+)\.~';
                if(preg_match($expression, $rack->nodeValue, $matched) === 1)// If the pattern is found then
                {
                    $url = $rack->firstChild->nodeValue;// Get the URL from the link text
                    [, $original_size, $new_size] = $matched;// Captured "from" and "to" dimensions

                    $url_parts = parse_url($url);// Break the URL up into sections
                    $directory_path = $url_parts['path'];// Get the directory path without the domain
                    $path_parts = pathinfo($directory_path);// Get information about a file path

                    $position = strpos($directory_path, '/');// Find the first / in the file path
                    if ($position !== false)// If found 
                    {

                        $new_directory_path = substr_replace($directory_path, "", $position, strlen('/'));// Remove the /

                        $image_info = new Image();// Create a new Image Object 
                        $image_info->image_url = $url;// Store the image URL
                        $image_info->image_name = basename($url);// Store just the image name
                        $image_info->image_path = $path_parts['dirname'];// Store image directory without domain & file name
                        $image_info->image_full_path = $new_directory_path;// 
                        $image_info->original_size = $original_size;// Store the original image size
                        $image_info->new_size = $new_size;// Store the new image size

                        array_push($images_info_array, $image_info);// Add the image information to an array

                    }else{
                        if($debugging === true)
                        {
                            $error = error_get_last();
                            echo '<li class="Warning_Msg">';
                            echo '<span><b>## WARNING - FILE PATH CHARACTER MISSING ##</b></span>';
                            echo '<span><b>Message:</b> / in the file path not found</span>';
                            echo '</li>';
                            error_clear_last();
                        }
                    }

                }else{// If the pattern is not found then
                    if($debugging === true)
                    {
                        $error = error_get_last();
                        echo '<li class="Error_Msg" style="display:block; height:auto;">';
                        echo '<span><b>## FATAL ERROR - PROGRAM ABORTED ##</b></span>';
                        echo '<span><b>Message:</b> Could not find the pattern required to extract the URL & size information</span>';
                        echo '</li>';
                        error_clear_last();
                        exit;
                    }
                }
            }

            foreach($images_info_array as $image_info)// Traverse the image info array
            {
                if(file_exists($document_root.'/'.$image_info->image_full_path))// Check if the image exists under the document root
                {
                    $temp_path = $document_root.$image_info->image_path.'/temp/';// Define a temporary folder location

                    // Create the temporary folder only if it is missing. Deleting an existing
                    // folder here would remove images resized earlier in this loop and leave
                    // no folder for the convert command to write into.
                    if(!is_dir($temp_path))
                    {
                        mkdir($temp_path, 0777, true);
                    }

                    // Define the convert command for recommended optimization of the image; escape every dynamic part
                    $source_file = $document_root.'/'.$image_info->image_full_path;
                    $target_file = $temp_path.$image_info->image_name;
                    $command = 'convert '.escapeshellarg($source_file).' -thumbnail '.escapeshellarg($image_info->new_size).' '.escapeshellarg($target_file).' 2>&1';
                    $last_line = system($command, $return_value);// Run the defined command

                    if($debugging === true)
                    {
                        if($return_value === 0)// An exit status of 0 means the command succeeded
                        {
                            echo '<li class="Normal_Message">';
                            echo '<span><b>MESSAGE - THE COMMAND COMPLETED SUCCESSFULLY</b></span>';
                            echo '<span><b>Command:</b> '.$command.'</span>';
                            echo '<span><b>Directory:</b> '.$image_info->image_full_path.'</span>';
                            echo '<span><b>Resized:</b> '.$image_info->new_size.'</span>';
                            echo '<span><b>Returned:</b> '.$return_value.'</span>';
                            echo '<span><b>Output:</b> '.$last_line.'</span>';
                            echo '</li>';
                        }
                        else// Any other exit status means the command failed
                        {
                            echo '<li class="Error_Msg" style="display:block; height:auto;">';
                            echo '<span><b>## ERROR - THE COMMAND DID NOT COMPLETE ##</b></span>';
                            echo '<span><b>Command:</b> '.$command.'</span>';
                            echo '<span><b>Returned:</b> '.$return_value.'</span>';
                            echo '<span><b>Output:</b> '.$last_line.'</span>';
                            echo '</li>';
                        }
                    }
                }
                else// If the file does not exist
                {
                    echo '<li class="Warning_Message" style="display:block; height:auto;">The file doesn\'t exist</li>';
                }

            }

            break;// END OF WRAPPER        
    }       

}
catch(Exception $Error_Message)
{
    echo $Error_Message;
}

echo '</ul>';

?>
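
The resize step in the script above passes scraped values to the shell, so one way to harden it is sketched below. This is only an illustration under assumptions: `buildResizeCommand` is a hypothetical helper name, the file paths are made up, and ImageMagick's `convert` is assumed to be installed. The sketch validates the size string and escapes every argument before the command is built.

```php
<?php
// Hypothetical helper: build an ImageMagick command with every dynamic part escaped.
function buildResizeCommand(string $source, string $target, string $size): string
{
    // Accept only WIDTHxHEIGHT (for example "138x200") so nothing else reaches the shell.
    if (preg_match('/^\d+x\d+$/', $size) !== 1) {
        throw new InvalidArgumentException("Unexpected size: $size");
    }

    return sprintf(
        'convert %s -thumbnail %s %s 2>&1',
        escapeshellarg($source),
        escapeshellarg($size),
        escapeshellarg($target)
    );
}

// Example paths are assumptions, not taken from the report.
$command = buildResizeCommand(
    '/var/www/Pictures/thumbs/0093.jpg',
    '/var/www/Pictures/thumbs/temp/0093.jpg',
    '138x200'
);
echo $command, PHP_EOL;

// To run it, exec() fills $exitCode with the exit status; 0 conventionally means success:
// exec($command, $outputLines, $exitCode);
// if ($exitCode !== 0) { /* report $outputLines */ }
```

Checking the exit status against 0 avoids treating a failed command as a success, which is easy to do when the status is compared loosely with true or false.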

Solution

  • This will parse that HTML and output the text you are looking for:

    $html = '<tr class="rules-details" style="display: none">
        <td colspan="4">
            <a href="/serve-scaled-images.html" class="rule-help btn hover-tooltip" data-tooltip-interactive data-tooltip-max-width="450" title="&lt;h4&gt;Serve scaled images&lt;/h4&gt;&lt;p&gt;Serving appropriately-sized images can save many bytes of data and improve the performance of your webpage, especially on low-powered (eg. mobile) devices.&lt;/p&gt;&lt;p class=&quot;rule-help-tooltip-more&quot;&gt;&lt;a href=&quot;/serve-scaled-images.html&quot;&gt;Read more&lt;/a&gt;&lt;/p&gt;"><i class="sprite-question"></i><span class="resp-hidden">What\'s this mean?</span></a>
            <div>
                <p>The following images are resized in HTML or CSS. Serving scaled images could save 1.3MiB (45% reduction).
                    <ul>
                        <li><a href="https://www.example.com/Pictures/thumbs/0029.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0029.jpg</a> is resized in HTML or CSS from 300x623 to 123x200. Serving a scaled image could save 51.3KiB (86% reduction).</li>
                        <li><a href="https://www.example.com/Pictures/thumbs/0133.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0133.jpg</a> is resized in HTML or CSS from 300x578 to 135x200. Serving a scaled image could save 44.0KiB (84% reduction).</li>
                        <li><a href="https://www.example.com/Pictures/thumbs/0075.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0075.jpg</a> is resized in HTML or CSS from 300x390 to 176x200. Serving a scaled image could save 43.2KiB (69% reduction).</li>
                        <li><a href="https://www.example.com/Pictures/thumbs/0057.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0057.jpg</a> is resized in HTML or CSS from 300x436 to 174x200. Serving a scaled image could save 35.0KiB (73% reduction).</li>
                        <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 31.4KiB (78% reduction).</li>
                        <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.9KiB (78% reduction).</li>
                        <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
                        <li><a href="https://www.example.com/Pictures/thumb/thumb.png" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/thumb.png</a> is resized in HTML or CSS from 148x100 to 68x46. Serving a scaled image could save 30.7KiB (78% reduction).</li>
                        <li><a href="https://www.example.com/Pictures/thumbs/0093.jpg" target="_blank" rel="nofollow noopener noreferrer">https://www.example.com/Pictures/thumbs/0093.jpg</a> is resized in HTML or CSS from 300x458 to 138x200. Serving a scaled image could save 28.9KiB (79% reduction).</li>
                    </ul>
                </p>
            </div>
        </td>
    </tr>';
    
    $doc = new DOMDocument();
    $html = @$doc->loadHTML($html);
    $items = (new DOMXPath($doc))->query('//tr/td/div//ul/li');
    foreach ($items as $item) {
        $url = $item->firstChild->nodeValue;
        preg_match_all('/\d+x\d+/', $item->nodeValue, $matches); // \d+ also handles dimensions of 1000px or more
        [$original, $resized] = $matches[0];
        printf('URL:%s Original: %s Resized: %s%s', $url, $original, $resized, PHP_EOL);
    }
    

    Outputs

    URL:https://www.example.com/Pictures/thumbs/0029.jpg Original: 300x623 Resized: 123x200
    URL:https://www.example.com/Pictures/thumbs/0133.jpg Original: 300x578 Resized: 135x200
    URL:https://www.example.com/Pictures/thumbs/0075.jpg Original: 300x390 Resized: 176x200
    URL:https://www.example.com/Pictures/thumbs/0057.jpg Original: 300x436 Resized: 174x200
    URL:https://www.example.com/Pictures/thumbs/thumb.png Original: 148x100 Resized: 68x46
    URL:https://www.example.com/Pictures/thumbs/thumb.png Original: 148x100 Resized: 68x46
    URL:https://www.example.com/Pictures/thumbs/thumb.png Original: 148x100 Resized: 68x46
    URL:https://www.example.com/Pictures/thumbs/thumb.png Original: 148x100 Resized: 68x46
    URL:https://www.example.com/Pictures/thumbs/0093.jpg Original: 300x458 Resized: 138x200
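
If you only have the plain sentence from the question rather than the DOM node, a single regular expression with named groups can pull out the URL and both sizes. The following is a minimal sketch: `parseScaledImageLine` is a hypothetical helper name, and the pattern assumes GTmetrix keeps the exact wording "is resized in HTML or CSS from ... to ...".

```php
<?php
// Hypothetical helper: parse one "serve scaled images" sentence.
// Returns null when the line does not match the expected wording.
function parseScaledImageLine(string $line): ?array
{
    $pattern = '~^\s*(?<url>\S+) is resized in HTML or CSS '
             . 'from (?<ow>\d+)x(?<oh>\d+) to (?<nw>\d+)x(?<nh>\d+)\.~';

    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }

    return [
        'url'      => $m['url'],
        'original' => ['width' => (int) $m['ow'], 'height' => (int) $m['oh']],
        'resized'  => ['width' => (int) $m['nw'], 'height' => (int) $m['nh']],
    ];
}

$line = 'example.com/Pictures/thumbs/0093.jpg is resized in HTML or CSS from 300x458 to 138x200. '
      . 'Serving a scaled image could save 28.9KiB (79% reduction).';

$info = parseScaledImageLine($line);
printf("%s -> %dx%d\n", $info['url'], $info['resized']['width'], $info['resized']['height']);
// prints: example.com/Pictures/thumbs/0093.jpg -> 138x200
```

Anchoring the sizes to the words "from" and "to" means digits inside the URL itself (a file named like 300x200.jpg, say) are not mistaken for dimensions.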