php curl web-scraping trustpilot

PHP - Scrape data from all Trustpilot reviews


<?php 
for ($x = 0; $x <= 25; $x++) {

$ch = curl_init("https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
//curl_setopt($ch, CURLOPT_POST, true);
//curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
//curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0); 
curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout in seconds
$trustpilot = curl_exec($ch);

// Check if any error occurred
if(curl_errno($ch))
{
     die('Fatal Error Occurred');
}

} 
?>

This code fetches all 25 pages of reviews for example.com. What I want to do next is put all the results into a JSON array or something similar.

I tried the code below to retrieve all of the names:

<?php
$trustpilot = preg_replace('/\s+/', '', $trustpilot); // Remove all whitespace
$first = explode( '"name":"' , $trustpilot );
$second = explode('"' , $first[1] );
$result = preg_replace('/[^a-zA-Z0-9-.*_]/', '', $second[0]);    //Don't allow special characters

?>

This is clearly a lot harder than I anticipated. Does anyone know how I could get all of the reviews into JSON (or something similar) for however many pages I choose? In this case I chose 25 pages' worth of reviews.

Thanks!


Solution

  • Do not parse HTML with regex.

    Use DOMDocument and DOMXPath to parse it instead. Also, you create a new curl handle for each page but never close any of them, which leaks resources and memory. It also wastes CPU, because you could reuse a single curl handle for every page instead of creating a new one each time. Pro tip: this HTML compresses rather well, so you should use CURLOPT_ENCODING to download the pages compressed, e.g.:

    <?php
    declare(strict_types = 1);
    header("Content-Type: text/plain;charset=utf-8");
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_ENCODING, ''); // enables compression
    $reviews = [];
    for ($x = 0; $x <= 25; $x ++) {
        curl_setopt($ch, CURLOPT_URL, "https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
        // curl_setopt($ch, CURLOPT_POST, true);
        // curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        // curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
        $trustpilot = curl_exec($ch);
    
        // Check if any error occurred
        if (curl_errno($ch)) {
            die('fatal error: curl_exec failed, ' . curl_errno($ch) . ": " . curl_error($ch));
        }
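        // loadHTML() is an instance method; calling it statically throws an Error on PHP 8.
        $domd = new DOMDocument();
        @$domd->loadHTML($trustpilot); // @ silences warnings about malformed markup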
        $xp = new DOMXPath($domd);
        foreach ($xp->query("//article[@class='review-card']") as $review) {
            $id = $review->getAttribute("id");
            $reviewer = $xp->query(".//*[@class='content-section__consumer-info']", $review)->item(0)->textContent;
            $stars = $xp->query('.//div[contains(@class,"star-item")]', $review)->length;
            $title = $xp->query('.//*[@class="review-info__body__title"]', $review)->item(0)->textContent;
            $text = $xp->query('.//*[@class="review-info__body__text"]', $review)->item(0)->textContent;
            $reviews[$id] = array(
                'reviewer' => mytrim($reviewer),
                'stars' => ($stars),
                'title' => mytrim($title),
                'text' => mytrim($text)
            );
        }
    }
    curl_close($ch);
    echo json_encode($reviews, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | (defined("JSON_UNESCAPED_LINE_TERMINATORS") ? JSON_UNESCAPED_LINE_TERMINATORS : 0) | JSON_NUMERIC_CHECK);
    
    
    function mytrim(string $text): string
    {
        return preg_replace("/\s+/", " ", trim($text));
    }
    

    Output:

    {
        "4d6bbf8a0000640002080bc2": {
            "reviewer": "Clement Skau Århus, DK, 3 reviews",
            "stars": 5,
            "title": "Godt fundet på!",
            "text": "Det er rigtig fint gjort at lave et example domain. :)"
        }
    }
    

    Only one review comes back because the URL you listed has only one review. 4d6bbf8a0000640002080bc2 is the website's internal ID for that review (probably a SQL database ID).
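
    One caveat with the loop above: ->item(0)->textContent fails as soon as a review is missing one of the queried elements (for example, a review with a title but no body text). Below is a minimal offline sketch of a more defensive variant, run against a hard-coded HTML string so it needs no network access. The sample markup and the helper names firstText(), normalizeSpace() and parseReviews() are made up for illustration; only the class names are taken from the XPath queries above, and the real page structure may differ or change.

```php
<?php
// Hypothetical sample markup; only the class names mirror the queries above.
$html = '<html><body>
<article class="review-card" id="r1">
  <div class="content-section__consumer-info"> Alice
     London, GB </div>
  <div class="star-rating"><div class="star-item"></div><div class="star-item"></div></div>
  <h2 class="review-info__body__title">Great</h2>
  <p class="review-info__body__text">Fast   delivery.</p>
</article>
<article class="review-card" id="r2">
  <div class="content-section__consumer-info">Bob</div>
  <h2 class="review-info__body__title">No body text</h2>
</article>
</body></html>';

// Collapse runs of whitespace into single spaces and trim the ends.
function normalizeSpace(string $text): string
{
    return trim((string) preg_replace('/\s+/', ' ', $text));
}

// Return the cleaned text of the first match, or null when nothing matches.
function firstText(DOMXPath $xp, string $query, DOMNode $context): ?string
{
    $nodes = $xp->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return normalizeSpace($nodes->item(0)->textContent);
}

// Parse one page of HTML into an array of reviews keyed by review id.
function parseReviews(string $html): array
{
    // Collect libxml warnings (e.g. about HTML5 tags) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $reviews = [];
    foreach ($xp->query("//article[@class='review-card']") as $review) {
        $reviews[$review->getAttribute('id')] = [
            'reviewer' => firstText($xp, ".//*[@class='content-section__consumer-info']", $review),
            'stars'    => $xp->query(".//div[contains(@class,'star-item')]", $review)->length,
            'title'    => firstText($xp, ".//*[@class='review-info__body__title']", $review),
            'text'     => firstText($xp, ".//*[@class='review-info__body__text']", $review),
        ];
    }
    return $reviews;
}

echo json_encode(parseReviews($html), JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE), "\n";
```

    With this variant a missing element shows up as null in the JSON instead of aborting the whole run, and libxml_use_internal_errors() replaces the @ operator for silencing parser warnings. In the curl loop you would call parseReviews($trustpilot) once per page and merge the results into $reviews.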