php guzzle http-status-code-403

What is a legitimate way in PHP to test if a third-party site is working?


We host a set of "resource" pages - collections of useful links for our users. For years we've had a script run daily, looping through each link and sending one Guzzle HEAD request from PHP to make sure each page on the resource sites is still active.

But over the past few years, I suspect because more and more sites have adopted Cloudflare, sites have started returning 403 to the HEAD request, and it's getting to the point where the check is pretty much useless.

Is there a way to do this that isn't going to get this traffic treated as malicious? I don't need the content from the other sites... just simply to know if the pages are in good working order.

Here's the PHP code I'm using:

$client = new Client();
$request = $client->head($encoded_link);
$request->setOptions(['userAgent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36']);
$response = $request->send();

Solution

  • There are several ways to approach this, depending on your needs.

    1. If the number of resources you want to check is not too high, you could use a monitoring service such as UptimeRobot or Pingdom.

    2. For the most realistic approach, consider driving a headless browser from PHP with a library such as chrome-php, php-webdriver or Symfony Panther, which interacts with sites the way a real browser does. It takes some setup work at first, but it is usually very effective.

    3. Your script can be improved:

      1. Use GET instead of HEAD requests
        Many security systems are more suspicious of HEAD requests since they're commonly used by automated tools but rarely by real users. Switching to GET requests might help:
        // In Guzzle 6+, get() sends the request immediately and returns the response.
        $response = $client->get($encoded_link);

      2. Improve your user agent string

        Your current user agent identifies as Chrome 61, which was released in 2017. Use a more recent browser signature:

        $options = [
            'headers' => [
                // Swap in the user agent string of a current browser release.
                'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
            ]
        ];
        // Guzzle 6+ returns the response directly from get().
        $response = $client->get($encoded_link, $options);
        
      3. Add realistic headers

        Include headers that typical browsers would send:

        $options = [
            'headers' => [
                'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
                'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
                'Accept-Language' => 'en-US,en;q=0.9',
                'Accept-Encoding' => 'gzip, deflate, br',
                'Connection' => 'keep-alive',
                'Upgrade-Insecure-Requests' => '1',
                'Sec-Fetch-Dest' => 'document',
                'Sec-Fetch-Mode' => 'navigate',
                'Sec-Fetch-Site' => 'none',
                'Sec-Fetch-User' => '?1'
            ]
        ];
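
Putting the script improvements together, here is a minimal sketch of a link check, assuming Guzzle 6 or 7 installed through Composer. The helper names checkLink and classifyStatus are hypothetical, the timeouts are arbitrary, and only a subset of the headers listed above is sent. Reporting 401, 403 and 429 as "blocked" rather than "broken" is an assumption about what you want the daily report to show: the server answered but refused this client, so the page may still be fine for a real visitor.

```php
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\TransferException;

// Composer autoloader; this path is an assumption about your project layout.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require $autoload;
}

/**
 * Hypothetical helper: map an HTTP status code to a coarse link state.
 */
function classifyStatus(int $status): string
{
    if ($status >= 200 && $status < 400) {
        return 'ok';
    }
    // Assumption: these codes mean "refused this client", not "page is gone".
    if (in_array($status, [401, 403, 429], true)) {
        return 'blocked';
    }
    return 'broken';
}

/**
 * Hypothetical helper: send one GET request and classify the outcome.
 */
function checkLink(Client $client, string $url): string
{
    try {
        $response = $client->request('GET', $url, [
            'http_errors'     => false, // return 4xx/5xx responses instead of throwing
            'stream'          => true,  // do not download the whole body up front
            'allow_redirects' => true,
            'connect_timeout' => 5,     // seconds, arbitrary
            'timeout'         => 10,    // seconds, arbitrary
            'headers' => [
                'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
                'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language' => 'en-US,en;q=0.9',
            ],
        ]);
    } catch (TransferException $e) {
        // DNS failure, refused connection, timeout, TLS error and similar.
        return 'unreachable';
    }

    return classifyStatus($response->getStatusCode());
}

// Usage (performs a real network request, so it is left commented out):
// $client = new Client();
// echo checkLink($client, $encoded_link), PHP_EOL;
```

With this split, only the "broken" and "unreachable" results need to be treated as dead links, while "blocked" results can be queued for a slower check such as a headless browser or a manual review.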