Tags: php, curl, php-5.4, curl-multi

PHP Multi-cURL requests delayed until timeout


Summary

I have some PHP 5.4 code which fetches a batch of Facebook/Instagram photos in parallel using multi curl. This code has been working for years, and nothing has changed as far as I can tell.

I add multiple cURL handles to a 'multi' handle. Each handle gets a CURLOPT_TIMEOUT. The problem I'm seeing is that, all of a sudden, some of my requests don't complete until this timeout is reached (no matter what timeout I set).

Code

I do something like this (simplified):

do {
    while (CURLM_CALL_MULTI_PERFORM === curl_multi_exec($mh, $running));

    // Wait for activity on any curl connection (optional, reduces CPU).
    // curl_multi_select() can return -1, in which case sleep briefly
    // so the outer loop doesn't spin.
    if (curl_multi_select($mh) === -1) {
        usleep(100);
    }

    // a request was just completed -- find out which one
    while ($done = curl_multi_info_read($mh))
    {
        $completedCurlRequest = $done['handle'];

        // save the file
        do_some_work($completedCurlRequest);

        curl_multi_remove_handle($mh, $completedCurlRequest);
    }
} while ($running);
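
For completeness, the handles are added to $mh ahead of this loop, roughly like so (a sketch -- the $urls array is illustrative, and the 15-second timeout matches the CURLOPT_TIMEOUT seen in the traces below):

$mh = curl_multi_init();
foreach ($urls as $url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // hand the image data back to PHP instead of echoing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);       // per-request cap; this is the timeout the slow requests hit
    curl_multi_add_handle($mh, $ch);
}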

I use this script to run batches of about 40 parallel requests to fetch some images (from Facebook). Most of them take about 500ms to complete. However, a few of the requests "hang" (until the CURLOPT_TIMEOUT) before they arrive.

Basically, the curl_multi_select step blocks for the entire timeout. Or, if I remove that curl_multi_select line, the outer loop spins (burning CPU) until the timeout.

Questions

  1. Am I doing something wrong in my use of multi-curl that could cause this? (but if so, what's changed?)
  2. Have Facebook and Instagram changed anything that might cause this?
  3. Could something on my server have changed to trigger this?
  4. How can I debug this?
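
Regarding question 4, the most useful thing so far has been dumping cURL's own timing breakdown for each completed handle via curl_getinfo. A sketch (logCurlTiming is an illustrative name; call it on $completedCurlRequest in the inner loop above):

function logCurlTiming($ch)
{
    // curl_getinfo() reports where each transfer spent its time
    $info = curl_getinfo($ch);
    error_log(sprintf(
        '%s dns=%.3fs connect=%.3fs pretransfer=%.3fs starttransfer=%.3fs total=%.3fs',
        $info['url'],
        $info['namelookup_time'],
        $info['connect_time'],
        $info['pretransfer_time'],
        $info['starttransfer_time'],
        $info['total_time']
    ));
}

The INFO dump in the update below is this kind of curl_getinfo data for one of the slow requests.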

Update: Here is what I get back from a slow request when it finally completes:

INFO

"content_type": "image/jpeg",
"http_code": 200,
"header_size": 377,
"request_size": 180,
"total_time": 15.001012,    //<----- Total time == CURLOPT_TIMEOUT
"namelookup_time": 0.007149,
"connect_time": 0.12018,
"pretransfer_time": 0.441911,
"size_download": 40714,
"speed_download": 2714,
"download_content_length": -1,   //<------Not set

HEADER

HTTP/2 200 
content-type: image/jpeg
x-haystack-needlechecksum: 3529661797
timing-allow-origin: *
access-control-allow-origin: *
cache-control: max-age=1209600, no-transform
date: Mon, 04 Feb 2019 14:04:17 GMT
access-control-expose-headers: X-FB-CEC-Video-Limit

It is missing the content-length header, but that always seems to be the case the first time a file is fetched. Only 1 or 2 of the 50 parallel requests are slow, yet all of the requests are missing their content-length headers.

If I fetch the same file again, it is much quicker, and this time I do see the content-length being set:

INFO

"download_content_length": 52721,

HEADER

content-length: 52721           

Solution

  • My current theory is that there is a bug in Facebook's fileserver that means the connection is sometimes not closed even though the data has been sent, so the connection stays open until the timeout. In the absence of the (optional) content-length header from Facebook's fileserver, cURL can't tell whether the payload is complete, and so it hangs.

    My current solution is to 'prime' the fileserver by first making a request for the image without a body, like this:

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, 1);  // headers only -- no image body is transferred
    curl_exec($ch);
    curl_close($ch);
    

    This is a pretty quick process, since no image data is returned. I actually do this in the background using asynchronous multi cURL (sketched below), so I can get on with some other processing in the meantime.
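
    That background priming pass looks roughly like this (a sketch -- the $urls array and the 5-second timeout are illustrative):

    $mh = curl_multi_init();
    foreach ($urls as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_NOBODY, 1);   // headers only -- just priming the fileserver
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);  // keep the priming pass short
        curl_multi_add_handle($mh, $ch);
    }

    // Drive the handles to completion; other work can be interleaved here
    do {
        while (CURLM_CALL_MULTI_PERFORM === curl_multi_exec($mh, $running));
        if ($running) {
            curl_multi_select($mh, 0.1);
        }
    } while ($running);
    curl_multi_close($mh);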

    After priming the fileserver, subsequent requests for the files are even quicker than before, as the content-length is known.

    This is a bit of a clumsy approach, but in the absence of any response from Facebook so far I'm not sure what else to do.