Tags: php, ftp, get-headers

Fastest way to check for remote file (image) existence


I've written a product-syncing script between a local server running a merchant application and a remote web server hosting the store's eshop...

For the full sync option I need to sync 5000+ products, with their images etc... Even though size variations of the same product (different shoe sizes, for example) share the same product image, I still need to check the existence of around 3500 images...

So, for the first run, I uploaded through FTP all product images except for a couple of them, and let the script run to check if it would upload those couple of missing images...

The problem is that the script ran for 4 hours, which is unacceptable... I mean, I didn't re-upload every image... It just checked every single image to determine whether to skip it or upload it (through ftp_put()).

I was performing the check like this:

    if (stripos(get_headers(DESTINATION_URL . "{$path}/{$file}")[0], '200 OK') === false) {

which is pretty fast per request, but obviously not fast enough for the sync to finish in a reasonable amount of time...
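For reference, get_headers() issues a GET request by default, so one small saving per request is to send HEAD through a stream context and compare the numeric status code instead of searching for the text '200 OK'. The sketch below assumes PHP 7.1+ (for the context argument); `status_code_from_line()` and `remote_file_exists()` are hypothetical helper names, not part of the original script. It is still one sequential request per image, so it trims the cost of each check without changing the overall approach.

```php
<?php
// Hypothetical helpers; only get_headers(), stream_context_create() and
// preg_match() are PHP built-ins.

/** Extract the numeric status code from a status line such as "HTTP/1.1 200 OK". */
function status_code_from_line(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m) === 1) {
        return (int)$m[1];
    }
    return null; // not a recognizable HTTP status line
}

/** Check one URL with a HEAD request instead of the default GET. */
function remote_file_exists(string $url, float $timeoutSeconds = 5.0): bool
{
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'HEAD',
            'timeout' => $timeoutSeconds,
        ),
    ));
    // The third ($context) argument of get_headers() requires PHP 7.1+.
    $headers = @get_headers($url, false, $context);
    if ($headers === false || !isset($headers[0])) {
        return false; // request failed entirely
    }
    return status_code_from_line($headers[0]) === 200;
}
```

In the check above this would read `if (!remote_file_exists(DESTINATION_URL . "{$path}/{$file}")) {`.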

How do you people handle such situations where you have to check the existence of a HUGE number of remote files?


As a last resort, I'm left with using ftp_nlist() to download a list of the remote files and then writing an algorithm to more or less do a file compare between the local and remote files...

I tried it, and it takes ages, literally 30+ mins, for the recursive algorithm to build the filelist... You see, my files are not in one single folder... The whole tree spans 1,956 folders, and the filelist consists of 3,653 product image files and growing... Also note that I didn't even use the size "trick" (used in conjunction with ftp_nlist()) to determine whether an entry is a file or a folder, but rather used the newer ftp_mlsd(), which explicitly returns a type param that holds that info... You can read more here: PHP FTP recursive directory listing
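A minimal sketch of the listing-and-compare approach described above, assuming PHP 7.2+ (for ftp_mlsd()), an already connected and logged-in `$ftp` handle, and a server that reports the entry type under a lowercase `type` key. The names `list_remote_files()` and `missing_files()` are hypothetical. It does not make the listing itself any faster: each folder still costs one listing command, so 1,956 folders mean 1,956 round trips.

```php
<?php
/**
 * Recursively collect remote file paths, relative to $baseDir.
 * $ftp is assumed to come from ftp_connect() followed by ftp_login().
 */
function list_remote_files($ftp, string $baseDir, string $relDir = ''): array
{
    $path = $relDir === '' ? $baseDir : rtrim($baseDir, '/') . '/' . $relDir;
    $entries = ftp_mlsd($ftp, $path);
    if ($entries === false) {
        return array(); // listing failed; treat the folder as empty
    }
    $files = array();
    foreach ($entries as $entry) {
        $name = $entry['name'];
        $type = strtolower($entry['type'] ?? ''); // assumption: key is 'type'
        $rel  = $relDir === '' ? $name : $relDir . '/' . $name;
        if ($type === 'dir') {
            $files = array_merge($files, list_remote_files($ftp, $baseDir, $rel));
        } elseif ($type === 'file') {
            $files[] = $rel;
        }
        // 'cdir' and 'pdir' (current and parent directory) are skipped.
    }
    return $files;
}

/** Return the local relative paths that do not appear in the remote list. */
function missing_files(array $localFiles, array $remoteFiles): array
{
    $remoteIndex = array_flip($remoteFiles); // path => position, for fast lookups
    $missing = array();
    foreach ($localFiles as $path) {
        if (!isset($remoteIndex[$path])) {
            $missing[] = $path;
        }
    }
    return $missing;
}
```

Only the paths returned by `missing_files()` would then need to go through ftp_put().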


Solution

  • curl_multi is probably the fastest way. Unfortunately, curl_multi is rather difficult to use, so an example helps a lot, IMO. Checking URLs between two 1 Gbps dedicated servers in two different datacenters in Canada, this script managed to check around 3000 URLs per second using 500 concurrent TCP connections (and it could be made even faster by reusing curl handles instead of opening and closing one per URL).

    <?php
    declare(strict_types=1);
    $urls=array();
    for($i=0;$i<100000;++$i){
        $urls[]="http://ratma.net/";
    }
    validate_urls($urls,500,1000,false,false);
    // if return_fault_reason is false, then the return is a simple array of strings of urls that validated.
    // otherwise it's an array with the url as the key containing  array(bool validated,int curl_error_code,string reason) for every url
    function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = false) : array
    {
        if ($max_connections < 1) {
            throw new InvalidArgumentException("max_connections MUST be >=1");
        }
        foreach ($urls as $key => $foo) {
            if (!is_string($foo)) {
                throw new \InvalidArgumentException("all urls must be strings!");
            }
            if (empty($foo)) {
                unset($urls[$key]); //?
            }
        }
        unset($foo);
        // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
        $ret = array();
        $mh = curl_multi_init();
        $workers = array();
        $work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason, &$consider_http_300_redirect_as_error) {
            // > If an added handle fails very quickly, it may never be counted as a running_handle
            while (1) {
                curl_multi_exec($mh, $still_running);
                if ($still_running < count($workers)) {
                    break;
                }
                $cms=curl_multi_select($mh, 10);
                //var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
            }
            while (false !== ($info = curl_multi_info_read($mh))) {
                //echo "NOT FALSE!";
                //var_dump($info);
                {
                    if ($info['msg'] !== CURLMSG_DONE) {
                        continue;
                    }
                    if ($info['result'] !== CURLE_OK) { // 'result' holds a CURLE_* code for the finished transfer
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                        }
                    } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                        }
                    } else {
                        $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                        if ($code[0] === "3") {
                            if ($consider_http_300_redirect_as_error) {
                                if ($return_fault_reason) {
                                    $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                                }
                            } else {
                                if ($return_fault_reason) {
                                    $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                                } else {
                                    $ret[] = $workers[(int)$info['handle']];
                                }
                            }
                        } elseif ($code[0] === "2") {
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                            } else {
                                $ret[] = $workers[(int)$info['handle']];
                            }
                        } else {
                            // all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etcetc)
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                            }
                        }
                    }
                    curl_multi_remove_handle($mh, $info['handle']);
                    assert(isset($workers[(int)$info['handle']]));
                    unset($workers[(int)$info['handle']]);
                    curl_close($info['handle']);
                }
            }
            //echo "NO MORE INFO!";
        };
        foreach ($urls as $url) {
            while (count($workers) >= $max_connections) {
                //echo "TOO MANY WORKERS!\n";
                $work();
            }
            $neww = curl_init($url);
            if (!$neww) {
                trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
                if ($return_fault_reason) {
                    $ret[$url] = array(false, -1, "curl_init() failed");
                }
                continue;
            }
            $workers[(int)$neww] = $url;
            curl_setopt_array($neww, array(
                CURLOPT_NOBODY => 1,
                CURLOPT_SSL_VERIFYHOST => 0,
                CURLOPT_SSL_VERIFYPEER => 0,
                CURLOPT_TIMEOUT_MS => $timeout_ms
            ));
            curl_multi_add_handle($mh, $neww);
            //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
        }
        while (count($workers) > 0) {
            //echo "WAITING FOR WORKERS TO BECOME 0!";
            //var_dump(count($workers));
            $work();
        }
        curl_multi_close($mh);
        return $ret;
    }
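One caveat when running the function above on current PHP: it keys `$workers` by `(int)$handle`, which relies on curl handles being resources. Since PHP 8.0 they are `CurlHandle` objects, and casting an object to int does not give a unique key. The following is a shorter sketch of the same curl_multi idea for PHP 8+, keyed by spl_object_id() instead. `check_urls_exist()` is a hypothetical name, it only treats HTTP 200 as "exists", and it is a simplified variant, not a drop-in replacement for the function above.

```php
<?php
/**
 * Return the subset of $urls that answered a HEAD request with HTTP 200.
 * Assumes PHP 8+ (curl handles are objects) and the curl extension.
 */
function check_urls_exist(array $urls, int $maxConnections = 50, int $timeoutMs = 10000): array
{
    $existing = array();
    $pending  = array_values($urls);
    $active   = array(); // spl_object_id(handle) => url
    $mh = curl_multi_init();

    while ($pending !== array() || $active !== array()) {
        // Top up the pool until the connection limit is reached.
        while ($pending !== array() && count($active) < $maxConnections) {
            $url = array_pop($pending);
            $ch = curl_init($url);
            if ($ch === false) {
                continue; // could not create a handle; skip this URL
            }
            curl_setopt_array($ch, array(
                CURLOPT_NOBODY     => true, // HEAD request, no body download
                CURLOPT_TIMEOUT_MS => $timeoutMs,
            ));
            curl_multi_add_handle($mh, $ch);
            $active[spl_object_id($ch)] = $url;
        }

        // Drive all transfers, then wait briefly for socket activity.
        curl_multi_exec($mh, $stillRunning);
        if ($stillRunning > 0) {
            curl_multi_select($mh, 1.0);
        }

        // Collect every transfer that has finished so far.
        while (($info = curl_multi_info_read($mh)) !== false) {
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            $ch = $info['handle'];
            $id = spl_object_id($ch);
            if ($info['result'] === CURLE_OK
                && curl_getinfo($ch, CURLINFO_RESPONSE_CODE) === 200) {
                $existing[] = $active[$id];
            }
            unset($active[$id]);
            curl_multi_remove_handle($mh, $ch);
            // On PHP 8 the handle is freed once the last reference is gone.
        }
    }
    return $existing;
}
```

For the question's setup, the URL list could be built as `DESTINATION_URL . "{$path}/{$file}"` for each local image, and `array_diff($urls, check_urls_exist($urls))` would then give the images that still need to be uploaded with ftp_put().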