c++multithreadingvirtual-machinelibcurlfutex

Using libcurl in a multithreaded environment causes VERY slow performance related to DNS lookup


You'll have to pardon the rather large-ish block of code, but I believe this is a near-minimal reproduction of my problem. The problem is not isolated to example.com but persists across many other sites.

If I have 4 threads actively making network requests, curl works 100% fine.

If I add one more thread, that thread takes ~10x the time to execute. I feel like I must be missing something obvious, but it escapes me right now.

UPDATE with more information: These tests are in a virtual machine. Independent of the number of cores available to the machine, four of the requests take ~100ms, and the remainder take ~5500ms.

UPDATE 2: Actually, I was wrong in one aspect, it isn't always 4 / n-4 distribution -- when I've changed to 4 cores, sometimes I get a different result distribution (running on 1 core at least seemed relatively consistent) - here's a snippet of the results when the threads return their latency (ms) instead of their http code when running on a 4 core VM:

   191  191
   198  198  167
   209  208  202  208
   215  207  214  209  209
  5650  213 5649  222  193  207
   206  201  164  205  201  201  205
  5679 5678 5666 5678  216  173  205  175
  5691  212  179  206 5685 5688  211 5691 5680
  5681  199  210 5678 5663  213 5679  212 5666  428

UPDATE 3: I built curl and openssl from scratch, removed the locking (as openssl 1.1.0g doesn't require it) and the problem persists. ( Sanity check / verified by the following):

std::cout << "CURL:\n  " << curl_version_info(CURLVERSION_NOW)->ssl_version
          << "\n";
std::cout << "SSLEAY:\n  " << SSLeay_version(SSLEAY_VERSION) << "\n";

outputting:

CURL:                       
  OpenSSL/1.1.0g            
SSLEAY:                     
  OpenSSL 1.1.0g  2 Nov 2017

With example latencies:

   191  191
   197  197  196
   210  210  201  210
   212  212  199  200  165
  5656 5654  181  214  181  212
  5653 5651 5647  211  206  205  162
  5681 5674 5669  165  201  204  201 5681
  5880 5878 5657 5662  197  209 5664  173  174
  5906 5653 5664 5905 5663  173 5666  173  165  204

UPDATE 4: Setting CURLOPT_CONNECTTIMEOUT_MS equal to x makes x an upper limit on the time it takes to return.

UPDATE 5, MOST IMPORTANT:

Running the program under strace -T ./a.out 2>&1 | vim - with 5 threads, when the program only had 1 slow request, yielded two very slow lines. It was two calls to the same futex, one took way longer than the second, but both took longer than all other futex calls (most were 0.000011 ms, these two calls took 5.4 and 0.2 seconds to unlock).

Additionally, I verified that the slow-ness was entirely in curl_easy_perform.

futex(0x7efcb66439d0, FUTEX_WAIT, 3932, NULL) = 0 <5.390086>
futex(0x7efcb76459d0, FUTEX_WAIT, 3930, NULL) = 0 <0.204908>

Finally, after some looking around in source code, I found that the bug is somewhere in the DNS lookup. Replacing hostnames with IP addresses is a bandaid over the problem, wherever or whatever it is.

-----------


Below is my minimal reproduction / distillation of the issue, compiled with g++ -lpthread -lcurl -lcrypto main.cc, linked to the versions of openssl and libcurl built from source.

#include <chrono>
#include <iomanip>
#include <iostream>
#include <thread>
#include <vector>
#include <curl/curl.h>
#include <openssl/crypto.h>

size_t NoopWriteFunction(void *buffer, size_t size, size_t nmemb, void *userp) {
  return size * nmemb;
};

int GetUrl() {
  CURL *hnd = curl_easy_init();

  curl_easy_setopt(hnd, CURLOPT_URL, "https://www.example.com/");
  curl_easy_setopt(hnd, CURLOPT_HEADERFUNCTION, NoopWriteFunction);
  curl_easy_setopt(hnd, CURLOPT_WRITEFUNCTION, NoopWriteFunction);
  curl_easy_setopt(hnd, CURLOPT_SSH_KNOWNHOSTS, "/home/web/.ssh/known_hosts");

  CURLcode ret = curl_easy_perform(hnd);
  long http_code = 0;
  curl_easy_getinfo(hnd, CURLINFO_RESPONSE_CODE, &http_code);

  curl_easy_cleanup(hnd);
  hnd = NULL;
  if (ret != CURLE_OK) {
    return -ret;
  }
  return http_code;
}

int main() {
  curl_global_init(CURL_GLOBAL_ALL);

  for (int i = 1; i < 10; i++) {
    std::vector<std::thread> threads;
    int response_code[10]{};
    auto clock = std::chrono::high_resolution_clock();
    auto start = clock.now();
    threads.resize(i);
    for (int j = 0; j < i; j++) {
      threads.emplace_back(std::thread(
          [&response_code](int x) { response_code[x] = GetUrl(); }, j));
    }
    for (auto &t : threads) {
      if (t.joinable()) {
        t.join();
      }
    }
    auto end = clock.now();
    int time_to_execute =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
            .count();
    std::cout << std::setw(10) << time_to_execute;
    for (int j = 0; j < i; j++) {
      std::cout << std::setw(5) << response_code[j];
    }
    std::cout << "\n";
  }
}

And when I run the program on my machine, I get the following result (I can change the domain to whatever, the results are ~ the same):

   123  200
    99  200  200
   113  200  200  200
   119  200  200  200  200
  5577  200  200  200  200  200
  5600  200  200  200  200  200  200
  5598  200  200  200  200  200  200  200
  5603  200  200  200  200  200  200  200  200
  5606  200  200  200  200  200  200  200  200  200

And here is my curl version and openssl version:

$curl --version
curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL
$ openssl version
OpenSSL 1.1.0f  25 May 2017

Solution

  • The bug is somewhere in the DNS resolution, as indicated by my UPDATE 5.

    It's something to do with the lookup for IPV6, somewhere in getaddrinfo.

    Searching around indicates this is usually an ISP issue, or an over-aggressive-packet-filtering issue, combined with something else (what, I don't know) that makes this a really really odd edge case.

    Following the instructions on this page leads to the following workaround / solution:

    curl_easy_setopt(curl, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
    

    Which eliminates the issue as I've perceived it. IPV6 is hard. :(