docker sockets kubernetes redis beanstalkd

Beanstalkd prot.c:710 in check_err: read(): Connection reset by peer - slow socket communication in k8s cluster


In our Kubernetes cluster (hosted on GCP) we have 3 backend (BE) services.

We have 1 instance of the Beanstalkd service and 1 ES cluster (3 master pods, 2 data pods, 2 client pods). Each BE app has its own Redis instance and MySQL server.

We built various health checks, and we regularly (5-10 times a day) see connection speed issues. All the health check does is connect to each service from PHP and measure how long that takes; if it takes too long (or doesn't connect at all), we raise a flag.

We see results like:

Test took 10.616209983826 sec at 2020-04-28 23:45:11.
Environment: production
Test took too long - more than 2 seconds.
Checked items and their timing:
pdo took 0.0043079853057861 seconds
beanstalkd took 10.036059141159 seconds
redis took 0.0028140544891357 seconds
ElasticSeearch took 0.57300901412964 seconds

Sometimes it is Redis that takes this long, sometimes it is ES, but mostly it is Beanstalkd (up to 20 seconds!). This is a sample of how we check it:

$startbeanstalkd = microtime(true);
// connect to Beanstalkd
try {
    $queue = new Beanstalk(
        [
            'host' => $config->beanstalkd->host,
            'port' => $config->beanstalkd->port
        ]
    );
    $queue->connect();
} catch (\Exception $e) {
    $testPassed = false;
    $testResult['Beanstalkd']['status'] = false;
    $testResult['Beanstalkd']['message'] = $e->getMessage();
    $testResult['Beanstalkd']['connectionDetails'] = json_encode($config->beanstalkd);
}
if($testPassed) $queue->disconnect();
$beantime = microtime(true);
$timetrack['beanstalkd'] = $beantime - $startbeanstalkd;
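
A check like the one above measures the client library and the network together. To tell the two apart, a bare TCP connect with an explicit timeout can be timed alongside it. The following is only a minimal sketch, not the code from the health check itself: the helper name `timeTcpConnect` and the 2-second default timeout are assumptions made for illustration.

```php
<?php
/**
 * Time a bare TCP connect to host:port with an explicit timeout.
 * Hypothetical helper for illustration; not part of any library.
 *
 * @return array{ok: bool, seconds: float, error: string}
 */
function timeTcpConnect(string $host, int $port, float $timeoutSeconds = 2.0): array
{
    $start = microtime(true);
    $errno = 0;
    $errstr = '';

    // fsockopen() returns false on failure and fills $errno/$errstr.
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeoutSeconds);
    $elapsed = microtime(true) - $start;

    if ($socket === false) {
        return ['ok' => false, 'seconds' => $elapsed, 'error' => "[$errno] $errstr"];
    }

    // Always close, so the check itself never leaves a socket open.
    fclose($socket);

    return ['ok' => true, 'seconds' => $elapsed, 'error' => ''];
}
```

It could be called as `timeTcpConnect($config->beanstalkd->host, (int) $config->beanstalkd->port)`. If this raw connect is fast while the library connect is slow, the delay is more likely in the client code or the server's handling than in the network path.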

We regularly notice this error from beanstalkd:

/usr/bin/beanstalkd: prot.c:710 in check_err: read(): Connection reset by peer

But searching on this hasn't really given us much info.

Not really relevant, I think, but we use Rancher 2 to manage our clusters.

First, I'd love to know what the beanstalkd error means and how to solve it.

Second, I would love to hear any suggestions on the general problem of socket connections timing out or being extremely slow.

Thank you!


Solution

  • The problem was that we upgraded pheanstalk (the PHP library) from 3.x to version 4, and its documentation is/was bad.

    A connection method that previously worked was no longer accepted. The method name stayed the same, but it now requires an argument, and it doesn't raise any error if you omit that argument. If you don't pass it, you create zombie connections. So we were racking up zombie connections that ate up all the sockets in the whole cluster, bringing everything to a stop.
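
    One way to spot a leak like this is to watch beanstalkd's own connection counter over time. The following is a minimal sketch, assuming direct network access to the beanstalkd port: it sends the protocol's `stats` command over a raw socket and reads the `current-connections` field from the reply. The helper names `parseCurrentConnections` and `fetchCurrentConnections` are hypothetical, and the sketch deliberately avoids pheanstalk so that it does not depend on which library version is installed.

```php
<?php
/**
 * Extract the current-connections value from the YAML body of a
 * beanstalkd "stats" response. Returns null if the field is absent.
 */
function parseCurrentConnections(string $statsBody): ?int
{
    if (preg_match('/^current-connections:\s*(\d+)\s*$/m', $statsBody, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

/**
 * Ask a beanstalkd server for its stats over a raw socket and return
 * the number of currently open connections, or null on any failure.
 */
function fetchCurrentConnections(string $host, int $port = 11300, float $timeout = 2.0): ?int
{
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return null;
    }
    stream_set_timeout($socket, (int) ceil($timeout));

    try {
        fwrite($socket, "stats\r\n");

        // Expected reply header: "OK <bytes>\r\n", followed by <bytes> of YAML.
        $header = fgets($socket);
        if ($header === false || preg_match('/^OK (\d+)\r\n$/', $header, $m) !== 1) {
            return null;
        }

        $remaining = (int) $m[1];
        $body = '';
        while ($remaining > 0 && !feof($socket)) {
            $chunk = fread($socket, $remaining);
            if ($chunk === false || $chunk === '') {
                break;
            }
            $body .= $chunk;
            $remaining -= strlen($chunk);
        }

        return parseCurrentConnections($body);
    } finally {
        fclose($socket); // release the socket even on an early return
    }
}
```

    The returned number includes the connection made by the check itself. A value that keeps climbing while the number of workers stays constant would point to connections that are opened but never closed.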