websocket traefik sticky-session

Traefik sticky sessions and WebSockets


I'm using Traefik 2 in Docker Swarm mode for an application requiring sticky sessions. This application uses WebSockets. Sporadically, a user is logged out immediately after login for no obvious reason. I managed to catch a Chrome network trace when this happened, and saw a message on the WebSocket triggering the logout.

Given that this has only been reported in an environment with replication on the server and not in lower dev/QA environments with a single server instance, I suspected something around load balancing and session persistence.

I noticed the following:

Typically, the value of the Traefik sticky cookie is the URL (Docker-internal IP address and port) of the replica the session is bound to, like:

traefiksticky=http://10.0.24.149:8080; Path=/

In the problem scenario, though, the initial value of this cookie (set in response to a regular HTTP request, not WebSocket) has a different type of value:

traefiksticky=7361779c89a76cdc; Path=/

On the request establishing the WebSocket, this cookie is sent, but the response contains a Set-Cookie header changing the value of the cookie to the 'IP address' format:

traefiksticky=http://10.0.24.149:8080; Path=/
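To make the two cookie formats easier to spot in a captured trace, here is a minimal sketch that labels a `traefiksticky` value as URL-style or opaque-token-style and flags a session that saw both. The function names `classify_sticky_value` and `mixed_formats` are made up for illustration, and the "opaque" check simply assumes the token is hexadecimal, as in the example above; it is not based on how Traefik actually generates the value.

```python
from urllib.parse import urlsplit

HEX_DIGITS = set("0123456789abcdef")


def classify_sticky_value(value: str) -> str:
    """Label a sticky cookie value as 'url', 'opaque', or 'unknown'.

    'url' covers values like http://10.0.24.149:8080.
    'opaque' covers hex-looking tokens like 7361779c89a76cdc
    (assumption: the token is plain hexadecimal).
    """
    parts = urlsplit(value)
    if parts.scheme in ("http", "https") and parts.netloc:
        return "url"
    if value and set(value.lower()) <= HEX_DIGITS:
        return "opaque"
    return "unknown"


def mixed_formats(values) -> bool:
    """Return True if the observed values include both known formats."""
    kinds = {classify_sticky_value(v) for v in values}
    return {"url", "opaque"} <= kinds


# Values observed for one browser session, in order.
observed = ["7361779c89a76cdc", "http://10.0.24.149:8080"]
print([classify_sticky_value(v) for v in observed])
print("mixed formats:", mixed_formats(observed))
```

Feeding it the cookie values pulled from a HAR export of the failing login would show whether the format changed partway through the session.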

Shortly after establishment, a message is sent to the browser on the WebSocket indicating 'session timeout'. My guess is that the WebSocket connection was made to a different replica on the server side, and that's just the response sent for an unrecognized session.

(There's a flow at application startup that actually results in two WebSockets being established, one after the other, as the user proceeds through a couple of screens. I've seen the issue described above affect the first WebSocket, and I've also seen the first WebSocket work OK but the second one have the issue described above - so whatever is going on doesn't seem to affect all WebSocket connections.)

So, I'm wondering:

EDIT: One other variable is that Traefik itself is replicated in this environment, while it's not in the dev and QA environments. I've had a hard time finding documentation on such a configuration and whether there's an impact on sticky sessions.


Solution

  • Through some as-yet-unexplained Docker Swarm chicanery, our three replicas of Traefik were not all running the same version. Some were running version 2.1, which uses the 'IP address' scheme for the cookie value, while one was running 2.9.6, which uses the other scheme. We have an external load balancer in front of the three Traefik replicas, so successive requests from the same browser could land on replicas using different conventions. Presumably a replica on one version doesn't recognize a cookie issued by the other, picks a backend afresh, and overwrites the cookie, leading to the problem described.

    I've had some stern words with Docker Swarm and now all are on the newer version; I'm hopeful we'll no longer see the problem.
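As a rough way to check for this kind of drift, the sketch below groups the running tasks of a Swarm service by the image each one reports. It relies on `docker service ps` with the `--format` placeholders `{{.Node}}` and `{{.Image}}`; the service name `traefik_traefik` is a placeholder, and the helper names are invented for this example. It can only reveal a mismatch when the tasks report different image references, so it is a guess whether it would have caught the situation above.

```python
import subprocess
from collections import defaultdict


def images_by_node(ps_output: str) -> dict:
    """Parse lines of '<node> <image>' into {image: [nodes]}."""
    result = defaultdict(list)
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue
        node, image = line.split(None, 1)
        result[image].append(node)
    return dict(result)


def list_running_tasks(service: str) -> str:
    """Ask Docker for the node and image of each running task of a service."""
    completed = subprocess.run(
        ["docker", "service", "ps", service,
         "--filter", "desired-state=running",
         "--format", "{{.Node}} {{.Image}}"],
        check=True, capture_output=True, text=True,
    )
    return completed.stdout


def report(ps_output: str) -> bool:
    """Print the grouping and return True if more than one image is in use."""
    groups = images_by_node(ps_output)
    for image, nodes in sorted(groups.items()):
        print(f"{image}: {', '.join(nodes)}")
    return len(groups) > 1


# Sample output with made-up node names, standing in for a real cluster:
sample = "node-a traefik:2.1\nnode-b traefik:2.1\nnode-c traefik:2.9.6\n"
print("version drift:", report(sample))
```

On a manager node, `report(list_running_tasks("traefik_traefik"))` would run the same check against the live service.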