Tags: docker-compose, envoyproxy, grpc-web

envoy: grpc-web calls time out for the first few requests after a restart


Context:
I run Envoy as a grpc-web proxy in front of a number of gRPC servers. Each server has a dedicated route and cluster (see config below). Envoy runs inside a Docker container with no special changes (only the config and SSL certificates are mounted). Envoy and the gRPC servers are connected via a Docker network.

Problem:
Whenever I restart the Envoy container, the first few grpc-web calls time out before requests start going through. This happens even after the container has fully started, and leaving it running longer does not prevent it (I left it for hours). After these first ~3 failing requests everything works fine until the container is restarted again.

Relevant configs:
I removed any obviously unnecessary settings from the docker-compose file and condensed the Envoy config as much as possible (the repeated route/cluster parts for each server are omitted).

envoy.yaml:

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      listener_filters:
        - name: "envoy.filters.listener.tls_inspector"
          typed_config: { }
      filter_chains:
      # Use HTTPS (TLS) encryption for ingress data
      # Disable this to allow tools like bloomRPC which don't work via https
      - transport_socket:
          name: envoy.transport_socket.tls
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
            common_tls_context:
              tls_certificates:
              - certificate_chain:
                  filename: "/etc/envoy/envoy.pem"
                private_key:
                  filename: "/etc/envoy/envoy.key"
        filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              codec_type: auto
              stat_prefix: ingress_http
              access_log:
                - name: envoy.access_loggers.file
                  # Logger for gRPC requests (can be identified by the presence of the "x-grpc-web"-header)
                  filter:
                    header_filter:
                      header:
                        name: "x-grpc-web"
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                    path: /dev/stdout
                    format: "[%START_TIME%] \"%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%\": \"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\" -> \"%UPSTREAM_HOST%\" [gRPC-status: %GRPC_STATUS%] (cluster: %UPSTREAM_CLUSTER% route: %ROUTE_NAME%)\n"
                - name: envoy.access_loggers.file
                  # Logger for HTTP(s) requests (everything that is not a gRPC request)
                  filter:
                    header_filter:
                      header:
                        name: "x-grpc-web"
                        invert_match: true
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                    path: /dev/stdout
                    format: "[%START_TIME%] \"%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%\": \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\" -> \"%UPSTREAM_HOST%\" [http(s)-status: %RESPONSE_CODE%] (cluster: %UPSTREAM_CLUSTER% route: %ROUTE_NAME%)\n"
              stream_idle_timeout: 43200s # 12h
              route_config:
                name: local_route
                virtual_hosts:
                  - name: gRPC-Web-Proxy
                    domains: [ "*" ]
                    request_headers_to_add:
                      - header:
                          key: "source"
                          value: "envoy"
                        append: false
                      - header:
                          key: "downstream-address"
                          value: "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%"
                        append: false
                    cors:
                      allow_origin_string_match:
                        - prefix: "*"
                      allow_methods: GET, PUT, DELETE, POST, OPTIONS
                      allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout,x-envoy-retry-grpc-on,x-envoy-max-retries,auth-token,x-real-ip,client-ip,x-forwarded-for,x-forwarded,x-cluster-client-ip,forwarded-for,forwarded
                      max_age: "1728000"
                      expose_headers: grpc-status,grpc-message
                    routes: # https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto
                      - name: grpcserver_gRPCRoute
                        match:
                          prefix: "/api/services.grpcserver"
                        route:
                          cluster: grpcserver_gRPCCluster
                          prefix_rewrite: "/services.grpcserver"
                          timeout: 0s                     # No timeout. Otherwise, streams will be aborted regularly
              http_filters:
                - name: envoy.filters.http.grpc_web
                - name: envoy.filters.http.cors
                - name: envoy.filters.http.router
  clusters:
    - name: grpcserver_gRPCCluster
      connect_timeout: 0.25s
      type: static
      http2_protocol_options: { }
      lb_policy: round_robin
      load_assignment:
        cluster_name: grpcserver_gRPCCluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 20001
      transport_socket:
        # Connect to microservice via TLS
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          common_tls_context:
            tls_certificates:
              - certificate_chain: { "filename": "/etc/envoy/envoy.pem" }
                private_key: { "filename": "/etc/envoy/envoy.key" }
            # Validate CA of microservice
            validation_context:
              match_subject_alt_names:
              trusted_ca:
                filename: /etc/ssl/certs/ca-certificates.crt

docker-compose.yml:

version: '2.4'
networks:
  core:
    name: Service_Core
    driver: bridge
    ipam:
      config:
      - subnet: 198.51.100.0/24
        gateway: 198.51.100.1
services:
  envoy:
    container_name: "envoy"
    image: "envoyproxy/envoy:v1.17.1"
    ports:
      - 8080:8080
    networks:
      - core
    restart: always
    security_opt:
      - apparmor:unconfined
    environment:
      - ENVOY_UID=17200
      - ENVOY_GID=17200
    volumes:
      - "/somepath/envoy.pem:/etc/envoy/envoy.pem:ro"
      - "/somepath/envoy.key:/etc/envoy/envoy.key:ro"
      - "/somepath/ca.pem:/etc/ssl/certs/ca-certificates.crt:ro"
      - "/somepath/envoy.yml:/etc/envoy/envoy.yaml:ro"

  grpcserver:
    image: "<grpcserver>"
    container_name: "grpcserver"
    restart: always
    networks:
      - core
    security_opt:
      - apparmor:unconfined

  frontend:
    image: "<frontend>" # an nginx with the files for the UI
    container_name: "frontend"
    restart: always
    networks:
      - core
    ports:
     - 80:80
     - 443:443
    volumes:
      - "/somepath/ssl/:/opt/ssl/"
    security_opt:
      - apparmor:unconfined

What could be causing this behavior?
I am only interested in a fix in the Docker or Envoy config. I have already considered workarounds, but I would rather fix the underlying problem.


Solution

  • In your cluster configuration you specify a connect_timeout of 0.25s (250ms). This timeout bounds how long Envoy may take to establish a new connection to an upstream host; if the connection is not up within that window, the call fails.

    Right after a restart Envoy has no established connections to the upstream yet, so the first few calls each have to wait for a fresh connection to be set up, and it seems that this does not finish within such a short timeframe. Setting connect_timeout to a higher value (a few seconds) should do the trick.
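
    For illustration, a minimal sketch of the adjusted cluster entry; the 5s value is just an example, anything that comfortably covers connection setup to your services works, and everything not shown stays as in your config:

    clusters:
      - name: grpcserver_gRPCCluster
        connect_timeout: 5s    # was 0.25s; gives new upstream connections more time to come up
        type: static
        http2_protocol_options: { }
        lb_policy: round_robin
        load_assignment:
          cluster_name: grpcserver_gRPCCluster
          endpoints:
            - lb_endpoints:
                - endpoint:
                    address:
                      socket_address:
                        address: 127.0.0.1
                        port_value: 20001
        # transport_socket (upstream TLS) stays unchanged from the original config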