Context:
I run Envoy as a gRPC-Web proxy in front of a number of gRPC servers. Each server has a dedicated route and cluster (see the config below). Envoy runs inside a Docker container with no special changes (only the config and the SSL certificates are mounted). Envoy and the gRPC servers are connected via a Docker network.
Problem:
Whenever I restart the Envoy container, the first few gRPC-Web calls time out before requests start going through again. This happens even after the container has fully started, and leaving it running longer does not help (I left it for hours). After the first ~3 failing requests everything works fine until the container is restarted again.
Relevant configs:
I removed everything obviously unnecessary from the Docker Compose file and otherwise condensed the config as much as possible (I removed the parts that are repeated for each server).
envoy.yaml:
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      listener_filters:
        - name: "envoy.filters.listener.tls_inspector"
          typed_config: { }
      filter_chains:
        # Use HTTPS (TLS) encryption for ingress data
        # Disable this to allow tools like bloomRPC which don't work via https
        - transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain:
                      filename: "/etc/envoy/envoy.pem"
                    private_key:
                      filename: "/etc/envoy/envoy.key"
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                codec_type: auto
                stat_prefix: ingress_http
                access_log:
                  - name: envoy.access_loggers.file
                    # Logger for gRPC requests (identified by the presence of the "x-grpc-web" header)
                    filter:
                      header_filter:
                        header:
                          name: "x-grpc-web"
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                      path: /dev/stdout
                      format: "[%START_TIME%] \"%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%\": \"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\" -> \"%UPSTREAM_HOST%\" [gRPC-status: %GRPC_STATUS%] (cluster: %UPSTREAM_CLUSTER% route: %ROUTE_NAME%)\n"
                  - name: envoy.access_loggers.file
                    # Logger for HTTP(S) requests (everything that is not a gRPC request)
                    filter:
                      header_filter:
                        header:
                          name: "x-grpc-web"
                          invert_match: true
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
                      path: /dev/stdout
                      format: "[%START_TIME%] \"%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%\": \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\" -> \"%UPSTREAM_HOST%\" [http(s)-status: %RESPONSE_CODE%] (cluster: %UPSTREAM_CLUSTER% route: %ROUTE_NAME%)\n"
                stream_idle_timeout: 43200s # 12h
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: gRPC-Web-Proxy
                      domains: [ "*" ]
                      request_headers_to_add:
                        - header:
                            key: "source"
                            value: "envoy"
                          append: false
                        - header:
                            key: "downstream-address"
                            value: "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%"
                          append: false
                      cors:
                        allow_origin_string_match:
                          - prefix: "*"
                        allow_methods: GET, PUT, DELETE, POST, OPTIONS
                        allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout,x-envoy-retry-grpc-on,x-envoy-max-retries,auth-token,x-real-ip,client-ip,x-forwarded-for,x-forwarded,x-cluster-client-ip,forwarded-for,forwarded
                        max_age: "1728000"
                        expose_headers: grpc-status,grpc-message
                      routes: # https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto
                        - name: grpcserver_gRPCRoute
                          match:
                            prefix: "/api/services.grpcserver"
                          route:
                            cluster: grpcserver_gRPCCluster
                            prefix_rewrite: "/services.grpcserver"
                            timeout: 0s # No timeout. Otherwise, streams will be aborted regularly
                http_filters:
                  - name: envoy.filters.http.grpc_web
                  - name: envoy.filters.http.cors
                  - name: envoy.filters.http.router
  clusters:
    - name: grpcserver_gRPCCluster
      connect_timeout: 0.25s
      type: static
      http2_protocol_options: { }
      lb_policy: round_robin
      load_assignment:
        cluster_name: grpcserver_gRPCCluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 20001
      transport_socket:
        # Connect to microservice via TLS
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          common_tls_context:
            tls_certificates:
              - certificate_chain: { "filename": "/etc/envoy/envoy.pem" }
                private_key: { "filename": "/etc/envoy/envoy.key" }
            # Validate CA of microservice
            validation_context:
              match_subject_alt_names:
              trusted_ca:
                filename: /etc/ssl/certs/ca-certificates.crt
docker-compose.yml:
version: '2.4'
networks:
  core:
    name: Service_Core
    driver: bridge
    ipam:
      config:
        - subnet: 198.51.100.0/24
          gateway: 198.51.100.1
services:
  envoy:
    container_name: "envoy"
    image: "envoyproxy/envoy:v1.17.1"
    ports:
      - 8080:8080
    networks:
      - core
    restart: always
    security_opt:
      - apparmor:unconfined
    environment:
      - ENVOY_UID=17200
      - ENVOY_GID=17200
    volumes:
      - "/somepath/envoy.pem:/etc/envoy/envoy.pem:ro"
      - "/somepath/envoy.key:/etc/envoy/envoy.key:ro"
      - "/somepath/ca.pem:/etc/ssl/certs/ca-certificates.crt:ro"
      - "/somepath/envoy.yml:/etc/envoy/envoy.yaml:ro"
  grpcserver:
    image: "<grpcserver>"
    container_name: "grpcserver"
    restart: always
    networks:
      - core
    security_opt:
      - apparmor:unconfined
  frontend:
    image: "<frontend>" # an nginx with the files for the UI
    container_name: "frontend"
    restart: always
    networks:
      - core
    ports:
      - 80:80
      - 443:443
    volumes:
      - "/somepath/ssl/:/opt/ssl/"
    security_opt:
      - apparmor:unconfined
What could be causing this behavior?
I am only interested in a fix in the Docker or Envoy configuration. I have already considered using a workaround, but I would rather fix the underlying problem.
In your cluster configuration you specify connect_timeout: 0.25s. That value does not limit how long the service may take to respond; it limits how long Envoy waits for a new connection to the upstream host to be established, and since you connect to the microservice over TLS it also has to cover the TLS handshake.
Right after a restart Envoy has no connections to the cluster yet, so the first few requests each have to wait for a fresh connection to come up, which apparently takes longer than 250 ms. Once those connections exist they are reused, which is why everything works from then on. Setting connect_timeout to a higher value (a few seconds) should do the trick.
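As a minimal sketch of the change (the 5s value is just a guess on my part, anything comfortably above your connection-setup time will do; the rest of the cluster stays exactly as it is):

  clusters:
    - name: grpcserver_gRPCCluster
      # was 0.25s; give the TCP connect + TLS handshake enough time after a cold start
      connect_timeout: 5s
      type: static
      http2_protocol_options: { }
      lb_policy: round_robin
      # ... load_assignment and transport_socket unchanged ...

To confirm that this is what is happening, watch the cluster stats on the admin port (9901) right after a restart: the counter cluster.grpcserver_gRPCCluster.upstream_cx_connect_timeout should increase with each of the failing first requests and stop increasing once the calls start succeeding.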