I'm trying to make a gRPC service (the Thanos sidecar) externally accessible over a domain in my Kubernetes cluster (k3s). I am using Traefik as the ingress controller.
Any clues as to what I may be misconfiguring would be much appreciated. I am unclear where the problem lies: in the AWS NLB (do I need something specific for gRPC, or can I just use TCP on ports 80/443?), in the Traefik ingress, or in the service itself.
I have not found any errors in the Traefik logs or any obvious service misconfiguration.
The gRPC service is deployed in the cluster as a sidecar container of a Prometheus deployment. It is deployed using the kube-prometheus-stack Helm chart.
$ kubectl describe pod prometheus-monitoring-prometheus-0 -n monitoring
Name: prometheus-monitoring-prometheus-0
Namespace: monitoring
Priority: 0
Service Account: monitoring-prometheus
Node: k3s-node-1/12.345.678.910
Start Time: Wed, 26 Jul 2023 18:35:38 +0000
Labels: app.kubernetes.io/instance=monitoring-prometheus
app.kubernetes.io/managed-by=prometheus-operator
app.kubernetes.io/name=prometheus
...
prometheus=monitoring-prometheus
statefulset.kubernetes.io/pod-name=prometheus-monitoring-prometheus-0
Annotations: kubectl.kubernetes.io/default-container: prometheus
Status: Running
IP: 10.42.0.200
IPs:
IP: 10.42.0.200
Controlled By: StatefulSet/prometheus-monitoring-prometheus
...
Containers:
...
thanos-sidecar:
Container ID: containerd://bdc1bbfe53bf1ea260c47a44ab26110432388fe5592e037c83da5c6b6c5f696f
Image: quay.io/thanos/thanos:v0.31.0
Image ID: quay.io/thanos/thanos@sha256:e7d337d6ac24233f0f9314ec9830291789e16e2b480b9d353be02d05ce7f2a7e
Ports: 10902/TCP, 10901/TCP
Host Ports: 0/TCP, 0/TCP
Args:
sidecar
--prometheus.url=http://127.0.0.1:9090/
--prometheus.http-client={"tls_config": {"insecure_skip_verify":true}}
--grpc-address=:10901
--http-address=:10902
--objstore.config=$(OBJSTORE_CONFIG)
--tsdb.path=/prometheus
--log.level=info
--log.format=logfmt
State: Running
Started: Wed, 26 Jul 2023 18:35:41 +0000
Ready: True
Restart Count: 0
Environment:
OBJSTORE_CONFIG: <set to the key 'objstore.yml' in secret 'my-s3-bucket'> Optional: false
Mounts:
/prometheus from prometheus-monitoring-prometheus-db (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-slz8t (ro)
...
The sidecar container is then exposed using a dedicated service:
$ kubectl describe svc monitoring-thanos-discovery -n monitoring
Name: monitoring-thanos-discovery
Namespace: monitoring
Labels: app=monitoring-thanos-discovery
app.kubernetes.io/instance=monitoring
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/part-of=monitoring
app.kubernetes.io/version=47.2.0
chart=kube-prometheus-stack-47.2.0
heritage=Helm
release=monitoring
Annotations: meta.helm.sh/release-name: monitoring
meta.helm.sh/release-namespace: monitoring
traefik.ingress.kubernetes.io/service.serversscheme: h2c
Selector: app.kubernetes.io/name=prometheus,prometheus=monitoring-prometheus
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: None
IPs: None
Port: grpc 10901/TCP
TargetPort: grpc/TCP
Endpoints: 10.42.0.200:10901
Port: http 10902/TCP
TargetPort: http/TCP
Endpoints: 10.42.0.200:10902
Session Affinity: None
Events: <none>
I am using an Ingress (default) to create a TLS certificate for my domain and an IngressRoute (Traefik-specific) to expose the service via what I believe to be an HTTP/2-capable endpoint.
thanos-ingress-dummy.yaml
# We use this resource to get a certificate for the given domain (To use with ingressroute)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-discovery-ingress-dummy
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  rules:
    - host: "thanos-gateway.monitoring.domain.com"
      http:
        paths:
          - path: /cert-placeholder
            pathType: Prefix
            backend:
              service:
                name: monitoring-thanos-discovery
                port:
                  name: grpc
  tls:
    - hosts:
        - "thanos-gateway.monitoring.domain.com"
      secretName: thanos-sidecar-grpc-tls
thanos-ingressroute.yaml
# We use IngressRoute to allow our grpc server to be reachable. (Supports grpc over http2)
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: thanos-discovery-ingress
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`thanos-gateway.monitoring.domain.com`)
      kind: Rule
      services:
        - name: monitoring-thanos-discovery
          port: grpc
  tls:
    secretName: thanos-sidecar-grpc-tls
Here's a picture of what this should look like right now.
The gRPC service is not reachable from outside the cluster over the specified domain.
From within a container inside the cluster, I am able to communicate with the server using grpcurl against the monitoring-thanos-discovery service using the internal cluster DNS.
$ kubectl exec -it debian-debug -- bash
root@debian-debug:/# grpcurl -plaintext monitoring-thanos-discovery.monitoring.svc.cluster.local:10901 grpc.health.v1.Health.Check
{
"status": "SERVING"
}
When I try the same from outside the cluster against the domain I have specified in the ingresses (thanos-gateway.monitoring.domain.com), I get the following.
$ grpcurl --plaintext thanos-gateway.monitoring.domain.com:443 list
Failed to list services: server does not support the reflection API
When I curl the endpoint, I can verify that the request is being handled by Traefik, but an Internal Server Error response is returned. Curling the http endpoint results in a 404, which is expected given that I only specified websecure in my ingress. I had previously also specified web in the ingress, with the same responses from grpcurl and curl as on port 443.
$ curl https://thanos-gateway.monitoring.domain.com
Internal Server Error
$ curl http://thanos-gateway.monitoring.domain.com
404 page not found
To answer my own question, the issue was twofold.
Using grpcurl with --plaintext when the only available endpoint uses TLS results in the response below. In other words, --plaintext should be left out of the command when you have configured your route to use TLS.
Failed to list services: server does not support the reflection API
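The flag rule above can be sketched as a tiny wrapper. This is only an illustration: the function name grpcurl_cmd and its tls/plaintext mode argument are made up here, and only grpcurl and its -plaintext flag are real. It prints the command line instead of running it, so it can be tried without a cluster.

```shell
# Hypothetical helper (not part of grpcurl): print the grpcurl command line
# that matches the endpoint type.
#   tls       -> endpoint terminates TLS (e.g. Traefik's websecure entrypoint)
#   plaintext -> endpoint speaks cleartext HTTP/2 (e.g. the in-cluster service)
grpcurl_cmd() {
  mode="$1"
  target="$2"
  shift 2
  case "$mode" in
    tls)       echo "grpcurl $target $*" ;;
    plaintext) echo "grpcurl -plaintext $target $*" ;;
    *)         echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# External, TLS-terminated domain: no -plaintext.
grpcurl_cmd tls thanos-grpc.domain.com:443 list
# In-cluster service on the sidecar's gRPC port: -plaintext is required.
grpcurl_cmd plaintext monitoring-thanos-discovery.monitoring.svc.cluster.local:10901 grpc.health.v1.Health.Check
```

The two calls correspond to the external and the in-cluster invocations shown in this post.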
What I changed
I do not need the "fake" ingress (thanos-ingress-dummy.yaml) because I already have a wildcard certificate for *.domain.com.
I changed the domain to thanos-grpc.domain.com to use the already existing TLS certificate (the old approach of making a fake ingress would probably still work, but I haven't checked).
The new thanos-ingressroute.yaml:
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: thanos
  namespace: monitoring
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`thanos-grpc.domain.com`)
      kind: Rule
      services:
        - name: monitoring-thanos-discovery
          namespace: monitoring
          port: 10901
          scheme: h2c
          passHostHeader: true
  tls:
    secretName: my-domain-wildcard-tls
This is the response I now get calling the configured domain.
$ grpcurl thanos-grpc.domain.com:443 list
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
thanos.Exemplars
thanos.Metadata
thanos.Rules
thanos.Store
thanos.Targets
thanos.info.Info
NOTE THAT I AM NOT USING THE --plaintext FLAG ANYMORE.
If I use --plaintext, I get the same old response: Failed to list services: server does not support the reflection API.
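For anyone debugging a similar setup, one extra check that might help is whether the TLS endpoint negotiates HTTP/2 at all, since gRPC needs it. A minimal sketch, assuming openssl is installed: alpn_from_s_client is a made-up helper name, and the parsing relies on the "ALPN protocol: ..." line that openssl s_client prints when a protocol was negotiated.

```shell
# Hypothetical helper: read `openssl s_client` output on stdin and print the
# negotiated ALPN protocol (e.g. "h2"). Prints nothing if none was negotiated.
alpn_from_s_client() {
  sed -n 's/^ALPN protocol: //p' | head -n 1
}

# Usage against the live endpoint (requires network access):
#   openssl s_client -connect thanos-grpc.domain.com:443 \
#     -servername thanos-grpc.domain.com -alpn h2 </dev/null 2>/dev/null \
#     | alpn_from_s_client

# Offline demonstration with a canned excerpt of s_client output.
printf 'New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256\nALPN protocol: h2\n' | alpn_from_s_client
```

If the live check prints h2, the entrypoint can carry gRPC over TLS, and any remaining failure is more likely in the route or the backend scheme than in the TLS layer.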