Tags: amazon-web-services, kubernetes, grpc, traefik-ingress, thanos

Exposing a gRPC server with Traefik ingress on a Kubernetes cluster


I'm trying to make a gRPC service (the Thanos sidecar) externally accessible over a domain in my Kubernetes cluster (k3s). I am using Traefik as the ingress controller.

Any clues as to what I may be misconfiguring would be much appreciated. I am unsure where the problem lies: in the AWS NLB (do I need something specific for gRPC, or can I just use TCP on ports 80/443?), in the Traefik ingress, or in the service itself.

I have not found any errors in the Traefik logs, nor any obvious service misconfiguration.

Environment

The gRPC service runs in the cluster as a sidecar container of a Prometheus deployment. It is deployed using the kube-prometheus-stack Helm chart.

$ kubectl describe pod prometheus-monitoring-prometheus-0 -n monitoring
Name:             prometheus-monitoring-prometheus-0
Namespace:        monitoring
Priority:         0
Service Account:  monitoring-prometheus
Node:             k3s-node-1/12.345.678.910
Start Time:       Wed, 26 Jul 2023 18:35:38 +0000
Labels:           app.kubernetes.io/instance=monitoring-prometheus
                  app.kubernetes.io/managed-by=prometheus-operator
                  app.kubernetes.io/name=prometheus
                  ...
                  prometheus=monitoring-prometheus
                  statefulset.kubernetes.io/pod-name=prometheus-monitoring-prometheus-0
Annotations:      kubectl.kubernetes.io/default-container: prometheus
Status:           Running
IP:               10.42.0.200
IPs:
  IP:           10.42.0.200
Controlled By:  StatefulSet/prometheus-monitoring-prometheus
...
Containers:
  ...
  thanos-sidecar:
    Container ID:  containerd://bdc1bbfe53bf1ea260c47a44ab26110432388fe5592e037c83da5c6b6c5f696f
    Image:         quay.io/thanos/thanos:v0.31.0
    Image ID:      quay.io/thanos/thanos@sha256:e7d337d6ac24233f0f9314ec9830291789e16e2b480b9d353be02d05ce7f2a7e
    Ports:         10902/TCP, 10901/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      sidecar
      --prometheus.url=http://127.0.0.1:9090/
      --prometheus.http-client={"tls_config": {"insecure_skip_verify":true}}
      --grpc-address=:10901
      --http-address=:10902
      --objstore.config=$(OBJSTORE_CONFIG)
      --tsdb.path=/prometheus
      --log.level=info
      --log.format=logfmt
    State:          Running
      Started:      Wed, 26 Jul 2023 18:35:41 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      OBJSTORE_CONFIG:  <set to the key 'objstore.yml' in secret 'my-s3-bucket'>  Optional: false
    Mounts:
      /prometheus from prometheus-monitoring-prometheus-db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-slz8t (ro)
...

The sidecar container is then exposed through a dedicated Service:

$ kubectl describe svc monitoring-thanos-discovery -n monitoring
Name:              monitoring-thanos-discovery
Namespace:         monitoring
Labels:            app=monitoring-thanos-discovery
                   app.kubernetes.io/instance=monitoring
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/part-of=monitoring
                   app.kubernetes.io/version=47.2.0
                   chart=kube-prometheus-stack-47.2.0
                   heritage=Helm
                   release=monitoring
Annotations:       meta.helm.sh/release-name: monitoring
                   meta.helm.sh/release-namespace: monitoring
                   traefik.ingress.kubernetes.io/service.serversscheme: h2c
Selector:          app.kubernetes.io/name=prometheus,prometheus=monitoring-prometheus
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                None
IPs:               None
Port:              grpc  10901/TCP
TargetPort:        grpc/TCP
Endpoints:         10.42.0.200:10901
Port:              http  10902/TCP
TargetPort:        http/TCP
Endpoints:         10.42.0.200:10902
Session Affinity:  None
Events:            <none>

I am using a standard Kubernetes Ingress to obtain a TLS certificate for my domain, and an IngressRoute (Traefik-specific) to expose the service via what I believe to be an HTTP/2-capable endpoint.

thanos-ingress-dummy.yaml

# We use this resource to get a certificate for the given domain (To use with ingressroute)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-discovery-ingress-dummy
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  rules:
    - host: "thanos-gateway.monitoring.domain.com"
      http:
        paths:
          - path: /cert-placeholder
            pathType: Prefix
            backend:
              service:
                name: monitoring-thanos-discovery
                port:
                  name: grpc
  tls:
    - hosts:
        - "thanos-gateway.monitoring.domain.com"
      secretName: thanos-sidecar-grpc-tls

thanos-ingressroute.yaml

# We use IngressRoute to allow our grpc server to be reachable. (Supports grpc over http2)
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: thanos-discovery-ingress
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`thanos-gateway.monitoring.domain.com`)
      kind: Rule
      services:
        - name: monitoring-thanos-discovery
          port: grpc
  tls:
    secretName: thanos-sidecar-grpc-tls

Here's a picture of what this should look like right now.

[Figure: Thanos querier talking to external sidecar]

Problem

The gRPC service is not reachable from outside the cluster over the specified domain.

From within a container inside the cluster, I am able to communicate with the server using grpcurl against the monitoring-thanos-discovery service using the internal cluster DNS.

$ kubectl exec -it debian-debug -- bash
root@debian-debug:/# grpcurl -plaintext monitoring-thanos-discovery.monitoring.svc.cluster.local:10901 grpc.health.v1.Health.Check
{
  "status": "SERVING"
}

When I try the same from outside the cluster against the domain I have specified in the ingresses (thanos-gateway.monitoring.domain.com), I get the following.

$ grpcurl --plaintext thanos-gateway.monitoring.domain.com:443 list
Failed to list services: server does not support the reflection API

When I curl the endpoint, I can verify that the request is being handled by Traefik, but it responds with Internal Server Error. Curling the http endpoint returns a 404, which is expected since I only specified websecure in my ingress. I previously also had web specified in the ingress, and got the same responses from grpcurl and curl as on port 443.

$ curl https://thanos-gateway.monitoring.domain.com
Internal Server Error

$ curl http://thanos-gateway.monitoring.domain.com
404 page not found

Solution

  • To answer my own question, the issue was twofold.

    1. Calling grpcurl with --plaintext when the only available endpoint uses TLS results in the response below. In other words, --plaintext should be left out of the command when the route is configured to use TLS.

    Failed to list services: server does not support the reflection API

    2. The IngressRoute configuration needed some polishing.
      I stumbled upon an unrelated Stack Overflow question which led me to the correct way to set up the configuration. I am not sure which part of the changes made it work, but I believe the addition of namespace, scheme and passHostHeader does the trick.

    What I changed

    The new thanos-ingressroute.yaml

    apiVersion: traefik.containo.us/v1alpha1
    kind: IngressRoute
    metadata:
      name: thanos
      namespace: monitoring
    spec:
      entryPoints:
        - websecure
      routes:
        - match: Host(`thanos-grpc.domain.com`)
          kind: Rule
          services:
            - name: monitoring-thanos-discovery
              namespace: monitoring
              port: 10901
              scheme: h2c
              passHostHeader: true
      tls:
        secretName: my-domain-wildcard-tls
    

    This is the response I now get when calling the configured domain.

    $ grpcurl thanos-grpc.domain.com:443 list
    
    grpc.health.v1.Health
    grpc.reflection.v1alpha.ServerReflection
    thanos.Exemplars
    thanos.Metadata
    thanos.Rules
    thanos.Store
    thanos.Targets
    thanos.info.Info
    

    Note that I am no longer using the --plaintext flag.

    If I use --plaintext, I get the same old response: Failed to list services: server does not support the reflection API.
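
As a rough sketch (not part of the original setup), the plaintext-versus-TLS distinction described above can be checked with a small shell helper. The function name probe_grpc is hypothetical. It assumes grpcurl is on the PATH and that the server exposes the reflection API, as the Thanos sidecar does in the listing above.

```shell
# probe_grpc HOST:PORT
# Hypothetical helper: tries `grpcurl ... list` over TLS first, then in
# plaintext, and reports which mode the endpoint accepted.
# Each failed attempt may wait for grpcurl's own connection timeout.
probe_grpc() {
  target="$1"
  if grpcurl "$target" list >/dev/null 2>&1; then
    # No -plaintext flag: grpcurl negotiates TLS with the endpoint.
    echo "$target: TLS"
  elif grpcurl -plaintext "$target" list >/dev/null 2>&1; then
    # -plaintext: cleartext HTTP/2 (h2c), no TLS handshake.
    echo "$target: plaintext"
  else
    echo "$target: unreachable or reflection disabled"
    return 1
  fi
}
```

For example, `probe_grpc thanos-grpc.domain.com:443` would be expected to report TLS with the IngressRoute above, while the in-cluster address `monitoring-thanos-discovery.monitoring.svc.cluster.local:10901` would be expected to report plaintext.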