metrics, latency, envoyproxy, mesh-network, servicemesh

What Envoy metric should be used to measure server latency?


My server is a service deployed to a service mesh implemented with Istio and Envoy sidecars. I only have access to Envoy metrics. The HTTP server receives requests from clients external to the mesh, as in the diagram below:

[Diagram: external client → Ingress Gateway → Envoy sidecar → HTTP server]

I want to measure the average latency it takes for the server to respond to the external client, and I'm having trouble understanding whether I need upstream or downstream metrics, and whether the external/internal variants matter. According to the Envoy docs:

Downstream: A downstream host connects to Envoy, sends requests, and receives responses.

Upstream: An upstream host receives connections and requests from Envoy and returns responses.

In this scenario, as far as I understand:

  1. The Ingress Gateway acts as the upstream for the external client and as the downstream for the Envoy sidecar.
  2. The Envoy sidecar acts as the upstream for the Ingress Gateway and as the downstream for the HTTP server.

As far as I understand, both the Ingress Gateway and the Envoy sidecar publish downstream and upstream metrics. How do I get the total latency: from the point a request reaches the Ingress Gateway to the point the Ingress Gateway returns the last byte of the response?
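For concreteness, here is a minimal sketch of how that gateway-level latency could be read straight from Envoy's stats. It assumes the gateway's raw Envoy metrics are exposed on port 15090 at /stats/prometheus (the usual Istio default; the hostname is a placeholder), that the downstream_rq_time histogram of the gateway's HTTP connection manager is the right "request in, last response byte out" measure, and that this histogram is included by your proxyStatsMatcher configuration:

    # Minimal sketch (assumptions: Istio exposes the gateway's raw Envoy metrics
    # on port 15090 at /stats/prometheus; the hostname below is a placeholder).
    import requests

    STATS_URL = "http://ingressgateway.example:15090/stats/prometheus"

    def histogram_average(text: str, metric: str) -> float:
        """Average of a Prometheus histogram: sum of *_sum over sum of *_count."""
        sums = counts = 0.0
        for line in text.splitlines():
            if line.startswith(metric + "_sum"):
                sums += float(line.rsplit(" ", 1)[1])
            elif line.startswith(metric + "_count"):
                counts += float(line.rsplit(" ", 1)[1])
        return sums / counts if counts else float("nan")

    body = requests.get(STATS_URL, timeout=5).text
    # envoy_http_downstream_rq_time: total time for request and response (ms),
    # measured by the gateway's HTTP connection manager, i.e. from the request
    # arriving at the gateway until the last byte of the response is sent back.
    print("avg latency at gateway (ms):",
          histogram_average(body, "envoy_http_downstream_rq_time"))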


Solution

  • There doesn't seem to be a noticeable difference between the two (downstream vs. upstream latency). If you want per-route metrics, you can achieve something like that using virtual clusters, but it will only give you upstream metrics (see the comparison sketch after this list).

    From https://github.com/envoyproxy/envoy/issues/10967

    Assuming there are no blocking filters in use (e.g. ext_auth) I expect the times to be reasonably close together.

    For per-route metrics, see https://github.com/envoyproxy/envoy/issues/23642
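A rough way to check the "reasonably close" claim on a live gateway, reusing STATS_URL and histogram_average() from the sketch in the question above (same assumptions about the stats endpoint and about which histograms Istio exposes):

    # Downstream view: request arrives at the gateway -> last response byte sent.
    body = requests.get(STATS_URL, timeout=5).text
    down = histogram_average(body, "envoy_http_downstream_rq_time")
    # Upstream view: time the gateway spends waiting on its upstream clusters
    # (here, the server's sidecar), aggregated over all clusters.
    up = histogram_average(body, "envoy_cluster_upstream_rq_time")
    print(f"downstream avg: {down:.1f} ms, upstream avg: {up:.1f} ms")

If per-route numbers are needed, the upstream_rq_time histograms that virtual clusters emit (per virtual host / virtual cluster) could be averaged the same way.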