
Prometheus - ingesting OTEL metrics drops histograms


I am collecting OpenTelemetry (OTEL) metrics from my services, forwarding them to an OTEL Collector, and then sending them to Prometheus through its OTLP endpoint. Here is a sample OTEL Collector configuration I am using:

  # Receivers
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318
          auth:
            authenticator: bearertokenauth

  # Processors
  processors:
    batch: {}

  # Exporters
  exporters:
    otlphttp/logs:
      endpoint: "http://loki-write:3100/otlp"
    otlphttp/metrics:
      endpoint: "http://prometheus-server/api/v1/otlp"
    otlphttp/traces:
      endpoint: "http://tempo-distributor:4318"

  # Pipelines
  service:
    extensions:
      - health_check
      - bearertokenauth
    pipelines:
      logs:
        receivers: [otlp]
        processors: [batch]
        exporters: [otlphttp/logs]
      metrics:
        receivers: [otlp]
        processors: [batch]
        exporters: [otlphttp/metrics]
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [otlphttp/traces]

Problem:

I can view most metrics in Prometheus (via Grafana), but histogram metrics are missing. For example, metrics like http_server_request_duration_sum do not appear in Prometheus.

Troubleshooting Steps:

  1. To rule out issues with the OTEL pipeline, I tested sending the same metrics to a VictoriaMetrics backend. In VictoriaMetrics, I can see all metrics, including histograms (e.g., http_server_request_duration_sum). The only difference is that dots (.) in metric names are not replaced with underscores (_).
  2. I enabled native histograms in Prometheus by modifying the deployment configuration, but this did not resolve the issue.
  3. I am using the latest version of Prometheus deployed via a Helm chart in an EKS cluster.

Question:

What could be the reason for histogram metrics not appearing in Prometheus? Are there any specific configurations in the OTEL Collector, Prometheus, or Grafana that I need to check?


Solution

  • The OTEL client SDKs (including the .NET SDK) do not use exponential histograms by default, so first remove the native histogram feature flag from the Prometheus configuration. See https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/docs/metrics/customizing-the-sdk/README.md#configuring-the-aggregation-of-a-histogram

    The OTEL client sends explicit-bucket histograms, which can only be converted to classic Prometheus histograms: https://www.prometheus.io/docs/specs/native_histograms/#otlp
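
    As a rough illustration of removing the flag, the Helm values might look like the fragment below. This is only a sketch: it assumes the prometheus-community/prometheus chart with its server.extraFlags and server.extraArgs keys, and a Prometheus 3.x image where the OTLP endpoint is switched on with web.enable-otlp-receiver. Check the key and flag names against your own chart and Prometheus version.

    ```yaml
    # values.yaml fragment (assumed chart: prometheus-community/prometheus)
    server:
      extraFlags:
        - web.enable-lifecycle
        # Assumption: Prometheus 3.x. This flag enables /api/v1/otlp;
        # on 2.x the equivalent was --enable-feature=otlp-write-receiver.
        - web.enable-otlp-receiver
      # No "enable-feature: native-histograms" entry here. Explicit-bucket
      # histograms are stored as classic _bucket/_sum/_count series and
      # do not need the native histogram feature.
      extraArgs: {}
    ```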

    OTEL_METRIC_EXPORT_INTERVAL in the OTEL client SDK defaults to 60 seconds, whereas Grafana assumes a 15-second scrape interval. There are two options: set the export interval to 15 seconds in the OTEL client, as described at https://prometheus.io/docs/guides/opentelemetry/, or change the scrape interval to 60 seconds in Grafana's Prometheus data source settings, as described at https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work/
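
The two options for aligning the intervals could be sketched as follows. Both fragments are assumptions about the deployment, not taken from the question: the first assumes the services run as Kubernetes containers and that the SDK reads OTEL_METRIC_EXPORT_INTERVAL (a value in milliseconds), the second assumes Grafana data sources are provisioned from a file, with a hypothetical data source name and URL. Apply one or the other, not both.

```yaml
# Option 1: container env on the instrumented service (pod spec fragment).
# The value is in milliseconds; the SDK default is 60000.
env:
  - name: OTEL_METRIC_EXPORT_INTERVAL
    value: "15000"
---
# Option 2: Grafana data source provisioning file. Keep the 60 s export
# interval and tell Grafana about it instead.
apiVersion: 1
datasources:
  - name: Prometheus                # hypothetical data source name
    type: prometheus
    url: http://prometheus-server   # assumed from the collector config
    jsonData:
      timeInterval: 60s             # the "Scrape interval" field in the UI
```

With the second option, $__rate_interval in Grafana panels is derived from the 60-second setting, so rate() windows should span enough samples to produce data.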