I have a Java service instrumented with the OpenTelemetry Java agent. The agent sends traces and metrics to an OpenTelemetry Collector, and the Collector forwards the data to Datadog.
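For context, the Collector side of this pipeline looks roughly like the sketch below. This is an assumed, minimal configuration, not my exact one: the API-key handling and exporter options are placeholders.

```yaml
# Minimal Collector config sketch (assumed): receive OTLP from the Java agent
# and forward traces + metrics to Datadog.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}   # Datadog API key taken from the environment

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog]
    metrics:
      receivers: [otlp]
      exporters: [datadog]
```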
The Java service has a counter metric with fairly high cardinality (~800 tag combinations). Some tag combinations are incremented only a few times a day, yet I suspect those combinations keep reporting a count of 0 every x seconds, which keeps the "cardinality per hour" of the metric higher than necessary.
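To make the setup concrete, the counter is recorded roughly like this; the metric name and tag keys are made up for illustration:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class OrderMetrics {
    // Meter obtained from the SDK installed by the Java agent.
    private static final Meter METER = GlobalOpenTelemetry.getMeter("my-java-service");

    // Counter whose tag combinations (~800) drive the cardinality.
    private static final LongCounter ORDERS = METER
            .counterBuilder("orders.processed")
            .setDescription("Number of processed orders")
            .build();

    public static void recordOrder(String country, String paymentMethod, String status) {
        // Each distinct (country, payment_method, status) tuple is a separate time series.
        ORDERS.add(1, Attributes.of(
                AttributeKey.stringKey("country"), country,
                AttributeKey.stringKey("payment_method"), paymentMethod,
                AttributeKey.stringKey("status"), status));
    }
}
```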
I found this interesting message on the internet related to StatsD:
It's the interaction of two separate parts - flush interval and metric expiry.
The flush interval is how often metrics are aggregated and sent upstream, and is 1 second by default. The expiry interval is how long a metric needs to receive no data before it stops being sent upstream.
If you were to not send any metrics for 30 seconds, you would see your last value, 29 zeros, and then it would stop sending.
source: https://github.com/atlassian/gostatsd/issues/296#issuecomment-595669574
I was wondering whether this "metric expiry" configuration/mechanism is available in OpenTelemetry. I tried to find a reference to it in the documentation but did not succeed.
Thank you
For info, I got great support from Datadog and we ended up solving this.
Hypotheses confirmed
The two hypotheses described in the original question were confirmed by Datadog.
Solution
By default, the OpenTelemetry SDK produces metrics with cumulative temporality, but Datadog works best with delta temporality.
The fix is simply to switch the temporality to delta by setting the environment variable OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE to DELTA.
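For example, with the Java agent this can be set at launch time, either as an environment variable or as the equivalent system property. Paths, jar names, and the service name below are placeholders:

```bash
# Environment variable picked up by the OpenTelemetry SDK autoconfiguration
export OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=DELTA

java -javaagent:/path/to/opentelemetry-javaagent.jar \
     -Dotel.service.name=my-java-service \
     -jar my-service.jar

# Equivalent system property form:
# -Dotel.exporter.otlp.metrics.temporality.preference=delta
```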
Reference: Producing Delta Temporality Metrics with OpenTelemetry