
Combining two metrics with different resource types into one using MQL


How can I combine two metrics into one, when they come from two separate resources?


I have a simple logs-based metric, my-metric, defined like this:

resource "google_logging_metric" "my_metric" {
  name = "my-metric"

  filter = <<-EOT
             logName="projects/my-project/logs/my-app"
             labels.name="my.event"
           EOT

  label_extractors = {
    "event_type" = "EXTRACT(labels.event_type)"
  }

  metric_descriptor {
    value_type  = "INT64"
    metric_kind = "DELTA"

    labels {
      key        = "event_type"
      value_type = "STRING"
    }
  }
}

I recently moved my app to Google Cloud Run (GCR), which has its own logs, so I updated the metric's filter like this:

(
  logName="projects/my-project/logs/my-app"
  OR
  logName="projects/my-project/logs/run.googleapis.com%2Fstdout"
)
labels.name="my.event"

What I didn't expect is that the metric becomes attached to a different resource, so logically I have two metrics. In MQL:

  1. gce_instance::logging.googleapis.com/user/my-metric
  2. global::logging.googleapis.com/user/my-metric

I want to keep my existing alerting policies that are based on this metric, so I'm wondering if there's a way to combine the metrics from the global and GCE instance resources into one metric (for example, I would group by event_type and add them up).


Trial and error

I have tried to just get them merged into one graph in the metrics explorer.

1. Denial

I have almost exclusively used a single log and the global resource before, so my intuition was to simply do this:

fetch global::logging.googleapis.com/user/my-metric

This would only get me half of the values though. I realized I'd get the other half like this:

fetch gce_instance::logging.googleapis.com/user/my-metric

2. Anger

Ok, let's just combine them. I know enough MQL to be a danger to myself and others (or so I thought).

{
    fetch global::logging.googleapis.com/user/my-metric
    ;
    fetch gce_instance::logging.googleapis.com/user/my-metric
}
| outer_join 0
| add

That only shows the global resource. It happens to be the first table, so my intuition is to swap them around; sometimes that gives more information. (I find the MQL reference very abstract, and I have mostly learned by copy-pasting examples and by trial and error.) Putting the gce_instance first throws two errors instead:

Line 8: Input table 1 does not have time series identifier column 'resource.instance_id' that is present in table 0. Table 0 must be a subset of the time series identifier columns of table 1.
Line 8: Input table 1 does not have time series identifier column 'resource.zone' that is present in table 0. Table 0 must be a subset of the time series identifier columns of table 1.

I don't really need instance_id or zone, so perhaps I can just remove them?

{
    fetch gce_instance::logging.googleapis.com/user/my-metric
    | map drop [resource.zone, resource.instance_id]
    ;
    fetch global::logging.googleapis.com/user/my-metric
}
| outer_join 0
| add

And now it's only the gce_instance resource. For reference, here's what it looks like:

Only the global resource: [image]

Only the gce_instance resource: [image]

What I would like (kind of): [image]

3. Bargaining

join

I'm sure MQL is beautiful once you fully grasp it, but to me it's still a black box. Here are a few other attempts. I basically went through the MQL reference, trying every keyword I could find:

{
    fetch gce_instance::logging.googleapis.com/user/my-metric
    | map drop [resource.zone, resource.instance_id]
    ;
    fetch global::logging.googleapis.com/user/my-metric
}
| join

No data is available for the selected time frame

Don't know what that means. Next!

join and group_by

{
    fetch gce_instance::logging.googleapis.com/user/my-metric
    | map drop [resource.zone, resource.instance_id]
    ;
    fetch global::logging.googleapis.com/user/my-metric
}
| group_by [metric.event_type], max(val())
| join

No data is available for the selected time frame

Useless... NEXT!

union_group_by

{
    fetch gce_instance::logging.googleapis.com/user/my-metric
    | map drop [resource.zone, resource.instance_id]
    ;
    fetch global::logging.googleapis.com/user/my-metric
}
| union_group_by [metric.event_type]

Chart definition invalid. INVALID_ARGUMENT: Request contains an invalid argument.

That's very helpful, thanks. NEXT!

outer_join or_else

The outer_join in my first attempt at least seemed to give two tables with values. Maybe I just need to combine them?

{
    fetch gce_instance::logging.googleapis.com/user/my-metric
    | map drop [resource.zone, resource.instance_id]
    ;
    fetch global::logging.googleapis.com/user/my-metric
}
| outer_join 0
| or_else

Very interesting. I now get a bunch of different time series, grouped by event_type. They are all flatlining at 0 though. Changing to outer_join 123? Yes, they are now all constantly 123 instead.

The outer_join docs have this to say:

One or both of the left_default_value and right_default_value arguments must be given. Each corresponds to one input table (the first, "left", table or the second "right" table) and, when given for a table, that table will have rows created if it does not have some row that matches a row in the other table. Each argument specifies the value columns of the created row. If a default argument is given for a table, then the time series identifier columns in that table must be a subset of the time series of those of the other table and it can only have Delta time-series kind if the other table has Delta time-series kind.

I found this part vaguely interesting:

the time series identifier columns in that table must be a subset of the time series of those of the other table

Not sure what my time series identifier columns are. Perhaps they're just bad, but I'm not about to give up. What if they're not a subset? Perhaps I need to align, not aggregate? Did I mention that I don't know what I'm doing?
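Going by the two error messages earlier, the subset rule can be pictured with a toy Python check. This is only a sketch: the column sets are assumed from those error messages (plus `resource.project_id`, which is a guess), and `missing_columns` and `can_take_default` are made-up helpers, not part of MQL or the Monitoring API.

```python
# Toy model of the outer_join subset rule; not Cloud Monitoring code.
# The column sets are assumptions based on the error messages above.
GLOBAL_COLUMNS = frozenset({"resource.project_id", "metric.event_type"})
GCE_COLUMNS = frozenset({
    "resource.project_id",
    "resource.zone",
    "resource.instance_id",
    "metric.event_type",
})


def missing_columns(defaulted, other):
    """Identifier columns of the defaulted table that the other table lacks."""
    return sorted(defaulted - other)


def can_take_default(defaulted, other):
    """Hypothetical check: a table may be given a default only if its
    identifier columns are a subset of the other table's."""
    return not missing_columns(defaulted, other)


# global as the defaulted table: its columns are a subset, so no complaint.
print(can_take_default(GLOBAL_COLUMNS, GCE_COLUMNS))
# gce_instance as the defaulted table: two columns are absent from global.
print(missing_columns(GCE_COLUMNS, GLOBAL_COLUMNS))
```

Under these assumptions, the second call names exactly the two columns the errors complained about, which would explain why the swap failed only in one direction.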

Aligning

Aligning functions are use [not my typo] by the align table operation to produce an aligned table, one whose time series have points with timestamps at regular intervals.

I guess I need to invoke the align table operation with one of the aligning functions? Regular intervals sounds cool.
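One simple way to picture "points with timestamps at regular intervals" is to sum irregular points into fixed windows. The Python below is a toy sketch of that idea only; `align_delta`, the window-end convention, and the sample points are all invented for illustration and are not how Cloud Monitoring necessarily computes it.

```python
import math
from collections import defaultdict


def align_delta(points, period):
    """Sum (timestamp, value) points into windows of `period` seconds.

    Each output timestamp is the end of the window the point fell in,
    so the result only has points at regular intervals.
    """
    buckets = defaultdict(int)
    for ts, value in points:
        window_end = math.ceil(ts / period) * period
        buckets[window_end] += value
    return dict(sorted(buckets.items()))


# Invented event counts at irregular timestamps (in seconds).
raw = [(7, 1), (42, 2), (61, 1), (130, 4)]
print(align_delta(raw, 60))
```

Two series aligned this way with the same period end up with timestamps that can line up row by row, which is presumably what a join needs.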

The aggregation docs have a section about aligners as well:

{
    fetch gce_instance::logging.googleapis.com/user/my-metric
    | map drop [resource.zone, resource.instance_id]
    ;
    fetch global::logging.googleapis.com/user/my-metric
}
| align interpolate(10m)
# | group_by [metric.event_type], sum(val())
| outer_join 0
| add

Interpolation doesn't give me the missing data. This one gives me the global resource, but with a bit of interpolation where it doesn't have any data. This feels like a dead end as well.

I threw in a group_by as well just in case, no change.

4. Depression

I'm starting to get slightly frustrated now. I have data in two tables, but no matter what I do I can only see the data in one of them. I've combined time series in various ways with MQL before, and once it works I can usually explain why. It gets trickier when it doesn't work.

Perhaps we can get back to first principles somehow? I know group_by [] clears the labels, maybe that would simplify things?

{
    fetch gce_instance::logging.googleapis.com/user/my-metric
    ;
    fetch global::logging.googleapis.com/user/my-metric
}
| group_by []

Line 1: Expect query to have 1 result but had 2.

Ouch. Adding a | union at the end?

Line 7: Input table 0 has legacy target schema 'cloud.CloudTask' which is different from input table 1's legacy target schema 'cloud.Global'. The inputs to the 'union' table operation are required to have the same column names, column types, and target schemas.

That's a new one! "Target schema" huh? Perhaps that's been the issue all along?

Let's consult the trusty reference! Schema... schema? No mention of schemas.

The examples perhaps? No, but it says "before you begin". I've read it before, but perhaps I missed something?

Some familiarity with Cloud Monitoring concepts including metric types, monitored-resource types, and time series is helpful. For an introduction to these concepts, see Metrics, time series, and resources.

But no, the "Metrics, time series, and resources" page doesn't mention legacy target schemas either, or even schemas in general. Neither does the Components of the metric model or the Notes on terminology pages.

Am I at another dead end? A quick Google search seems to indicate that it is.

Other attempts

5. Acceptance

I have tried everything I can think of and read through most of the documentation a few times.

Writing this question, I found [an exciting answer](https://stackoverflow.com/a/67098846/98057) and tried it with my metrics:

{
    fetch gce_instance
    | metric 'logging.googleapis.com/user/my-metric'
    | group_by [], sum(val())
    | align rate(1m)
    | every 1m
    ;
    fetch global
    | metric 'logging.googleapis.com/user/my-metric'
    | group_by [], sum(val())
    | align rate(1m)
    | every 1m
}
| join
| add

No data is available for the selected time frame

I have of course verified that at least one of the "subqueries" returns some data; in this case it's this one:

fetch gce_instance
| metric 'logging.googleapis.com/user/my-metric'
| group_by [], sum(val())
| align rate(1m)
| every 1m

How can I combine these two metrics from two separate resource types into one using MQL?


Solution

  • Here's the solution from GCP's support:

    {
        fetch gce_instance
        | metric 'logging.googleapis.com/user/my-metric'
        | group_by [], sum(val())
        | align rate(1m)
        | every 1m
        ;
        fetch global
        | metric 'logging.googleapis.com/user/my-metric'
        | group_by [], sum(val())
        | align rate(1m)
        | every 1m
    }
    | outer_join 0,0
    | add
    

    I tried both outer_join(0,0) (syntax error) and outer_join 0, but outer_join 0,0 did what it's supposed to: it adds a default value to both tables. Obvious, once you see it.
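To see why the second default matters, here is a toy Python model of an outer join over two tables keyed by timestamp. It follows the wording of the docs quoted in the question (a table that is given a default gets rows created where it has no match) and is only a sketch: the `outer_join` and `add` functions, the convention that `None` means "no default given", and the sample points are all invented for illustration and say nothing about how Cloud Monitoring implements it.

```python
# Toy model of outer_join defaults; illustrative only, not MQL's implementation.

def outer_join(left, right, left_default=None, right_default=None):
    """Join two {timestamp: value} tables.

    A side that has a default gets a row created wherever only the
    other side has one. None means "no default given" for that side.
    """
    joined = {}
    for ts in sorted(set(left) | set(right)):
        left_val = left.get(ts, left_default)
        right_val = right.get(ts, right_default)
        if left_val is None or right_val is None:
            continue  # unmatched row with no default to fill it: dropped
        joined[ts] = (left_val, right_val)
    return joined


def add(joined):
    """Sum the two value columns of each joined row, like `| add`."""
    return {ts: a + b for ts, (a, b) in joined.items()}


# Invented sample points: the tables overlap only at timestamp 2.
gce = {1: 5, 2: 3}
glob = {2: 4, 3: 7}

print(add(outer_join(gce, glob, left_default=0)))
print(add(outer_join(gce, glob, left_default=0, right_default=0)))
```

In this model, a single default drops the point that exists only in the left table, while defaults on both sides keep every point from either table in the sum.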