google-cloud-functions, policy, stackdriver, alerts

Set Stackdriver alerts for specific error messages


I cannot find a clean way to set up Stackdriver alert notifications for errors in Cloud Functions.

I am using a cloud function to process data into cloud data store. There are two types of errors that I want to be alerted on:

  1. Technical exceptions that might cause the function to 'crash'
  2. Custom errors that we log from the cloud function

I set up the alerting policy shown below, following the answer to the question "How to create alert per error in Stackdriver".

The first time the condition triggers, I receive an email. However, on subsequent triggers, let's say on the next day, I don't. Also, the incident stays in the 'opened' state.

Resource type: cloud function
Metric: from point 2 above
Aggregation: Aligner: count, Reducer: None, Alignment period: 1m
Configuration: Condition triggers if: Any time series violates, Condition: is above, Threshold: 0.001, For: 1 min

So I have 3 questions:

  1. Is this the right way to satisfy my requirement of creating alerts?

  2. How can I still receive alert notifications for subsequent errors?

  3. How can I set the incident to 'resolved', either automatically or manually?


Solution

  • Normally, alerts resolve themselves once the alerting policy stops firing. The problem you're having with your alerts not resolving is because your metric only writes non-zero points - if there are no errors, it doesn't write zero. That means that the policy never gets an unambiguous signal that everything is fine, so the alerts just sit there (they'll automatically close after 7 days, but I imagine that's not all that useful for you).

    This is a common problem and it's a tricky one to solve. One possibility is to write your policy as a ratio of errors to something non-zero, like request count. As long as the request count is non-zero, the ratio will compute to zero when there are no errors, and so an alert on the ratio will automatically resolve. You need to be a bit careful about rounding errors, though - if your request count is high enough, you might miss a single error because the ratio could round to zero.

    Aaron Sher, Stackdriver engineer
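
To make the reasoning in the answer concrete, here is a small toy model in Python. It is only a sketch and not how Stackdriver evaluates policies internally: the function names (`incident_states`, `error_ratio`) and the per-minute sample data are made up for illustration, and a missing point is represented by `None`. The threshold of 0.001 is the one from the question. It shows that a metric which only writes non-zero points never gives the policy a reason to close the incident, while a ratio against request count does.

```python
from typing import List, Optional, Sequence

THRESHOLD = 0.001  # threshold value taken from the condition in the question


def incident_states(points: Sequence[Optional[float]],
                    threshold: float = THRESHOLD) -> List[str]:
    """Toy threshold policy. None means no point was written for that minute."""
    state = "closed"
    states = []
    for value in points:
        if value is None:
            # No data: no signal either way, so the incident state is unchanged.
            pass
        elif value > threshold:
            state = "open"
        else:
            # An explicit point at or below the threshold resolves the incident.
            state = "closed"
        states.append(state)
    return states


def error_ratio(errors: Sequence[Optional[int]],
                requests: Sequence[int]) -> List[Optional[float]]:
    """Errors divided by requests per minute; a missing error point counts as 0."""
    ratios: List[Optional[float]] = []
    for err, req in zip(errors, requests):
        if not req:
            ratios.append(None)  # no requests: the ratio is undefined
        else:
            ratios.append((err or 0) / req)
    return ratios


# Hypothetical per-minute data: the error metric only writes non-zero points.
errors = [None, 1, None, None]
requests = [120, 95, 110, 130]

sparse_states = incident_states(errors)
ratio_states = incident_states(error_ratio(errors, requests))

print(sparse_states)  # ['closed', 'open', 'open', 'open']
print(ratio_states)   # ['closed', 'open', 'closed', 'closed']

# The caveat from the answer: one error among many requests gives a tiny ratio.
busy_ratio = error_ratio([1], [5000])[0]
print(busy_ratio > THRESHOLD)  # False
```

The last line illustrates the caveat in a simplified form: with 5000 requests, a single error gives a ratio of 0.0002, which stays under the 0.001 threshold and would not open an incident. The actual rounding behaviour of the service is not modelled here, so the threshold for a ratio-based policy would need to be chosen with the expected request volume in mind.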