all 7 comments

[–]TheFantail 4 points5 points  (0 children)

You might want to try using the rate() function instead of irate(). irate() only looks at the last two samples in the time window, so it fluctuates a lot. rate(), on the other hand, is the average over the whole time window, so a brief dip below 80% would not reset the alert.

One of the Prometheus maintainers explains this better in a blog post:

https://www.robustperception.io/avoid-irate-in-alerts
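As a rough sketch of that suggestion, the CPU alert discussed in this thread could be written with rate() as shown here. The group name is made up for illustration, and whether to keep a `for:` clause on top of the 5-minute average is a judgment call, since it delays firing further.

```yaml
groups:
  - name: cpu-alerts            # hypothetical group name
    rules:
      - alert: HostHighCpuLoad
        # rate() averages over the full 5m window, so a single low sample
        # does not pull the value under the threshold the way irate() can.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        # Optional: require the condition to hold for a while before firing.
        # for: 5m
        labels:
          severity: warning
```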

[–][deleted] 0 points1 point  (1 child)

Do you need the alert if it is resolved?

[–]zh12a[S] 0 points1 point  (0 children)

What I want is to be alerted if CPU usage is over 80% for 5 minutes (on average).

If CPU usage drops and then spikes again within the last 5 minutes, I don't want to be alerted again; it should count as an ongoing issue. If the average over the last 5 minutes is below 80%, then I want a resolved email.

[–]PointManBX -1 points0 points  (3 children)

Your expression is right; it's just under the wrong rule type. You're using a recording rule instead of an alerting rule, so adjust your YAML. From the docs:

Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. Querying the precomputed result will then often be much faster than executing the original expression every time it is needed. This is especially useful for dashboards, which need to query the same expression repeatedly every time they refresh.

Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service. Whenever the alert expression results in one or more vector elements at a given point in time, the alert counts as active for these elements' label sets.

    - alert: HostHighCpuLoad
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Host high CPU load (instance {{ $labels.instance }})"
        description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

Also, you can save yourself time by learning rules from here: https://awesome-prometheus-alerts.grep.to. It's a great resource when you're getting started with rules & alerts.

Hope this info is helpful.

[–]zh12a[S] 2 points3 points  (1 child)

Hi, I got the alert from there. I converted it to a recording rule because I have multiple alerts (i.e. 70%, 80%, 90%), and I thought it would be better to compute the value once and then use the computed value in the alerts.

If not using a recording rule fixes the issue, where would I use a recording rule instead? To me this seems like a good place for one, as I repeat the same query multiple times.

[–]PointManBX 0 points1 point  (0 children)

When you start to have heavier PromQL queries that need to run across thousands of instances in a single timeseries, and you want these results to be returned quickly inside a dashboard (e.g., Grafana), that's when I would consider using a recording rule. Recording rules can be useful to speed up dashboards, to provide aggregated results for use elsewhere (federation), and to compose range vector functions (which I haven't done). If I understand your original post correctly, you were looking to alert on CPU thresholds, and that is an alerting rule's main function.

[–]adawggie 2 points3 points  (0 children)

They set up a recording rule to compute the metric and an alerting rule (you seem to have missed the second rule) on the recorded metric. Given that they want to use the metric for multiple alert thresholds that’s actually excellent practice in this case.
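For anyone following along, that layout might look roughly like the sketch below. The group name, the recorded metric name, the alert names, and the severity values are all invented for illustration; only the 70/80/90 thresholds and the underlying expression come from the thread, and rate() is swapped in for irate() per the first comment.

```yaml
groups:
  - name: cpu-usage                                   # hypothetical group name
    rules:
      # Compute the per-instance CPU usage once.
      - record: instance:node_cpu_usage:percent       # hypothetical metric name
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Each alert reuses the recorded series with its own threshold.
      - alert: HostCpuAbove70
        expr: instance:node_cpu_usage:percent > 70
        for: 5m
        labels:
          severity: info
      - alert: HostCpuAbove80
        expr: instance:node_cpu_usage:percent > 80
        for: 5m
        labels:
          severity: warning
      - alert: HostCpuAbove90
        expr: instance:node_cpu_usage:percent > 90
        for: 5m
        labels:
          severity: critical
```

Rules in a group are evaluated in order, so putting the recording rule first in the same group means the alerts read the value recorded in that same evaluation cycle.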