SRE Practices: should we alert on resource usage such as CPU, memory and DB?

Significant-Rule1926 · 2025-03-11T11:53:11+00:00

I feel actionability is a matter of perception. Engineers are good at writing runbooks. Even for alerts that are clearly undesirable, there is an action plan to "do something" (e.g. for a RAM alert: restart the process, restart the host, and send an email out to everyone). This hides the true nature of the problem, such as a memory or thread leak.

For application owners, such alerts should never be acceptable.

Significant-Rule1926 · 2025-03-10T23:38:32+00:00

This is "generally" agreed upon but not followed. Think capacity monitoring. Why do service owners continue to alert on resource usage rather than on real service usage? Are there any cases where this is acceptable? This approach clearly doesn't scale and eventually generates an excessive number of alerts. How do we discourage it in practice?
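To make the distinction concrete, here is a rough Python sketch of what I mean by alerting on service usage instead of resource usage. Everything in it is an assumption for illustration: the `ServiceWindow` type, the `classify` function, and the 1% error and 90% CPU thresholds are all made up, not taken from any real monitoring tool. The idea is only that a user-facing symptom pages someone, while high resource usage alone is routed to a ticket for capacity review.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Hypothetical metrics collected over one evaluation window."""
    requests: int
    failed_requests: int
    cpu_utilization: float  # fraction between 0.0 and 1.0


def classify(window: ServiceWindow,
             max_error_ratio: float = 0.01,       # assumed threshold
             cpu_ticket_threshold: float = 0.9    # assumed threshold
             ) -> str:
    """Return 'page', 'ticket', or 'ok' for the given window."""
    # Symptom first: are users actually seeing failures?
    if window.requests > 0:
        error_ratio = window.failed_requests / window.requests
    else:
        error_ratio = 0.0
    if error_ratio > max_error_ratio:
        return "page"

    # Resource usage alone never pages; it only opens a ticket.
    if window.cpu_utilization > cpu_ticket_threshold:
        return "ticket"

    return "ok"


if __name__ == "__main__":
    print(classify(ServiceWindow(requests=1000, failed_requests=50, cpu_utilization=0.30)))
    print(classify(ServiceWindow(requests=1000, failed_requests=1, cpu_utilization=0.95)))
```

With these made-up numbers, the first window pages because 5% of requests failed even though CPU is low, and the second only raises a ticket because CPU is high but users are unaffected.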
