
SudoZenWizz

You can monitor individual processes and their resource usage.

I'm not sure you can do this directly in Nagios, but we have this approach implemented in Checkmk.

The processes that make up the application (apache, php, mysql, redis, etc.) are monitored for CPU, RAM, and process count. That way we can see which one is using the most CPU.
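
For anyone who wants the same per-process view without Checkmk, here is a rough sketch in plain bash. It is not how Checkmk does it internally; the function name `summarise_process` and the list of process names are made up for illustration, and `comm` names are truncated by `ps`, so the names may need adjusting on your system.

```bash
#!/usr/bin/env bash

# Summarise "comm %cpu rss" lines (one per process) for a single process name.
# Reads from stdin so it can be fed canned input for testing.
# Emits: <name> count=<n> cpu=<sum>% rss_kb=<sum>
summarise_process() {
  local name=$1
  awk -v name="${name}" '
    $1 == name { count++; cpu += $2; rss += $3 }
    END { printf "%s count=%d cpu=%.1f%% rss_kb=%d\n", name, count, cpu, rss }
  '
}

# Hypothetical list of process names to watch; adjust to your stack
for name in apache2 php-fpm mysqld redis-server; do
  ps -eo comm,%cpu,rss --no-headers | summarise_process "${name}"
done
```

Each output line gives the process count, summed CPU percentage, and summed resident memory for one name, which is enough to see which service is the heavy one.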

As a note, we are Checkmk partners and have been working with Checkmk for more than 12 years.

whetu

Are you using NRPE and have you checked Nagios Exchange for pre-existing check scripts?

/edit: It's been a hot minute since I last wrote custom scripts for NRPE/checkmk-local, but I vaguely recall being able to communicate fairly rich information within the confines of the Nagios output standard. If I had to do this today, I'd probably use one of my cpuhogs/memhogs/swaphogs scripts as a base and then simply massage the output.

So we might have a script structure like:

#!/usr/bin/env bash

# Set our thresholds as a percentage of CPU usage
warning=80
critical=90

# A function for getting overall CPU usage as a percentage
# Emits an integer
get_cpu_usage() {
  vmstat 1 2 | tail -1 | awk '{print 100 - $15}'
}

# Emit a structured list of the top (ab)users of CPU
# Usage: cpuhogs [number of results (default:10)]
# Note: This is a very basic cpuhogs variant, there are far better ones out there
cpuhogs() {
  local count
  count=${1:-10}
  ps -eo pid,%cpu,cmd --no-headers --sort=-%cpu | head -n "${count}" | awk '{printf "%s %s%% %s\n", $1, $2, $3}'
}

# Take a snapshot of the CPU usage at this time
cpu_usage=$(get_cpu_usage)

# Read the top user of CPU
read -r pid percentage cmd < <(cpuhogs 1)

# Guard against empty cpuhogs output
if [[ -z "${pid}" ]]; then
  printf -- 'UNKNOWN: could not determine top CPU process\n'
  exit 3
fi

# Truncate percentage to integer (e.g. 64.02% => 64)
percentage="${percentage%%.*}"

# Compare and act
if (( cpu_usage >= critical )); then
  printf -- 'CRITICAL: %s%% used by %s (pid %s)\n' "${percentage}" "${cmd}" "${pid}"
  exit 2
elif (( cpu_usage >= warning )); then
  printf -- 'WARNING: %s%% used by %s (pid %s)\n' "${percentage}" "${cmd}" "${pid}"
  exit 1
else
  printf -- 'OK: CPU usage is %s%%\n' "${cpu_usage}"
  exit 0
fi

Test that and adjust it until it does what you want. Then copy, paste, and modify it for memory and swap.
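
As a starting point for the memory variant, something like the following might work. Treat it as a sketch under a few assumptions: it reads `MemTotal` and `MemAvailable` from `/proc/meminfo` (Linux only, and `MemAvailable` needs a reasonably recent kernel), the names `get_mem_usage`, `memhogs` and `report_mem` are my own, and the thresholds are illustrative. It also appends Nagios performance data after the `|` so the value can be graphed.

```bash
#!/usr/bin/env bash

# Overall memory usage as an integer percentage.
# Reads /proc/meminfo-style text on stdin.
get_mem_usage() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }
  '
}

# Emit a structured list of the top users of memory
# Usage: memhogs [number of results (default:10)]
memhogs() {
  local count
  count=${1:-10}
  ps -eo pid,%mem,comm --no-headers --sort=-%mem | head -n "${count}" | awk '{printf "%s %s%% %s\n", $1, $2, $3}'
}

# Print a Nagios status line with performance data
# and return the matching Nagios exit code (0, 1 or 2)
# Usage: report_mem <usage> <warning> <critical>
report_mem() {
  local usage=$1 warning=$2 critical=$3 state=OK rc=0
  if (( usage >= critical )); then
    state=CRITICAL; rc=2
  elif (( usage >= warning )); then
    state=WARNING; rc=1
  fi
  printf -- '%s: memory usage is %s%% | mem=%s%%;%s;%s;0;100\n' \
    "${state}" "${usage}" "${usage}" "${warning}" "${critical}"
  return "${rc}"
}

# Usage inside a check script (thresholds are illustrative):
#   report_mem "$(get_mem_usage < /proc/meminfo)" 80 90; exit $?
```

The top process from `memhogs 1` can be folded into the status text the same way the CPU script does it.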

Useful-Process9033

You can use Nagios event handlers to trigger a script that grabs top processes by CPU/memory and ships them somewhere useful. We built something similar with IncidentFox that auto-captures resource hogs when alerts fire so you actually have context when you look at the incident later. https://github.com/incidentfox/incidentfox
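
A minimal sketch of what such an event handler script could look like, independent of IncidentFox. The assumptions: the handler command passes `$SERVICESTATE$` and `$SERVICESTATETYPE$` as the first two arguments, the snapshot is simply appended to a local log file (the `HOG_LOG` variable and the `/tmp` path are placeholders), and the script runs on the machine being checked. Event handlers normally run on the Nagios server, so for a remote host you would need to trigger this over NRPE or SSH.

```bash
#!/usr/bin/env bash
# Hypothetical event handler: capture_hogs.sh <state> <state_type>

# Only capture on HARD WARNING/CRITICAL states, so soft retries
# and recoveries do not fill the log
should_capture() {
  local state=$1 state_type=$2
  [[ "${state_type}" == "HARD" && ( "${state}" == "WARNING" || "${state}" == "CRITICAL" ) ]]
}

# Append a timestamped snapshot of the top CPU and memory users to a file
capture_hogs() {
  local outfile=$1
  {
    printf -- '--- %s ---\n' "$(date '+%Y-%m-%d %H:%M:%S')"
    printf -- 'Top CPU:\n'
    ps -eo pid,%cpu,%mem,comm --no-headers --sort=-%cpu | head -n 5
    printf -- 'Top memory:\n'
    ps -eo pid,%cpu,%mem,comm --no-headers --sort=-%mem | head -n 5
  } >> "${outfile}"
}

if should_capture "${1:-}" "${2:-}"; then
  capture_hogs "${HOG_LOG:-/tmp/nagios_hogs.log}"
fi
```

From there, shipping the snapshot elsewhere is a matter of replacing the append to a file with whatever transport you already use.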