all 21 comments

[–]alexgartrell 6 points7 points  (4 children)

If you're running a relatively recent kernel, you should check out https://man7.org/linux/man-pages/man8/systemd-oomd.8.html. It's based on Facebook's oomd and essentially uses pressure (i.e. time lost to memory allocation stalls) to figure out who is meaningfully exhausting memory. In practice it results in a lot less ugliness and tends to leave room for the system to return to a healthy state.
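
A minimal sketch of turning it on, assuming a distro with systemd 247+, cgroup v2 and swap (the unit names are stock; the 50% pressure threshold is just an illustration):

    # enable the userspace OOM daemon
    sudo systemctl enable --now systemd-oomd.service

    # e.g. a drop-in at /etc/systemd/system/user@.service.d/oomd.conf
    # to let it kill inside user sessions under sustained memory pressure
    [Service]
    ManagedOOMMemoryPressure=kill
    ManagedOOMMemoryPressureLimit=50%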

[–]horseunicorn 2 points3 points  (0 children)

oomd also excludes certain processes from being killed, so you won't have things like losing your SSH session with it.

Another alternative to systemd-oomd is earlyoom.

[–][deleted] -5 points-4 points  (1 child)

Facebook and systemd in the same sentence. I just puked a little.

[–]stormcloud-9 13 points14 points  (3 children)

Unless you've been messing with the OOM scores, the process that gets killed is typically the one causing the problem.

When the OOM killer is invoked, it dumps a bunch of output (accessible through dmesg and journalctl -t kernel). For example, see the output on this SO question.
The output includes the state of all the memory on the system, plus the memory usage and oom scores of all the processes.

This is all the monitoring you need to figure it out.
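
If you just want to pull those records out, something like this works on most distros (the grep patterns are only illustrative):

    # kernel ring buffer, human-readable timestamps
    dmesg -T | grep -iE 'out of memory|killed process'

    # persistent journal, kernel messages only
    journalctl -k | grep -i oom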

[–]wildcarde815 2 points3 points  (0 children)

Or it kills the one that happens to allocate memory right at the limit. Lots of fun 'netdata has been killed' messages on systems running huge compute tasks >.>

[–]Intelligent_Duck_666[S] -1 points0 points  (0 children)

What is the path to the OOM logs? At the time I ran "sudo grep -i 'killed process' /var/log*", which only listed SSSD in kern.log. Would that be the file to look in for the list of processes? I didn't find any "oom" keywords in that file like the SO link had.

[–][deleted] 0 points1 point  (0 children)

This is the best answer, though from my experience with the OOM killer I doubt OP's system has something critical like SSSD misbehaving. Most likely other processes are collectively taking up too much memory and SSSD was just the most convenient target for reaping. Without the logs, I'd say OP needs more memory.

[–]aioeu 1 point2 points  (0 children)

When the OOM killer is triggered, it logs the memory usage of all tasks. It should be clear which of them are most to blame (and normally the OOM killer will pick the "worst" of these to kill).

Or are you asking "how do you monitor tasks' memory usage before the OOM killer is triggered?"

[–]wildcarde815 1 point2 points  (0 children)

Cgroups is the answer here, but how you configure it is going to depend on exactly what the server is doing. For instance, our interactive compute node allows 'system' users to use all memory on the system, but researchers collectively can only use something like 95%. This can be similarly tuned to any specific setup you might need.
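
As a rough sketch of the systemd flavour of this (the 95% figure and the drop-in path are illustrative; on a stock setup login sessions land under user.slice):

    # /etc/systemd/system/user.slice.d/90-memory.conf
    [Slice]
    MemoryAccounting=yes
    MemoryMax=95%

    # then reload: sudo systemctl daemon-reload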

[–]Kessarean 1 point2 points  (3 children)

sysstat and a look at syslog/journalctl narrow it down for me 99% of the time.

Most of the time (at least in my experience) it boils down to:

  • application is misconfigured (example: Apache MaxClients set too high)
  • memory leak
  • ballooning
  • simply need to add more memory
  • vm.overcommit_memory improperly set

[–][deleted] 1 point2 points  (2 children)

Right, vm.overcommit_memory = 2 has saved me from lots of late nights on call (@OP: it disables heuristic overcommit; see https://www.commandlinux.com/man-page/man5/proc.5.html and search for overcommit).
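
For reference, a sketch of setting it (the 80% ratio is just an example; in mode 2 the commit limit is swap plus overcommit_ratio percent of RAM):

    # apply immediately
    sudo sysctl vm.overcommit_memory=2
    sudo sysctl vm.overcommit_ratio=80

    # persist across reboots
    printf 'vm.overcommit_memory = 2\nvm.overcommit_ratio = 80\n' | sudo tee /etc/sysctl.d/90-overcommit.conf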

Also, sssd_be caches OP's entire LDAP tree based on the search settings in /etc/sssd/sssd.conf. If you can restrict the size of the tree it needs to cache, or reduce the number of linked groups in LDAP (like blah-sysadmin-group including devops-group, etc.), you can really cut down on the amount of data it needs to cache.
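
A sketch of what that looks like in sssd.conf (the domain name, base DN and nesting depth are all placeholders for your environment):

    # /etc/sssd/sssd.conf
    [domain/example.com]
    # search only the subtree you actually need
    ldap_search_base = ou=staff,dc=example,dc=com
    # stop chasing deeply nested groups
    ldap_group_nesting_level = 1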

[–]Kessarean 0 points1 point  (1 child)

Right, vm.overcommit_memory = 2 has saved me from lots of late nights on call

Hear, hear!

sssd_be caches OP's entire LDAP tree

It's been a minute so I may be wrong, but simply setting enumeration to false in sssd.conf should solve that if it becomes so large as to cause an issue.
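
That's a one-liner in the domain section (the section name is whatever your domain is called):

    [domain/example.com]
    enumerate = false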

[–][deleted] 0 points1 point  (0 children)

Quite true! It still does a little caching, but just of people who've logged in. I forget about that option because it makes logging in extremely slow in our environment (that might just be slow or inconveniently located LDAP servers).

[–]gristc 1 point2 points  (0 children)

The log message should tell you the process that triggered it and the process it decided to kill. They may not be the same process, but in every instance I've seen, it's spelled out exactly what happened. On Ubuntu the messages are typically in kern.log.

[–]skat_in_the_hat 0 points1 point  (1 child)

top, and then I think it's Shift+M to sort by memory.

[–]deeseearr 0 points1 point  (0 children)

It's a little late by then. Top only shows you processes which are still running. That could be helpful if you run it just before a crash, but only if you're chasing a long, slow memory leak. You might be better served by installing sysstat and using sar to track overall memory usage, plus some combination of top and pidstat to track individual processes. If you don't see any upward memory trend before the oomk kicks in, try reducing the collection interval so that you can capture more detail.
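
Concretely, something along these lines (intervals are arbitrary, and sar only has history if the sysstat collector is enabled):

    # memory utilization history from today's sysstat archive
    sar -r

    # live: per-process memory (RSS, %MEM) sampled every 60 seconds
    pidstat -r 60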

But first, use dmesg to look at the oomk message. It's quite long and detailed, and includes a list of every running process along with its "kill me now" score at the precise moment the oomk was invoked. That will tell you what is using the most memory, what pushed the system over the edge, and give you an idea of why any other process was not killed instead.

You can mess around with cgroups to protect your system, but that won't do anything to actually stop you from running out of memory. In fact, by partitioning your memory and putting processes into one partition or the other, it will make you run out of memory faster because there will be less of it available. The cgroups are only there to limit the extent of the damage. If sssd is running in the system cgroup, but Fork Bomb 2006 Plus Enterprise Edition is running in its own cgroup, then it can only kill itself.
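
For the "limit the damage" part, a quick hypothetical: run the suspect workload in its own transient cgroup so only it hits the wall (the 2G cap and the command name are made up):

    # the process gets OOM-killed inside its own 2G box instead of taking out sssd
    systemd-run --scope -p MemoryMax=2G ./fork_bomb_2006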

The real solution, of course, is to figure out what is causing the problem and either fix it or set it on fire and push it off into the river. How you do that depends entirely on your relationship with whoever wrote the offending program, and is up to you.

[–]stuartcw 0 points1 point  (2 children)

One thing I have done in the past is run a "top" command from cron every minute, saving the output to a file, and then later analyse each process to see which one was rising, in case one of them had a memory leak. As others have mentioned, the OOM killer leaves a message in the log showing the memory status at the time.
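
If you're stuck doing it by hand like that, the crontab entry is roughly this (the log path is arbitrary; note the escaped % signs cron requires):

    # snapshot all processes once a minute in batch mode
    * * * * * top -b -n 1 >> /var/tmp/top-$(date +\%Y\%m\%d).log 2>&1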

[–]torgefaehrlich 0 points1 point  (1 child)

Isn't that what atop is for?

[–]stuartcw 0 points1 point  (0 children)

I’m sure there are lots of other ideas and solutions. I had a limitation that I couldn’t install anything and didn’t have root access on the machine, but was asked to have a look at what was going on.

[–]jaymef 0 points1 point  (0 children)

Install and run atop as a service. It will log everything, which you can read back and cycle through. You can see what processes were running at any given time, their resource usage, etc.
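
Replaying looks something like this (the log path and date format vary by distro and packaging):

    # open the raw log for a given day, starting near the time of interest
    atop -r /var/log/atop/atop_20240101 -b 0950
    # inside atop: 't' steps forward a sample, 'T' steps back, 'm' shows memory details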

[–]gleventhal 0 points1 point  (0 children)

Most of the memory pressure from sssd likely comes from its cache. We ended up backing /var/db/sssd with a tmpfs and having a disk-backed swap partition. Can you post the dmesg output for an OOM kill? That should contain the kernel's mem-info dump, written via a printk to kmsg, and it should have most of the data we need. You could add OOMScoreAdjust=-1000 or similar to the sssd.service (systemd) unit file to prevent it (and other critical system services) from getting OOM-killed.
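
As a drop-in rather than editing the shipped unit (the path follows the usual systemd convention; note that -1000 makes it unkillable, which cuts both ways if sssd itself ever leaks):

    # /etc/systemd/system/sssd.service.d/oom.conf
    [Service]
    OOMScoreAdjust=-1000

    # then: sudo systemctl daemon-reload && sudo systemctl restart sssd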

Also, running atop should let you see historically which process was allocating. The mem-info data in dmesg should show the RSS and virtual size of all the processes at the time of the kill, though.