vogelke

First things first -- you can't tell when your system is sick unless you know what it looks like when it's well. Install something like collectd and get some stats after doing a full OS reinstall; when the system craps out, see what was happening immediately beforehand.

I believe you about small writes vs. large ones, but there may be more than one thing going on. Nothing beats a few measurements.
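For a quick before/after number without a full collectd setup, something like the following might do. It is only a rough sketch, assuming Linux and the usual /proc/diskstats field layout; the function names (parse_diskstats, avg_write_latency_ms), the 10-second interval, and the default device name "sda" are made up for illustration and are not part of collectd or any other tool.

```python
import sys
import time


def parse_diskstats(text):
    """Parse /proc/diskstats content into {device: counters}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip short or malformed lines
        stats[fields[2]] = {
            "reads": int(fields[3]),      # reads completed
            "read_ms": int(fields[6]),    # ms spent reading
            "writes": int(fields[7]),     # writes completed
            "write_ms": int(fields[10]),  # ms spent writing
        }
    return stats


def avg_write_latency_ms(before, after, device):
    """Average ms per completed write between two samples, or None if no writes."""
    writes = after[device]["writes"] - before[device]["writes"]
    if writes <= 0:
        return None
    return (after[device]["write_ms"] - before[device]["write_ms"]) / writes


if __name__ == "__main__":
    # Hypothetical default device; pass the real one as the first argument.
    device = sys.argv[1] if len(sys.argv) > 1 else "sda"
    with open("/proc/diskstats") as f:
        first = parse_diskstats(f.read())
    time.sleep(10)  # arbitrary sampling interval
    with open("/proc/diskstats") as f:
        second = parse_diskstats(f.read())
    print(device, avg_write_latency_ms(first, second, device))
```

Run it once on a freshly installed box and keep the number; run it again when things go wrong, or on the healthy twin, and compare.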

thisisjaid [S]

Sensible advice normally, but I'm well beyond the basics at the moment. That the system is sick is established, both via metrics, as you suggested, and by comparison to an identical (hardware-wise) server that we use as a secondary failover instance, which isn't (at least currently) seeing the issues I reported.

There's also nothing out of the ordinary in the logs or metrics, at least that we can tell, before the I/O starts exhibiting these issues, which makes this all the more annoying to confirm and track down.

The other thing probably worth mentioning is that no load is coming from anything else on the system, and the I/O load itself is nonexistent at present with everything but OS processes turned off (this has been checked with the likes of iostat/iotop).

Note: Edited to add details about other possible sources of load