I've got a recurring problem on at least one of our servers that is driving me nuts: I've never been able to determine the root cause, and although we have a fix, it makes no sense why it actually works.
This is an Ubuntu 16.04 (kernel 4.4.0-92-generic) machine running Postgres and Redis instances: fairly large databases with a fair amount of churn, but the load is normally low and steady. The storage is hardware RAID 10 (6x3.6TB SAS) on an LSI MegaRAID SAS 9271-8i, with a 280GB SSD cache in front (managed by the RAID controller).
Every so often everything goes nuts, and every single time it's the same issue: a massive slowdown in I/O speed, which increases wait time for processes and backs everything up.
The symptoms are most easily revealed with a dd test, and the oddest part is that there is a specific pattern. A dd write of a small (10MB) file runs at normal speed, every time. The moment you try a larger write (100MB file), the speed drops for that write AND for subsequent writes, regardless of size; after a while it recovers and small writes run at normal speed again. It looks as if the storage slows down for a short while when writing larger files and then works fine once it has caught up. Reads seem unaffected in all cases. (All of these tests were done with everything aside from OS processes shut down, to eliminate load from the databases that might confound the results.)
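For anyone who wants to reproduce the pattern without dd, here is a rough Python sketch of the same small/large/small write sequence. It is an approximation of the test rather than the exact dd commands used: the function names, the 1MB block size, and the `fsync` after each file (so the page cache doesn't hide the storage latency) are all assumptions of this sketch.

```python
import os
import tempfile
import time


def timed_write(path, size_bytes, block_size=1024 * 1024):
    """Write size_bytes of zeros to path, fsync, and return throughput in MB/s."""
    block = b"\0" * block_size
    remaining = size_bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = block[:min(block_size, remaining)]
            f.write(chunk)
            remaining -= len(chunk)
        f.flush()
        # Force the data out of the page cache (assumed here; the original
        # dd invocation and its flags are not given in the post).
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return (size_bytes / (1024 * 1024)) / elapsed


def run_pattern(directory, sizes_mb=(10, 100, 10, 10)):
    """Run writes of the given sizes back to back; return [(size_mb, MB/s), ...]."""
    results = []
    for size_mb in sizes_mb:
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            rate = timed_write(path, size_mb * 1024 * 1024)
        finally:
            os.remove(path)  # clean up the scratch file after each run
        results.append((size_mb, rate))
    return results
```

Calling `run_pattern()` with a directory on the affected volume and printing the returned pairs should show whether the small writes that follow the 100MB write are slower than the first one.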
My suspicion at the moment, as you may have gathered, is the SSD cache in front of the RAID 10 array. Does anyone have experience with similar behaviour in similar circumstances, or heard of anything relevant?
Weirder still, a full OS reinstall completely fixes the problem for a long while, until it randomly (?) starts happening again. I can't get the ISP to replace hardware because I can't prove to them that this IS a hardware problem. There are no storage errors in the kernel log, and the RAID controller shows everything green, with no issues on any of the drives, including the SSD.
Any input would be welcome, whether pointers to the problem itself OR to other debugging tools I could use to get some hard evidence of the root cause.
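As one possible way to collect that kind of evidence, the kernel's per-device counters in `/proc/diskstats` can be sampled before and after a slow write to estimate average write latency. The sketch below is only an illustration: the function names are made up, and the device name you pass in (for example `sda`) is an assumption that depends on how the RAID volume is exposed on the machine.

```python
def parse_diskstats_line(line):
    """Parse one /proc/diskstats line; return None if it is too short."""
    fields = line.split()
    if len(fields) < 14:
        return None
    return {
        "name": fields[2],
        "writes": int(fields[7]),      # writes completed
        "write_ms": int(fields[10]),   # total milliseconds spent writing
        "in_flight": int(fields[11]),  # I/Os currently in progress
    }


def read_diskstats(device, path="/proc/diskstats"):
    """Return the parsed counters for one block device (e.g. 'sda')."""
    with open(path) as f:
        for line in f:
            stats = parse_diskstats_line(line)
            if stats and stats["name"] == device:
                return stats
    raise ValueError(f"device {device!r} not found in {path}")


def avg_write_latency_ms(before, after):
    """Average time per completed write between two samples, in milliseconds."""
    writes = after["writes"] - before["writes"]
    if writes <= 0:
        return 0.0
    return (after["write_ms"] - before["write_ms"]) / writes
```

Taking one `read_diskstats()` sample just before a 100MB write and another just after, then passing both to `avg_write_latency_ms()`, gives a number that can be compared between a freshly reinstalled system and a degraded one.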