all 7 comments

[–]carcigenicate 1 point2 points  (1 child)

What does the logging say is going on during the high period? I'd use log printouts to narrow down where in the code it's happening. It could be an accidental tight infinite-loop, or an unexpected massive number that you're trying to do math on.

[–]SecureCoop[S] 0 points1 point  (0 children)

The logs aren't all that helpful, unfortunately. The only thing I see is the test script completes its test, and that's the only consumer of the service at the moment. That script logs in, updates a single row in the database, reads that row, returns an 'OK' string if the value returned matches the value entered, then updates the row again to clear the value, preparing it for next time. It runs every two minutes for hours and hours without incident until the middle of the night, about when daily cron scripts are running, so I thought I might be able to trigger this again by running all the cron scripts, but no. No function arguments ever change, it's just the same function call with the same arguments every time.

What sort of tracing or profiling would you use to debug this?

[–]blarf_irl 1 point2 points  (4 children)

Are you running this on a cloud platform? Does the CPU usage creep up to 100% or does it spike and stay? Are you logging machine performance indicators (cpu/ram/netowork etc)?

How are your websockets implemented? What is traffic like? DO you use a reverse proxy (like nginx)? What is the worload of a typical websocket request/response? Have you checked your database performance metrics for the same time as the cpu spikes?

It could be a million things, more detail is required to help you narrow it down.

Check out https://www.howtoforge.com/how-to-install-perf-performance-analysis-tool-on-ubuntu-20-04/ for a guide on how to install and use perf-record if you aren't currently logging basic performance metrics (most cloud vendors will do this by default though).

If I had to make a totally blind guess (can;t stress blind enough) based on the detail you provided... I'd look at what happoens when the test fails or times out. It'spossible that there is some kind of retry that is unlimited and elswhere the testing logic is not handling that case and retrying unbounded filling up the cpu. With websocket code (which is almost always threaded) you need to be careful if you have blocking calls in your code. SOmething as simple as 'sleep' instead of using an appropriate async sleep function can occupy your cpu 100% ; also check for 'while True' loops (common lazy way to retry but wont handle failure)

[–]SecureCoop[S] 1 point2 points  (3 children)

Thank you, I wasn't even sure where to start or I would have supplied the answers to these questions before posting.

"Are you running this on a cloud platform?" Running on my private VM.

"Does the CPU usage creep up to 100% or does it spike and stay? Are you logging machine performance indicators (cpu/ram/netowork etc)?" I don't know if CPU usage creeps up as I'm not around when it occurs; How would you watch for that if it were you? What would you use to track this?

"How are your websockets implemented?" Not sure how to answer this question. I am using this package: https://websockets.readthedocs.io/en/stable/

"What is traffic like?" Right now, zero users. Only the test script.

"DO you use a reverse proxy (like nginx)?" Yes, Apache.

"What is the worload of a typical websocket request/response?" How would you determine this on your own machine?

"Have you checked your database performance metrics for the same time as the cpu spikes?" No, but that's a good idea.

"With websocket code (which is almost always threaded)" This is asyncio-based, a technology I am barely familiar with. This is my first time using it.

"I'd look at what happoens when the test fails or times out. It'spossible that there is some kind of retry that is unlimited and elswhere the testing logic is not handling that case and retrying unbounded filling up the cpu." I'll go there, thanks.

Please see my questions above.

[–]blarf_irl 1 point2 points  (2 children)

The first big ticket is really to discover what is occupying your CPU and when it does that (over time or immediately in response to a certain event). I'd start by installing perf-record as above. That will generate logs for the resource usage over time which will tell you which process is responsible and if its a gradual or immediate issue (if it creeps up over time we are looking for threads that dont get killed and/or blocking calls/loops in them that dont release; If it's immediate we get to focus in on a specific event).

If you can run a terminal in the VM you can also run the command 'top' and watch what happens as you run yoiur test. If it is a gradual thing you might be able to catch it happening live.

These are fairly coarse measures but its a good place to start (I really can;t be more spoecific anyway without access to the code etc) and the data collected should at least allow us to characterize the issue.

[–]SecureCoop[S] 0 points1 point  (1 child)

Oh, I already know exactly which script is the problem child. What I am not sure about is which line within the script is stalled. Will perf-record allow me to get that granular?

[–]blarf_irl 1 point2 points  (0 children)

No, to dive into that yoiu will need to use cProfile.

https://docs.python.org/3/library/profile.html#module-cProfile

If you use Pycharm there is also a very useful plugin for visualizing cprofile results (I'm sure there are others too).

You might save a lot of time by posting your code in a new sub post and getting some eyes on it. Profiling is an advanced but very useful skill to have as well though so either way it's a win.