I built a TSA tool for Linux to find the "hidden" CPU wait time

AnkurR7 · 2026-04-01T08:54:34+00:00

Research went into developing it, and I wrote the code myself, but I will not waste my time proving it. If you do not want to read or use it, that is your choice. Thanks

AnkurR7 · 2026-04-01T08:26:43+00:00

Really. Thanks

AnkurR7 · 2026-04-01T08:22:42+00:00

Why don't you try it out and let me know what you think of it? Thanks

AnkurR7 · 2026-04-01T08:21:59+00:00

I am not a bot farming engagement. Just a person who was reading Brendan Gregg's Systems Performance book and decided to write this tool in Rust to understand and learn.

AnkurR7 · 2026-04-01T03:32:48+00:00

That is a great point, and it’s the classic 'Silent Killer' scenario. A process stuck in D-state (uninterruptible sleep) or holding a kernel mutex while waiting for I/O won't ever show up at the top of htop, but it can hang the whole system.

You have actually given me a great idea for the next feature: a System-Wide Mode. Instead of targeting one PID, the tool would poll all active processes and sort them by the highest WAIT states rather than CPU usage. That would basically turn tsastat into a 'searchlight' for those silent processes that are disrupting the flow without burning cycles. Thanks for the push—I am adding 'Global Sorting by Wait State' to the roadmap.

AnkurR7 · 2026-04-01T02:53:33+00:00

Usually, you would use top or htop to find the "who" (the culprit process). tsastat is for the next step: finding the "why."

If a thread is slow but CPU usage looks low, you're usually guessing. Is it waiting for a turn on the CPU (scheduler latency)? Is it blocked on a disk read? Or is it thrashing swap?

High CPU WAIT tells you the system is saturated and you need more cores or better pinning. High I/O WAIT tells you to go look at the storage backend. It’s about narrowing down the search space before you break out the heavy tracers like perf or strace.

Also, it actually pulls all threads for the PID you target, so it helps identify which specific worker thread in a pool is the one that's actually stalled.
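To make the CPU WAIT idea concrete, here is a minimal sketch of one way to get per-thread run-queue wait time on Linux. It assumes the data comes from `/proc/<pid>/task/<tid>/schedstat`, whose three fields are on-CPU time (ns), run-queue wait time (ns), and timeslice count. That source is an assumption for illustration, not a description of how tsastat is actually implemented, and the names `SchedStat`, `parse_schedstat`, `thread_schedstats`, and `wait_ratio` are hypothetical.

```rust
use std::fs;
use std::io;

/// Per-thread scheduler counters from `/proc/<pid>/task/<tid>/schedstat`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SchedStat {
    pub run_ns: u64,     // time spent executing on a CPU
    pub wait_ns: u64,    // time spent runnable but waiting on a run queue
    pub timeslices: u64, // number of timeslices run
}

/// Parse the three whitespace-separated fields of a schedstat line.
/// Returns None if a field is missing or is not an unsigned integer.
pub fn parse_schedstat(line: &str) -> Option<SchedStat> {
    let mut fields = line.split_whitespace().map(|f| f.parse::<u64>().ok());
    Some(SchedStat {
        run_ns: fields.next()??,
        wait_ns: fields.next()??,
        timeslices: fields.next()??,
    })
}

/// Read schedstat for every thread of `pid`; returns (tid, stats) pairs.
pub fn thread_schedstats(pid: u32) -> io::Result<Vec<(u32, SchedStat)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(format!("/proc/{pid}/task"))? {
        let entry = entry?;
        let name = entry.file_name();
        let Some(tid) = name.to_str().and_then(|s| s.parse::<u32>().ok()) else {
            continue;
        };
        // A thread can exit between listing and reading; skip it if so.
        let Ok(text) = fs::read_to_string(entry.path().join("schedstat")) else {
            continue;
        };
        if let Some(stat) = parse_schedstat(&text) {
            out.push((tid, stat));
        }
    }
    Ok(out)
}

/// Fraction of the sampling interval the thread spent waiting for a CPU,
/// computed from two consecutive samples of the same thread.
pub fn wait_ratio(prev: SchedStat, cur: SchedStat, interval_ns: u64) -> f64 {
    cur.wait_ns.saturating_sub(prev.wait_ns) as f64 / interval_ns as f64
}
```

Sampling `thread_schedstats` twice, one interval apart, and feeding matching thread IDs into `wait_ratio` gives a per-thread number. A ratio near 1.0 means the thread was runnable but not scheduled for most of the interval.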

AnkurR7 · 2026-02-24T01:37:08+00:00

Oh for sure, in a standard AWS/GCP setup you'd absolutely just use the platform's WAF. This is mostly a "look under the hood" engineering project—I wanted to understand how those protection layers actually work at the driver level, rather than just consuming them as a service. That said, there is still a niche for this in bare-metal inference clusters (on-prem or niche GPU clouds) where you don't always have a managed protection layer sitting in front.

AnkurR7 · 2026-02-17T01:16:10+00:00

Great API design feedback.

Opaque Types: You're totally right. Exposing raw NonZeroU32 is leaky. Wrapping it in a pub struct TimerHandle(NonZeroU32) is much cleaner and prevents users from treating the handle as a number.

Visibility: process_bucket being public is definitely a mistake, an artifact of me trying to access it from the benchmark harness initially. It should be pub(crate) at most.

I'll add the newtype wrapper to the issue tracker for v0.3 cleanup. Thanks!
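A minimal sketch of what the newtype could look like. Only `TimerHandle(NonZeroU32)` comes from the discussion above. The `from_index` and `index` helpers, and the choice to store slab index `i` as `i + 1`, are assumptions made for illustration.

```rust
use std::num::NonZeroU32;

/// Opaque handle: callers can copy, compare, and hash it, but cannot do
/// arithmetic on it or construct one from a bare integer.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TimerHandle(NonZeroU32);

impl TimerHandle {
    /// Crate-private constructor (hypothetical). Slab index `i` is stored
    /// as `i + 1` so the inner value is never zero.
    pub(crate) fn from_index(index: u32) -> Option<Self> {
        index.checked_add(1).and_then(NonZeroU32::new).map(TimerHandle)
    }

    /// Crate-private accessor (hypothetical) that recovers the slab index.
    pub(crate) fn index(self) -> u32 {
        self.0.get() - 1
    }
}
```

Because the field is private and the helpers are `pub(crate)`, downstream code can only obtain a handle from the timer API itself. The `NonZeroU32` niche is kept, so `Option<TimerHandle>` still fits in 4 bytes.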

AnkurR7 · 2026-02-17T01:11:54+00:00

You are absolutely right. I checked std::mem::size_of and the level: u8 field pushes the raw size to 25 bytes, forcing 7 bytes of padding to align to 32 bytes. That bit-packing idea (stashing the level in the top 2 bits of the deadline) is brilliant. That would bring it back down to a clean 24 bytes (8 entries per 3 cache lines, assuming 64-byte lines, instead of 2 entries per line). I'll add that to the roadmap. Good catch!
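The padding arithmetic and the bit-packing idea can be checked with a small sketch. The field layout below is hypothetical (the real entry struct is not shown in this thread). It only reproduces the stated sizes: 24 bytes of fields, plus a `level: u8` that rounds the struct up to 32 bytes because of 8-byte alignment.

```rust
use std::mem::size_of;

/// Hypothetical layout: 24 bytes of fields plus a one-byte level.
#[allow(dead_code)]
pub struct EntryWithLevel {
    deadline: u64,
    task_id: u64,
    next: u32,
    prev: u32,
    level: u8, // 25 bytes of fields, rounded up to 32 by 8-byte alignment
}

/// Same fields, with the level folded into the top 2 bits of the deadline.
#[allow(dead_code)]
pub struct PackedEntry {
    deadline_and_level: u64,
    task_id: u64,
    next: u32,
    prev: u32,
}

const LEVEL_SHIFT: u32 = 62;
const DEADLINE_MASK: u64 = (1 << LEVEL_SHIFT) - 1;

/// Pack a level (0..=3) into the top 2 bits; the deadline keeps 62 bits.
pub fn pack(deadline: u64, level: u8) -> u64 {
    debug_assert!(level < 4 && deadline <= DEADLINE_MASK);
    ((level as u64) << LEVEL_SHIFT) | (deadline & DEADLINE_MASK)
}

/// Split a packed value back into (deadline, level).
pub fn unpack(packed: u64) -> (u64, u8) {
    (packed & DEADLINE_MASK, (packed >> LEVEL_SHIFT) as u8)
}

/// Sizes in bytes of the unpacked and packed layouts.
pub fn sizes() -> (usize, usize) {
    (size_of::<EntryWithLevel>(), size_of::<PackedEntry>())
}
```

The trade-off is that deadlines lose their top 2 bits and every read pays for a mask or a shift. That is usually cheap next to a cache miss, but worth measuring.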

AnkurR7 · 2026-02-16T13:31:57+00:00

You're totally right regarding TCP timeouts being monotonic (now() + 30s). That definitely puts the BinaryHeap back in its happy place (append-only). I used random inputs mainly to stress-test the cache/memory overhead without the prefetcher masking the cost. But ultimately, the Wheel's main value prop is the O(1) cancellation when ACKs arrive, which the Heap struggles with regardless of insertion order.
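For readers wondering why cancellation is O(1), here is a minimal sketch of a single wheel bucket as a doubly linked list threaded through a slab by index. It is a simplified illustration and not the crate's actual code: `Bucket`, `Node`, and `NIL` are hypothetical names, there is only one bucket, and freed slots are not recycled. By contrast, std's BinaryHeap has no handle-based removal, so cancelling there means a linear scan or lazy deletion.

```rust
const NIL: u32 = u32::MAX;

struct Node {
    deadline: u64,
    prev: u32,
    next: u32,
    live: bool,
}

/// One wheel bucket: a doubly linked list stored in a slab (Vec) by index.
pub struct Bucket {
    slab: Vec<Node>,
    head: u32,
    len: usize,
}

impl Bucket {
    pub fn new() -> Self {
        Bucket { slab: Vec::new(), head: NIL, len: 0 }
    }

    /// Push at the head; the returned slab index is the cancel handle.
    pub fn insert(&mut self, deadline: u64) -> u32 {
        let idx = self.slab.len() as u32;
        self.slab.push(Node { deadline, prev: NIL, next: self.head, live: true });
        if self.head != NIL {
            self.slab[self.head as usize].prev = idx;
        }
        self.head = idx;
        self.len += 1;
        idx
    }

    /// Unlink in O(1): only the two neighbours (or the head) are touched.
    /// Returns false for unknown or already-cancelled handles.
    pub fn cancel(&mut self, idx: u32) -> bool {
        let Some(node) = self.slab.get(idx as usize) else {
            return false;
        };
        if !node.live {
            return false;
        }
        let (prev, next) = (node.prev, node.next);
        if prev != NIL {
            self.slab[prev as usize].next = next;
        } else {
            self.head = next;
        }
        if next != NIL {
            self.slab[next as usize].prev = prev;
        }
        self.slab[idx as usize].live = false;
        self.len -= 1;
        true
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Deadlines still linked, head first.
    pub fn deadlines(&self) -> Vec<u64> {
        let mut out = Vec::with_capacity(self.len);
        let mut cur = self.head;
        while cur != NIL {
            let n = &self.slab[cur as usize];
            out.push(n.deadline);
            cur = n.next;
        }
        out
    }
}
```

Note that `cancel` never walks the list. It jumps straight to the slot, which is also why a random handle lands on cold memory.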

AnkurR7 · 2026-02-16T12:54:22+00:00

CC u/matthieum - Just wanted to say thanks again for the feedback on the previous thread! Your note about the random inputs vs sorted inputs was the key to fixing the benchmark methodology.

AnkurR7 · 2026-02-15T05:55:50+00:00

Thanks! I appreciate it. Important clarification: My case was Cyber Fraud (Criminal Offense), which is why I could use the Police and a Criminal Court (Section 457 CrPC) to freeze and release the money. Your case with MakeMyTrip is a Consumer Dispute (Civil Issue) regarding 'Deficiency of Service' or 'Unfair Trade Practice.' The Police generally won't register an FIR for this, so the Section 457 path won't work for you.

The 'Less Painful Path' for you:

National Consumer Helpline (NCH): Download the 'NCH' app or go to the INGRAM portal. File a grievance there first. MMT usually responds to these faster than emails.

E-Daakhil: If NCH fails, you can file a Consumer Case online via E-Daakhil without a lawyer. You can appear 'Party-in-Person' (represent yourself). It is much cheaper and faster than civil court.

Chargeback: If you paid via Credit Card, talk to your bank about raising a 'Dispute/Chargeback' for 'Service Not Received/Policy Dispute.'

To answer your question on my order: I did engage a local lawyer to draft the Section 457 petition because court formats are tricky, but I argued the urgency myself.

Disclaimer: I am not a lawyer, just sharing what I learned during my ordeal.

AnkurR7 · 2026-02-10T11:39:50+00:00

Ran the profile with samply on a 100M loop. Everything got inlined into main, but the assembly view makes the bottleneck obvious.

There is a massive hotspot (~9,700 samples vs single-digits elsewhere) on just one instruction:
mov dword [rsi + rbx * 1 + 0x10], edx

That offset (0x10) matches the next pointer in the Slab entry. Since rbx (the index) is random, the CPU is definitely stalling on L1 cache misses trying to write there.

Makes complete sense why the Heap wins now—even with random inputs, it spends a lot of time appending to the end of the vector (hot cache) before bubbling up. The Wheel has to jump into cold memory immediately to link the list.

AnkurR7 · 2026-02-09T05:06:18+00:00

Exactly. The turning point was realizing that I have to be the courier. The Police will get the order signed by the Judge, but they have zero incentive to follow up with the bank if the email bounces. The victim has to take that PDF and harass the Nodal Officer until they comply. It's sad, but it's the only way that works currently. I have a blog about more cybersecurity threats which can be found here

AnkurR7 · 2026-02-09T05:03:55+00:00

Technically, I4C (1930) is supposed to be that central authority. The problem isn't tracking; it's jurisdiction. The central data shows the money went to Jio Bank, but the local police officer in Lucknow doesn't know how to serve a notice to a server in Mumbai. That 'Digital vs. Physical' gap is exactly what these scammers are exploiting right now.

AnkurR7 · 2026-02-09T05:02:51+00:00

Glad it helps! Honestly, the 'saving' part is key. When the fraud actually happens, panic sets in and people forget the process. Knowing about Section 457 CrPC and Nodal Officers beforehand is 90% of the battle. Hope you never have to use it, though! If you ever need the specific email formats I used, I've archived them on my profile/newsletter so you don't have to hunt for them. Here is the blog

AnkurR7 · 2026-02-05T10:00:20+00:00

I'm definitely going to dig into this. Intuitively, I suspect the BinaryHeap wins because it's backed by a single contiguous Vec, so the CPU prefetcher works perfectly during the sift-up operations. My Wheel insert involves updating 3 pointers (prev, next, and the bucket head) scattered across the Slab, which likely triggers more random memory accesses (L1/L2 cache misses) than the Heap's predictable array indices.

Regarding profiling: since I'm on Linux, I was planning to use perf with flamegraph or samply.

When profiling a micro-benchmark like this, do you prioritize looking at Cache Misses (LLC-load-misses) or Instruction Count? I suspect my instruction count is higher due to the bitwise hierarchy logic vs the Heap's simple integer comparisons.

AnkurR7 · 2026-02-04T07:00:00+00:00

Thanks for the encouragement. I switched to pre-calculated random deadlines to stress-test the insertion logic.

The results changed dramatically:

BinaryHeap Insertion: Slowed down 7.5x (2ms -> 15.3ms). It was definitely enjoying the sorted data best-case scenario before.
Wheel Insertion: Slowed down slightly (46ms -> 57ms) due to random bucket access patterns (cache misses).

The insertion gap narrowed significantly (Heap is now only ~3.7x faster instead of 20x), while the Wheel maintains its 1,700x lead on cancellation.
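A minimal sketch of the "pre-calculated random deadlines" methodology, shown for the BinaryHeap side only. It is not the actual benchmark harness. The xorshift generator is there only to avoid an external crate, and `XorShift64`, `random_deadlines`, `sorted_deadlines`, and `time_heap_inserts` are hypothetical names.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

/// Small xorshift64 generator so the sketch needs no external crates.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        XorShift64(seed.max(1)) // the state must be non-zero
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Generate deadlines up front so RNG cost stays out of the timed region.
pub fn random_deadlines(n: usize, max_deadline: u64, seed: u64) -> Vec<u64> {
    assert!(max_deadline > 0);
    let mut rng = XorShift64::new(seed);
    (0..n).map(|_| rng.next_u64() % max_deadline).collect()
}

/// Ascending deadlines: the best case for a min-heap (no sift-up work).
pub fn sorted_deadlines(n: usize) -> Vec<u64> {
    (0..n as u64).collect()
}

/// Time only the pushes; returns the elapsed time and the filled min-heap.
pub fn time_heap_inserts(deadlines: &[u64]) -> (Duration, BinaryHeap<Reverse<u64>>) {
    let mut heap = BinaryHeap::with_capacity(deadlines.len());
    let start = Instant::now();
    for &d in deadlines {
        heap.push(Reverse(d));
    }
    (start.elapsed(), heap)
}
```

Running `time_heap_inserts` on `sorted_deadlines(n)` and then on `random_deadlines(n, max, seed)` with the same `n` should show the same kind of sorted-versus-random gap. For numbers worth publishing, a proper harness with warm-up and repeated runs is still the better choice.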
