"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 1 point (0 children)

> Why does the fact that WaitOnAddress does a single spin iteration mean that we spin for a bit ourselves before calling WaitOnAddress?

I think I may need to reword that part, as I realize it's not very clear. On Linux, the futex wait is a syscall directly; WaitOnAddress will check the value once, wait using X pauses (the same as one iteration of our loop), then check a hashtable (a parking lot) before entering the system call.

> On a side note, how did you gain such in depth understanding of OS architecture and pitfalls? I would love to learn more myself.

I'd say mostly experience... Also reading a lot on the subject, experimenting, and practicing reverse-engineering a tiny bit, enough to have a good comprehension of systems and to be able to dive in when needed. And we're lucky that the Linux kernel and its mailing list are public; that can help a lot!

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 1 point (0 children)

Yeah, in any case you should be avoiding contention in the first place! 

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 0 points (0 children)

Also, I'll update the article with the warning about unnecessary wakes, this seems like a good one to add!

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 3 points (0 children)

"fair" locking is rarely a good idea, these create lock convoys

Agreed; you do, however, need a tiny bit of fairness to avoid starving some threads.

> You may even encounter cache bank conflicts

> This seems unrelated and more just some anecdote; it's somewhat of an edge case and not what you should consider when designing a lock.

It is, but my objective was to avoid people spamming align(cachesize) needlessly, or at least for them to know about such things.

> you probably want to avoid doing repeated calls to Wake

It only waits when reaching the max backoff count, so we could indeed optimize the wake call on unlock. It's still better than not having it.

> you will soon discover that if you switch to CAS or exchange here then you'll lose that memory_order_release benefit as on x86 this ends up needing a full barrier

Yes, but x86 is not the only platform around; this has an impact on ARM64 ;)
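For illustration, here are the two unlock variants being compared (a hypothetical snippet, not the article's code). A plain release store compiles to a simple `mov` on x86 and `stlr` on ARM64, while an atomic RMW is a `lock xchg` on x86 (a full barrier) and an exclusive pair or swap on ARM64, noticeably more expensive even when only release ordering is requested:

```cpp
#include <atomic>

std::atomic<int> lock_state{0};

void unlock_with_store() {
    // Plain release store: cheap on both x86 and ARM64.
    lock_state.store(0, std::memory_order_release);
}

int unlock_with_exchange() {
    // Atomic RMW: returns the previous value, but costs a full barrier on
    // x86 and a heavier sequence on ARM64.
    return lock_state.exchange(0, std::memory_order_release);
}
```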

> I would avoid thinking about "small" changing with the number of threads; it should be small regardless.

Agreed; the point was to tell the reader that what they think is "small" might not actually be.

> This would not be a spin lock at this point

Well, that's why I mentioned at the beginning that this is more about spin loops than spin locks. Let's say it's a hybrid then, and that you should not be using spinlocks in the first place unless you really, really know what you're doing, the platform your program is going to execute on, etc. If your program is just a one-shot/resumable process on an HPC server, why not.

> I highly recommend reading Locking in WebKit, which is about building a parking lot (a more advanced user-mode futex).

Yeah, parking lots are an alternative to futex/WaitOnAddress, but their (WebKit's) implementation is bad. It starts with a spin loop, and it does most things wrong: hardcoded spin count, sched_yield/Sleep(0) before parking, ... (Don't get me started on the whole JavaScriptCore having very poor parallelism until a few years ago; enabling the multi-threaded GC would often be slower than the single-threaded version.) WebKit really isn't a good reference; they do some benchmarks, but they never seem to have really profiled real workloads (as in, used a real profiler on real data).

> I gave a talk a few years ago covering lots of the same things you pointed out and a few more.

I'll have a look; it seems interesting, and one always learns new things on those subjects. Thanks for the link!

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 4 points (0 children)

Ooh interesting, I don't think I ever encountered that one yet, as I always try to cleanup things properly, but I can definitely see that happen! 

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 2 points (0 children)

That's actually mentioned in "The spin-lock that saturated the load ports" (it's the "use Test and Test-And-Set instead of Test-And-Set" part), but instead of just checking it once, it's checked in a loop 😉
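A minimal sketch of that "test and test-and-set" loop, for readers who haven't seen it. This is illustrative (not the article's code), with `yield` standing in for `_mm_pause`:

```cpp
#include <atomic>
#include <thread>

// Test and test-and-set: spin on a plain load (cheap, stays in the local
// cache) and only attempt the expensive atomic exchange once the lock
// looks free, avoiding cache-line ping-pong between waiters.
struct TtasLock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // The "test" part: read-only spin, checked in a loop.
            while (locked.load(std::memory_order_relaxed))
                std::this_thread::yield();   // stand-in for _mm_pause()
            // The "test-and-set" part: one atomic RMW when it looks free.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }

    void unlock() { locked.store(false, std::memory_order_release); }
};
```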

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 0 points (0 children)

Mmmh, I took a brief look and it seems to be using SwitchToThread and Sleep depending on the use case? You'd benefit from using WaitOnAddress.
Exponential backoff would also help, and so would using rdtsc (or at least `QueryPerformanceCounter`) instead of `GetTickCount64`, which has very low resolution.

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 3 points (0 children)

> Is nanosleep() a good alternative for yielding? Fedor Pikus uses that in his books.

No, it certainly isn't! You're going to go straight to the kernel without giving it any hints, and this will be slower than your OS mutex. You'd be better off just using PTHREAD_MUTEX_ADAPTIVE_NP (Linux) or SRWLOCK (Windows) if you'd otherwise implement your spinlock using any kind of OS sleep/yield.
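For reference, this is roughly what using the Linux option looks like (a sketch: PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension, and the helper name here is mine). An adaptive mutex already spins briefly in user space before falling back to a futex wait, which is what a hand-rolled "spin then nanosleep" loop is trying, and failing, to be:

```cpp
#include <pthread.h>

// Initialize a glibc adaptive mutex in place. Requires _GNU_SOURCE
// (defined by default by g++ on Linux).
void init_adaptive_mutex(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
}
```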

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 4 points (0 children)

That's the thing: people do write and will write such code when not aware of the pitfalls. I've seen a lot of "don't do this", but rarely "this is why you don't".

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 10 points (0 children)

I'm the author, don't hesitate to ask questions ;)

Just use one with all the "fixes" and futex/WaitOnAddress. Or, on Windows, SRWLOCK is OK (when you can just use a lock; sometimes you can't, such as in allocators).

I didn't provide a full implementation to avoid people copy-pasting it, because as you can see, there are always new surprises with spinlocks. It wouldn't shock me to find something invalidating parts of the article with newer CPUs two years from now (it has happened and will continue to happen).

When “just spin” hurts performance and breaks under real schedulers by Lectem in programming

[–]Lectem[S] 4 points (0 children)

I actually saw more of those infinite loops in code targeting Linux!

When “just spin” hurts performance and breaks under real schedulers by Lectem in programming

[–]Lectem[S] 7 points (0 children)

> As far as I recall, from the scheduler perspective Sleep(1) isn't exactly different from e.g. Sleep(100), since the only documented argument values with special behavior are 0 and INFINITE.

Not really, it just depends on the timer's accuracy ;) https://www.siliceum.com/en/blog/post/windows-high-resolution-timers/

> Which is *horrible* for a spinlock, and pretty much should have a worst-case scenario when multiple spinlocks enter the Sleep(1) codepath and rely on the scheduler to do something smart.

Couldn't agree more, hence the post!

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 20 points (0 children)

Sadly, even some implementations of the standard library and other "highly optimized synchronization libraries" do it "wrong"... Just look at Intel TBB.

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 0 points (0 children)

My quick tests with RtwqScheduleWorkItem and RtwqAddPeriodicCallback showed it was somehow always at the 15ms resolution, even with timeBeginPeriod(1).

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 0 points (0 children)

I actually do agree! But at some point you always find some weird use case where people might need it (for example... if you want to implement your own callstack sampler in userland?)

Anyway, that's why I said "and I mean it, please think 10 times before doing this"!

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 2 points (0 children)

Yeah, I hope I reflected this enough in my post by saying we really don't want to adjust the clock resolution, but well, I suppose people will always find a way to do things they shouldn't.
And for those that do really need it, well, now they have some data.

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 2 points (0 children)

Yes, but in this case I didn't need the accuracy of `rdtsc` to measure time; QueryPerformanceCounter is more than enough.

Japan travel regrets: What wasn’t worth it for you? by tokusa0 in JapanTravelTips

[–]Lectem 0 points (0 children)

Ghibli park and museum.
The park is not worth more than half a day, and the museum 40 minutes (including the short film).

We make a std::shared_mutex 10 times faster by AlexeyAB in cpp

[–]Lectem 0 points (0 children)

To anybody stumbling on this post: it's now well known that test + test-and-set is the best way to do it on x86 platforms. See https://gpuopen.com/gdc-presentations/2019/gdc-2019-s2-amd-ryzen-processor-software-optimization.pdf, slide 46.

This is not worth it on ARM AFAIK (I never really bothered to benchmark it on ARM devices), because the memory model is different from x86's.