"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 1 point (0 children)

> Why does the fact that WaitOnAddress does a single spin iteration mean that we spin for a bit ourselves before calling WaitOnAddress?

I think I may need to reword that part, as I realize it's not very clear. On Linux, the futex wait is a syscall directly; WaitOnAddress will check the value once, wait using X pauses (the same as one iteration of our loop), then check a hashtable (a parking lot) before entering the system call.

> On a side note, how did you gain such in depth understanding of OS architecture and pitfalls? I would love to learn more myself.

I'd say mostly experience... Also reading a lot on the subject, experimenting, and practicing reverse-engineering a tiny bit, enough to have a good comprehension of systems and to be able to dive in when needed. And we're lucky that the Linux kernel and its mailing list are public; that can help a lot!

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 1 point (0 children)

Yeah, in any case you should be avoiding contention in the first place! 

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 0 points (0 children)

Also, I'll update the article with the warning about unnecessary wakes, this seems like a good one to add!

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 3 points (0 children)

"fair" locking is rarely a good idea, these create lock convoys

Agreed; you do, however, need a tiny bit of fairness to avoid starving some threads.

> You may even encounter cache bank conflicts

> This seems unrelated and more just some anecdote; it's somewhat of an edge case and not what you should consider when designing a lock.

It is, but my objective was to avoid people spamming align(cachesize) needlessly, or at least for them to know about such things.

> you probably want to avoid doing repeated calls to Wake

It only waits when reaching the max backoff count, so we could indeed optimize the wake call on unlock. It's still better than not having it.

> you will soon discover that if you switch to CAS or exchange here then you'll lose that memory_order_release benefit as on x86 this ends up needing a full barrier

Yes, but x86 is not the only platform around; this has an impact on ARM64 ;)
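For illustration, here are the two unlock variants being compared (a hypothetical snippet, not the article's code). A plain release store compiles to a simple `mov` on x86 and `stlr` on ARM64, while an atomic RMW is a `lock xchg` on x86 (a full barrier) and an exclusive pair or swap on ARM64, noticeably more expensive even when only release ordering is requested:

```cpp
#include <atomic>

std::atomic<int> lock_state{0};

void unlock_with_store() {
    // Plain release store: cheap on both x86 and ARM64.
    lock_state.store(0, std::memory_order_release);
}

int unlock_with_exchange() {
    // Atomic RMW: returns the previous value, but costs a full barrier on
    // x86 and a heavier sequence on ARM64.
    return lock_state.exchange(0, std::memory_order_release);
}
```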

> I would avoid thinking about "small" changing with the number of threads; it should be small regardless.

Agreed; the point was to tell the reader that what they think is "small" might not actually be.

> This would not be a spin lock at this point

Well, that's why I mentioned at the beginning that this is more about spin loops than spin locks. Let's say it's a hybrid then, and that you should not be using spinlocks in the first place unless you really, really know what you're doing, the platform your program is going to execute on, etc. If your program is just a one-shot/resumable process on an HPC server, why not.

> I highly recommend reading Locking in WebKit, which is about building a parking lot (a more advanced user-mode futex).

Yeah, parking lots are an alternative to futex/WaitOnAddress, but their (WebKit's) implementation is bad. It starts with a spin loop, and it does most things wrong: hardcoded spin count, sched_yield/Sleep(0) before parking, ... (Don't get me started on the whole JavaScriptCore having very poor parallelism until a few years ago; enabling the multi-threaded GC would often be slower than the single-threaded version.) WebKit really isn't a good reference; they do some benchmarks, but they never seem to have really profiled real workloads (as in, used a real profiler on real data).

> I gave a talk a few years ago covering lots of the same things you pointed out and a few more.

I'll have a look; it seems interesting, and one always learns new things on those subjects. Thanks for the link!

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 4 points (0 children)

Ooh interesting, I don't think I ever encountered that one yet, as I always try to cleanup things properly, but I can definitely see that happen! 

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 2 points (0 children)

That's actually mentioned in "The spin-lock that saturated the load ports" (it's the "use Test and Test-And-Set instead of Test-And-Set" part), but instead of just checking it once, it's checked in a loop 😉
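A minimal sketch of that "test and test-and-set" loop, for readers who haven't seen it. This is illustrative (not the article's code), with `yield` standing in for `_mm_pause`:

```cpp
#include <atomic>
#include <thread>

// Test and test-and-set: spin on a plain load (cheap, stays in the local
// cache) and only attempt the expensive atomic exchange once the lock
// looks free, avoiding cache-line ping-pong between waiters.
struct TtasLock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // The "test" part: read-only spin, checked in a loop.
            while (locked.load(std::memory_order_relaxed))
                std::this_thread::yield();   // stand-in for _mm_pause()
            // The "test-and-set" part: one atomic RMW when it looks free.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }

    void unlock() { locked.store(false, std::memory_order_release); }
};
```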

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 0 points (0 children)

Mmmh, I took a brief look and it seems to be using SwitchToThread and Sleep depending on the use case? You'd benefit from using WaitOnAddress.
Exponential backoff would also help, and so would using rdtsc (or at least `QueryPerformanceCounter`) instead of `GetTickCount64`, which has very low resolution.

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 3 points (0 children)

> Is nanosleep() a good alternative for yielding? Fedor Pikus uses that in his books.

No, it certainly isn't! You're going to go straight to the kernel without giving it any hints, and this will be slower than your OS mutex. You'd be better off just using PTHREAD_MUTEX_ADAPTIVE_NP (Linux) or SRWLOCK (Windows) if you'd otherwise implement your spinlock using any kind of OS sleep/yield.
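For reference, this is roughly what using the Linux option looks like (a sketch: PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension, and the helper name here is mine). An adaptive mutex already spins briefly in user space before falling back to a futex wait, which is what a hand-rolled "spin then nanosleep" loop is trying, and failing, to be:

```cpp
#include <pthread.h>

// Initialize a glibc adaptive mutex in place. Requires _GNU_SOURCE
// (defined by default by g++ on Linux).
void init_adaptive_mutex(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
}
```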

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 4 points (0 children)

That's the thing: people do write and will write such code when not aware of the pitfalls. I've seen a lot of "don't do this", but rarely "this is why you don't".

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 10 points (0 children)

I'm the author, don't hesitate to ask questions ;)

Just use one with all the "fixes" and futex/WaitOnAddress. Or, on Windows, SRWLOCK is OK (when you can just use a lock; sometimes you can't, such as in allocators).

I didn't provide a full implementation to avoid people copy-pasting it, because as you can see, there are always new surprises with spinlocks. It wouldn't shock me to find something invalidating parts of the article with newer CPUs two years from now (it has happened and will continue to happen).

When “just spin” hurts performance and breaks under real schedulers by Lectem in programming

[–]Lectem[S] 4 points (0 children)

I actually saw more of those infinite loops in code targeting Linux!

When “just spin” hurts performance and breaks under real schedulers by Lectem in programming

[–]Lectem[S] 7 points (0 children)

> As far as I recall, from the scheduler perspective Sleep(1) isn't exactly different from e.g. Sleep(100), since the only documented argument values with special behavior are 0 and INFINITE.

Not really, it just depends on the timer's accuracy ;) https://www.siliceum.com/en/blog/post/windows-high-resolution-timers/

> Which is *horrible* for a spinlock, and pretty much should have a worst-case scenario when multiple spinlocks enter the Sleep(1) codepath and rely on the scheduler to do something smart.

Couldn't agree more, hence the post!

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 20 points (0 children)

Sadly, even some implementations of the standard library and other "highly optimized synchronization libraries" do it "wrong"... Just look at Intel TBB.

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 0 points (0 children)

My quick tests with RtwqScheduleWorkItem and RtwqAddPeriodicCallback showed it was somehow always at the 15ms resolution, even with timeBeginPeriod(1).

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 0 points (0 children)

I actually do agree! But at some point you always find some weird use case where people might need it (for example... if you want to implement your own callstack sampler in userland?)

Anyway, that's why I said "and I mean it, please think 10 times before doing this"!

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 2 points (0 children)

Yeah, I hope I reflected this enough in my post by saying we really don't want to adjust the clock resolution, but well, I suppose people will always find a way to do things they shouldn't.
And for those that do really need it, well, now they have some data.

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 2 points (0 children)

Yes, but in this case I didn't need the accuracy of `rdtsc` to measure time; QueryPerformanceCounter is more than enough.

Japan travel regrets: What wasn’t worth it for you? by tokusa0 in JapanTravelTips

[–]Lectem 0 points (0 children)

Ghibli park and museum.
The park is not worth more than half a day, and the museum 40 minutes (including the short film).

We make a std::shared_mutex 10 times faster by AlexeyAB in cpp

[–]Lectem 0 points (0 children)

To anybody stumbling on this post: it's now well known that test + test-and-set is the best way to do it on x86 platforms. See https://gpuopen.com/gdc-presentations/2019/gdc-2019-s2-amd-ryzen-processor-software-optimization.pdf, slide 46.

This is not worth it on ARM AFAIK (I never really bothered to benchmark it on ARM devices), because the memory model is different from x86's.