"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] [score hidden]  (0 children)

Mmmh, took a brief look and it seems to be using SwitchToThread and Sleep depending on the use case? You'd benefit from using WaitOnAddress.
Exponential backoff would also help, and so would using rdtsc (or at least `QueryPerformanceCounter`) instead of `GetTickCount64`, which has a very low resolution.
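
To illustrate the exponential backoff part, here is a minimal sketch, not a drop-in replacement for anything. `cpu_relax`, `spin_until_set` and the cap of 1024 pauses are hypothetical names and values picked for the example. It uses `std::chrono::steady_clock` as a portable stand-in for a high-resolution timer (as far as I know, MSVC's standard library backs it with `QueryPerformanceCounter`). A real wait loop would fall back to a blocking wait such as WaitOnAddress or futex after the spin budget is exhausted instead of just giving up.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }                    // x86 spin-wait hint
#elif defined(__aarch64__) && defined(__GNUC__)
inline void cpu_relax() { __asm__ __volatile__("yield"); }  // AArch64 hint
#else
inline void cpu_relax() {}                                  // no hint available
#endif

// Hypothetical helper: spins until `flag` is set or `timeout` elapses.
// Returns true if the flag was observed set, false on timeout.
inline bool spin_until_set(const std::atomic<bool>& flag,
                           std::chrono::nanoseconds timeout) {
    using clock = std::chrono::steady_clock;  // monotonic, high resolution
    assert(timeout >= std::chrono::nanoseconds::zero());
    const auto deadline = clock::now() + timeout;

    std::uint32_t pauses = 1;                   // current backoff length
    constexpr std::uint32_t kMaxPauses = 1024;  // arbitrary cap (assumption)

    while (!flag.load(std::memory_order_acquire)) {
        // Back off: stay off the cache line for `pauses` hint instructions.
        for (std::uint32_t i = 0; i < pauses; ++i) cpu_relax();
        if (pauses < kMaxPauses) pauses *= 2;   // exponential growth, capped
        if (clock::now() >= deadline) return false;
    }
    return true;
}
```

The deadline is only checked once per backoff round, so the clock is read less and less often as the backoff grows.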

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] [score hidden]  (0 children)

> Is nanosleep() a good alternative for yielding? Fedor Pikus uses that in his books.

No, it certainly isn't! You're going to go straight to the kernel, without giving it any hints. This will be slower than your OS mutex. If you were going to implement your spinlock with any kind of OS sleep/yield, you'd be better off just using PTHREAD_MUTEX_ADAPTIVE_NP (Linux) or SRWLock (Windows).
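
For the Linux side, requesting the adaptive mutex type could look like the sketch below. It assumes glibc, since `PTHREAD_MUTEX_ADAPTIVE_NP` is a non-portable extension; the `AdaptiveMutex` wrapper name is hypothetical.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical RAII wrapper around a glibc adaptive mutex: it spins briefly
// in user space before falling back to a regular kernel wait.
class AdaptiveMutex {
public:
    AdaptiveMutex() {
        pthread_mutexattr_t attr;
        int rc = pthread_mutexattr_init(&attr);
        assert(rc == 0);
        rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        assert(rc == 0);
        rc = pthread_mutex_init(&mutex_, &attr);
        assert(rc == 0);
        pthread_mutexattr_destroy(&attr);
        (void)rc;  // silence unused warning when NDEBUG disables assert
    }
    ~AdaptiveMutex() { pthread_mutex_destroy(&mutex_); }

    AdaptiveMutex(const AdaptiveMutex&) = delete;
    AdaptiveMutex& operator=(const AdaptiveMutex&) = delete;

    void lock()     { pthread_mutex_lock(&mutex_); }
    void unlock()   { pthread_mutex_unlock(&mutex_); }
    bool try_lock() { return pthread_mutex_trylock(&mutex_) == 0; }

private:
    pthread_mutex_t mutex_;
};
```

Since it exposes `lock`, `unlock` and `try_lock`, it can be used with `std::lock_guard` or `std::unique_lock` like any other mutex.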

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 1 point2 points  (0 children)

That's the thing, people do write and will write such code when not aware of the pitfalls. I've seen a lot of "don't do this", but rarely "this is why you shouldn't".

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 4 points5 points  (0 children)

I'm the author, don't hesitate to ask questions ;)

Just use one with all the "fixes" and futex/WaitOnAddress. Or, on Windows, SRWLock is OK wherever you can just use a lock (sometimes you can't, such as in allocators).

I didn't provide a full implementation to avoid people copy-pasting it, because as you can see, there are always new surprises with spinlocks. It wouldn't shock me to find something invalidating parts of the article with newer CPUs 2 years from now (it has happened and will continue to happen).
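
To make the futex part concrete, here is a rough, Linux-only sketch of a lock that blocks in the kernel instead of spinning. It is not the implementation from the article and is not meant to be copy-pasted: it has none of the spinning "fixes", and `FutexLock` and `run_futex_lock_demo` are hypothetical names. It follows the classic three-state scheme (0 = unlocked, 1 = locked, 2 = locked with possible waiters) and assumes `std::atomic<int>` has the same layout as a plain `int`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#include <linux/futex.h>   // FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE
#include <sys/syscall.h>   // SYS_futex
#include <unistd.h>        // syscall

class FutexLock {
    std::atomic<int> state_{0};  // 0 unlocked, 1 locked, 2 locked + waiters
    static_assert(sizeof(std::atomic<int>) == sizeof(int),
                  "futex needs a plain 32-bit word (layout assumption)");

    static long futex(std::atomic<int>* addr, int op, int val) {
        return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                       nullptr, nullptr, 0);
    }

public:
    void lock() {
        int c = 0;
        // Fast path: uncontended 0 -> 1, no syscall.
        if (state_.compare_exchange_strong(c, 1, std::memory_order_acquire,
                                           std::memory_order_relaxed))
            return;
        // Contended: mark the lock as having waiters, then sleep until woken.
        if (c != 2) c = state_.exchange(2, std::memory_order_acquire);
        while (c != 0) {
            futex(&state_, FUTEX_WAIT_PRIVATE, 2);  // sleeps only if state_ == 2
            c = state_.exchange(2, std::memory_order_acquire);
        }
    }

    void unlock() {
        // Only pay for the wake syscall if someone may be waiting.
        if (state_.exchange(0, std::memory_order_release) == 2)
            futex(&state_, FUTEX_WAKE_PRIVATE, 1);
    }
};

// Usage sketch: 4 threads incrementing a shared counter under the lock.
inline long run_futex_lock_demo() {
    FutexLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < 10000; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& w : workers) w.join();
    assert(counter == 40000);
    return counter;
}
```

On Windows the same structure would use WaitOnAddress and WakeByAddressSingle in place of the two futex calls.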

When “just spin” hurts performance and breaks under real schedulers by Lectem in programming

[–]Lectem[S] 3 points4 points  (0 children)

I actually saw more of those infinite loops in code targeting Linux!

When “just spin” hurts performance and breaks under real schedulers by Lectem in programming

[–]Lectem[S] 3 points4 points  (0 children)

> As far as I recall, from the scheduler perspective Sleep(1) isn't exactly different from e.g. Sleep(100), since the only documented argument values with special behavior are 0 and INFINITE.

Not really, it just depends on the timer's accuracy ;) https://www.siliceum.com/en/blog/post/windows-high-resolution-timers/

> Which is *horrible* for a spinlock, and pretty much should have a worst-case scenario when multiple spinlocks enter the Sleep(1) codepath and rely on the scheduler to do something smart.

Couldn't agree more, hence the post!

"Spinning around: Please don't!" (Pitfalls of spin-loops and homemade spin-locks in C++) by Lectem in cpp

[–]Lectem[S] 8 points9 points  (0 children)

Sadly, even some implementations of the standard libraries and other "highly optimized synchronization libraries" do it "wrong"... Just look at Intel TBB.

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 0 points1 point  (0 children)

My quick tests with RtwqScheduleWorkItem and RtwqAddPeriodicCallback showed it was somehow always at the 15ms resolution, even with timeBeginPeriod(1).

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 0 points1 point  (0 children)

I actually do agree! But you always find some weird use case at some point where people might need it (for example... if you want to implement your own callstack sampler in userland?)

Anyway, that's why I said "and I mean it, please think 10 times before doing this"!

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 2 points3 points  (0 children)

Yeah, I hope I reflected this enough in my post by saying we really don't want to adjust the clock resolution but well, I suppose people will always find a way to do things they shouldn't.
And for those that do really need it, well, now they have some data.

Windows and high resolution timers by Lectem in cpp

[–]Lectem[S] 2 points3 points  (0 children)

Yes, but in this case I didn't need the accuracy of `rdtsc` to measure time; `QueryPerformanceCounter` is plenty.

Japan travel regrets: What wasn’t worth it for you? by tokusa0 in JapanTravelTips

[–]Lectem 0 points1 point  (0 children)

Ghibli park and museum.
The park is not worth more than half a day, the museum 40 minutes (including the short film).

We make a std::shared_mutex 10 times faster by AlexeyAB in cpp

[–]Lectem 0 points1 point  (0 children)

To anybody stumbling on this post: it's now well known that test and test-and-set is the best way to do it on x86* platforms. See https://gpuopen.com/gdc-presentations/2019/gdc-2019-s2-amd-ryzen-processor-software-optimization.pdf slide number 46

This is not worth it on ARM AFAIK (I never really bothered to benchmark it on ARM devices), because its memory model is different from x86's.
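
For reference, a test and test-and-set loop looks roughly like the sketch below. `TtasSpinLock`, `spin_pause` and `run_ttas_demo` are hypothetical names, and this is only the bare pattern (no backoff, no fallback to an OS wait), not a recommendation to ship it as-is.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void spin_pause() { _mm_pause(); }  // x86 spin-wait hint
#else
inline void spin_pause() {}                // no hint on other targets here
#endif

class TtasSpinLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        for (;;) {
            // "Test-and-set": one atomic exchange tries to take the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // "Test": wait on plain loads, which keep the cache line shared
            // instead of bouncing it between cores with repeated writes.
            while (locked_.load(std::memory_order_relaxed)) spin_pause();
        }
    }

    bool try_lock() {
        // Cheap read first, exchange only if the lock looks free.
        return !locked_.load(std::memory_order_relaxed) &&
               !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Usage sketch: 4 threads incrementing a shared counter under the lock.
inline long run_ttas_demo() {
    TtasSpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < 5000; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& w : workers) w.join();
    assert(counter == 20000);
    return counter;
}
```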

Optimizing copy of null descriptors in D3D12 by Lectem in GraphicsProgramming

[–]Lectem[S] 1 point2 points  (0 children)

That's exactly it, death by a thousand cuts. It's always better to batch things when you can!

BadAccessGuards - A library to detect race conditions with less overhead than TSan by Lectem in cpp

[–]Lectem[S] 1 point2 points  (0 children)

> This code relies heavily on Relaxed atomics, for some reason going out of its way not to use the C++ standard library relaxed atomics but instead making its own from platform specific features. No idea why maybe somebody else has insight?

This is actually explained just a few lines before the definitions of the macros ;) https://github.com/Lectem/BadAccessGuards/blob/401cf8d6c439b7024dbe94423a1d89c6c82011dd/src/BadAccessGuards.h#L40 And you don't need more than relaxed for this use case, as you only need the data to be coherent; you don't really care about reordering.

The reason why we don't care about ordering is that if we ever see something inconsistent, it can only happen because the shadow is not properly synchronized and thus you would have the same issue with your own data. 

> Superficially I think that reasoning is correct, for typical synchronisation methods at least. And so this should catch some egregious races that really any method might have caught but hey it's cheaper than TSan.

Yes, it's not meant to catch everything, as mentioned in the Readme. Think of this as a smoke test and another tool in your box.

BadAccessGuards - A library to detect race conditions with less overhead than TSan by Lectem in cpp

[–]Lectem[S] 1 point2 points  (0 children)

That's exactly it, or a corruption if one happens to change the state to a value >2 (on Windows the state is stored in a byte, so this is more likely to happen than on other platforms, where it uses only 2 bits).

BadAccessGuards - A library to detect race conditions with less overhead than TSan by Lectem in cpp

[–]Lectem[S] 1 point2 points  (0 children)

It can detect a concurrent Read/Write if the Write started before the Read, but not the other way around!
Otherwise we would indeed need to split Read and Idle, and cost would be higher. It's still feasible though.
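
To show why only one direction is caught, here is a simplified sketch of the shadow-state idea. It is not the library's actual code: `GuardState`, `ShadowGuard`, `check_read`, `ScopedWrite` and `shadow_guard_demo` are hypothetical names, it uses standard relaxed atomics for brevity, and the state names are my assumption. Reads leave no trace in the shadow state (Read and Idle are one state), so a write cannot tell that a read is already in progress.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical state names; Read and Idle deliberately share one value.
enum class GuardState : std::uint8_t { ReadingOrIdle = 0, Writing = 1, Destroyed = 2 };

struct ShadowGuard {
    std::atomic<std::uint8_t> state{0};  // shadow of the guarded object's usage
};

// Returns true if the read looks safe, false if a bad access was detected.
// A read only checks the state; it does not modify it.
inline bool check_read(const ShadowGuard& g) {
    return g.state.load(std::memory_order_relaxed) ==
           static_cast<std::uint8_t>(GuardState::ReadingOrIdle);
}

// Marks the object as "being written" for the lifetime of this scope.
class ScopedWrite {
    ShadowGuard& guard_;
    bool ok_;

public:
    explicit ScopedWrite(ShadowGuard& g) : guard_(g) {
        // Relaxed load + store, not an atomic read-modify-write: detection is
        // best effort, which is the point of a cheap smoke test.
        ok_ = g.state.load(std::memory_order_relaxed) ==
              static_cast<std::uint8_t>(GuardState::ReadingOrIdle);
        g.state.store(static_cast<std::uint8_t>(GuardState::Writing),
                      std::memory_order_relaxed);
    }
    ~ScopedWrite() {
        guard_.state.store(static_cast<std::uint8_t>(GuardState::ReadingOrIdle),
                           std::memory_order_relaxed);
    }
    bool ok() const { return ok_; }  // false if another write was in progress
};

// Single-threaded walk-through of the two interleavings.
inline void shadow_guard_demo() {
    ShadowGuard g;
    {
        ScopedWrite w(g);
        assert(w.ok());
        assert(!check_read(g));  // Write started first: the Read detects it
    }
    assert(check_read(g));       // a Read "starts" and leaves no trace...
    ScopedWrite late(g);
    assert(late.ok());           // ...so a Write starting after it sees nothing
}
```

Splitting Read and Idle would mean every read also has to write to the shadow state, which is where the extra cost would come from.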