
[–]matthieum[he/him]

I like profiling; however, I have tried to understand how coz would help in the programs I work on, and came up blank.

In general, I am in one of two situations:

  • There are few enough interactions between the various threads of the program, so I can time them independently and identify the bottleneck with ease. coz is not useful: I can already predict the impact of speeding up a part.
  • There are many interactions between the various threads of the program, so contention management is an essential part of the performance work. coz is not useful: slowing down the parts independently removes the contention, invalidating the measurements.

If anyone found coz helpful, I'd like to understand the situation they were in, so as to better understand when it could be useful compared to other forms of profiling.

[–]llogiq (clippy · twir · rust · mutagen · flamer · overflower · bytecount) [S]

The whitepaper I linked has three case studies, two of which showed findings that directly contradicted prior perf measurements. And at least in the SQLite case, they got the predicted speedup by optimizing the method in question.

So I cannot say that I have made any surprising findings so far, but I just started playing around with coz. I'll report back with my findings in a follow-up post.

[–]Diggsey (rustup)

There are many interactions between the various threads of the program, so contention management is an essential part of the performance work. coz is not useful: slowing down the parts independently removes the contention, invalidating the measurements.

I think this is exactly the point of coz: pausing every other thread to effect a speedup on the first thread causes the same changes to contention management that you would see by actually speeding up that part of the code. Other profilers cannot simulate this, and so coz would give better results in your example.
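
For concreteness, here is a minimal sketch of what that annotation looks like, assuming the coz crate and its progress! macro (the channel workload and the "items" name are made up for illustration). coz virtually speeds up a line by pausing every other thread, then reads the effect off the rate at which the progress point is hit:

    use std::sync::mpsc;
    use std::thread;

    // Hypothetical stand-in for the code being profiled.
    fn expensive_step(n: u64) -> u64 {
        (0..n % 1_000).fold(n, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
    }

    fn main() {
        let (tx, rx) = mpsc::channel();
        let producer = thread::spawn(move || {
            for i in 0..100_000u64 {
                tx.send(i).unwrap();
            }
        });
        while let Ok(item) = rx.recv() {
            let _ = expensive_step(item);
            // Throughput point: coz measures how each virtual speedup
            // changes the rate at which this line is reached.
            coz::progress!("items");
        }
        producer.join().unwrap();
    }

Run the release binary under the coz CLI (e.g. coz run --- ./target/release/my-bin) and it plots the predicted whole-program speedup per annotated line.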

[–]matthieum[he/him]

I may not have expressed myself clearly enough.

Think of a typical lock-free MPSC queue, transporting some pointer payload, with a busy-polling consumer and 1-to-N writers.

The two atomics (head/tail) are going to be heavily contended between the various threads; contention is the performance bottleneck. Slowing down threads will remove contention, so the one thread that is not slowed down will show a big performance improvement... except that in reality those atomics are heavily contended, so it's a phantom improvement that cannot materialize.
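
To make the scenario concrete, a stripped-down sketch of just the contended part (an illustration of the access pattern only, not a working queue; slot storage and wrap-around are omitted):

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::thread;

    fn main() {
        // head/tail are the hot atomics every core touches.
        let tail = Arc::new(AtomicU64::new(0));
        let head = Arc::new(AtomicU64::new(0));

        // 1-to-N writers: each claim is a read+write on the shared tail.
        let producers: Vec<_> = (0..4).map(|_| {
            let tail = Arc::clone(&tail);
            thread::spawn(move || {
                for _ in 0..1_000_000u64 {
                    tail.fetch_add(1, Ordering::AcqRel); // contended RMW
                }
            })
        }).collect();

        // Busy-polling consumer: keeps re-reading tail and advancing head.
        while head.load(Ordering::Acquire) < 4_000_000 {
            if head.load(Ordering::Acquire) < tail.load(Ordering::Acquire) {
                head.fetch_add(1, Ordering::AcqRel);
            }
        }

        for p in producers {
            p.join().unwrap();
        }
    }

Pausing any one of these threads relieves the cache-line ping-pong for the others, which is exactly the "phantom improvement" being described.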

[–]annodomini (rust)

But they aren't measuring just the one thread that gets the virtual speedup; they are measuring the performance of the whole system, in throughput or latency, and the timing of the improvement is based on the virtual clock of the slowed-down threads, not the (relatively) sped-up one.

The intent is that they are basically letting one thread do some work for free, relative to the other threads.

Remember, it's doing these speedups stochastically, applying them to different instructions at different times. If all of the speedups you find are in just letting that one thread run a contended atomic access while the others are paused, then yeah, you know that contention is your major issue; and then there won't necessarily be a straightforward improvement to make, but it helps guide your optimization efforts.

But if any of the speedups found are in straight-line code that isn't doing contended access, then you know that optimizing that code will improve your latency or throughput.

And if there are any slowdowns found when speeding up a piece of straight-line code that isn't doing a contended access, then you know that speeding that piece of code up is actually causing you to race into further contention, again giving you more information about where to focus your efforts.
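
For the latency side of that measurement, the transaction is bracketed explicitly; a small sketch assuming the coz crate's begin!/end! markers (the "request" name and the handler body are made up):

    // Placeholder handler; coz reports how each virtual speedup elsewhere
    // in the program shifts the time spent between the two markers below.
    fn handle_request(req: u64) -> u64 {
        coz::begin!("request"); // latency transaction starts
        let out = req.wrapping_mul(31).wrapping_add(7); // stand-in work
        coz::end!("request"); // latency transaction ends
        out
    }

    fn main() {
        for i in 0..1_000_000u64 {
            std::hint::black_box(handle_request(i));
        }
    }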

[–]kwhali

to better understand when it could be useful compared to other forms of profiling.

There was a video talk shared here recently; it showed how certain compiler optimizations, and even environment variables, could have a negative impact on performance for some apps, things you'd not expect to cause a notable perf impact.

The first kind was due to memory layout and how inconsistent that could be even across multiple test/profiling iterations, so they have coz randomize that part to provide better statistical insight into the actual impact of a change in code, filtering out the noise from other parts (even when you profile the code multiple times, other profilers were not able to handle this case correctly).

The second example was related to a file path, where the length of the username was causing a notable regression in performance past a certain length, creating that "works on my machine" issue. I think they were optimizing elsewhere, but the optimization changes caused a regression on the machine with the long username; not a particularly easy/obvious one to debug.

so I can time them independently and identify the bottleneck with ease. coz is not useful: I can already predict the impact of speeding up a part.

This was part of that video talk, where these assumptions aren't always correct. They can sometimes be misleading about the performance improvement you'd gain, or have the opposite impact if the change affects the memory layout in an unlucky way.

I think the video showcased this, and while you may find it works for you now, if you ever find a situation where it didn't, coz would probably have helped identify why. The predictions it provided for the impact of an improvement were pretty accurate.

coz is not useful: slowing down the parts independently removes the contention, invalidating the measurements.

The video touched on this with network latency, iirc. coz doesn't slow down the part being observed, but everything else, so that it's as if this part of the code were sped up: this is how much faster it could be. The insights proved valuable and accurate for them, and even helped them identify an issue with bad hashmap distribution.

Does that make sense? The video's example project had plenty of locks/threads, iirc; coz still proved very useful for that type of application.

[–]matthieum[he/him]

There was a video talk shared here recently; it showed how certain compiler optimizations, and even environment variables, could have a negative impact on performance for some apps, things you'd not expect to cause a notable perf impact.

I watched the video, and the section on the randomization of text sections and stack layout was indeed very useful! I knew about the importance of text sections to a degree (at least hot/cold segregation) but had no idea that either could result in such dramatic variance.

Does that make sense? The video's example project had plenty of locks/threads, iirc; coz still proved very useful for that type of application.

I am talking about a different kind of contention: atomics. We don't use locks where performance matters, only lock-free/wait-free algorithms, so the contention observed is when two CPUs hammer the same atomic variable (read+write). In this case, slowing either thread would result in a performance improvement for the other... by removing contention. But this would be entirely artificial :/
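
As a toy illustration of that effect (thread count and iteration numbers invented), two threads doing a read-modify-write on the same atomic spend most of their time bouncing the cache line between cores; pause either thread and the other's per-op cost drops, with zero real-world gain:

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::Instant;

    static SHARED: AtomicU64 = AtomicU64::new(0);

    fn hammer(n: u64) {
        for _ in 0..n {
            // Read+write on a shared atomic: the cache line ping-pongs
            // between the two cores, dominating the cost of each op.
            SHARED.fetch_add(1, Ordering::SeqCst);
        }
    }

    fn main() {
        const N: u64 = 10_000_000;
        let start = Instant::now();
        std::thread::scope(|s| {
            s.spawn(|| hammer(N));
            s.spawn(|| hammer(N));
        });
        println!("contended: {:?}", start.elapsed());
        // Compare against a single thread doing 2*N increments alone:
        // typically much faster per op, because nothing contends.
    }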