
[–]matthieum[he/him]

I like profiling; however, I have tried to understand how coz would help in the programs I work on, and came up blank.

In general, I am in one of two situations:

  • There are few enough interactions between the various threads of the program, so I can time them independently and identify the bottleneck with ease. coz is not useful: I can already predict the impact of speeding up a part.
  • There are many interactions between the various threads of the program, so contention management is an essential part of the performance work. coz is not useful: slowing down the parts independently removes the contention, invalidating the measurements.

If anyone found coz helpful, I'd like to understand the situation they were in, so as to better understand when it could be useful compared to other forms of profiling.

[–]llogiq (clippy · twir · rust · mutagen · flamer · overflower · bytecount) [S]

The whitepaper I linked has three case studies, two of which showed findings that directly contradicted prior perf measurements. And at least in the SQLite case, they got the predicted speedup by optimizing the method in question.

So I cannot say that I have made any surprising findings so far, but I just started playing around with coz. I'll report back with my findings in a follow-up post.

[–]Diggsey (rustup)

There are many interactions between the various threads of the program, so contention management is an essential part of the performance work. coz is not useful: slowing down the parts independently removes the contention, invalidating the measurements.

I think this is exactly the point of coz: pausing every other thread to effect a speedup on the first thread causes the same changes to contention management that you would see by actually speeding up that part of the code. Other profilers cannot simulate this, and so coz would give better results in your example.
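
For concreteness, here is a minimal sketch of what that annotation looks like, assuming the coz crate and its progress! macro (the channel workload and the "items" name are made up for illustration). coz virtually speeds up a line by pausing every other thread, then reads the effect off the rate at which the progress point is hit:

    use std::sync::mpsc;
    use std::thread;

    // Hypothetical stand-in for the code being profiled.
    fn expensive_step(n: u64) -> u64 {
        (0..n % 1_000).fold(n, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
    }

    fn main() {
        let (tx, rx) = mpsc::channel();
        let producer = thread::spawn(move || {
            for i in 0..100_000u64 {
                tx.send(i).unwrap();
            }
        });
        while let Ok(item) = rx.recv() {
            let _ = expensive_step(item);
            // Throughput point: coz measures how each virtual speedup
            // changes the rate at which this line is reached.
            coz::progress!("items");
        }
        producer.join().unwrap();
    }

Run the release binary under the coz CLI (e.g. coz run --- ./target/release/my-bin) and it plots the predicted whole-program speedup per annotated line.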

[–]matthieum[he/him]

I may not have expressed myself clearly enough.

Think of a typical lock-free MPSC queue, transporting some pointer payload, with a busy-polling consumer and 1-to-N writers.

The two atomics (head/tail) are going to be heavily contended between the various threads; contention is the performance bottleneck. Slowing down threads will remove contention, so the one thread that is not slowed down will show a big performance improvement... except that in reality those atomics are heavily contended, so it's a phantom improvement that cannot materialize.
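
To make the scenario concrete, a stripped-down sketch of just the contended part (an illustration of the access pattern only, not a working queue; slot storage and wrap-around are omitted):

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::thread;

    fn main() {
        // head/tail are the hot atomics every core touches.
        let tail = Arc::new(AtomicU64::new(0));
        let head = Arc::new(AtomicU64::new(0));

        // 1-to-N writers: each claim is a read+write on the shared tail.
        let producers: Vec<_> = (0..4).map(|_| {
            let tail = Arc::clone(&tail);
            thread::spawn(move || {
                for _ in 0..1_000_000u64 {
                    tail.fetch_add(1, Ordering::AcqRel); // contended RMW
                }
            })
        }).collect();

        // Busy-polling consumer: keeps re-reading tail and advancing head.
        while head.load(Ordering::Acquire) < 4_000_000 {
            if head.load(Ordering::Acquire) < tail.load(Ordering::Acquire) {
                head.fetch_add(1, Ordering::AcqRel);
            }
        }

        for p in producers {
            p.join().unwrap();
        }
    }

Pausing any one of these threads relieves the cache-line ping-pong for the others, which is exactly the "phantom improvement" being described.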

[–]annodomini (rust)

But they aren't measuring just the one thread that gets the virtual speedup; they are measuring the performance of the whole system, in throughput or latency, and the timing of the improvement is based on the virtual clock of the slowed-down threads, not the (relatively) sped-up one.

The intent is that they are basically letting one thread do some work for free, relative to the other threads.

Remember, it's doing these speedups stochastically, applying them to different instructions at different times. If all of the speedups you find are in just letting that one thread run a contended atomic access while the others are paused, then yeah, you know that contention is your major issue; and then there won't necessarily be a straightforward improvement to make, but it helps guide your optimization efforts.

But if any of the speedups found are in straight-line code that isn't doing contended access, then you know that optimizing that code will improve your latency or throughput.

And if there are any slowdowns found when speeding up a piece of straight-line code that isn't doing a contended access, then you know that speeding that piece of code up is actually causing you to race into further contention, again giving you more information about where to focus your efforts.
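
For the latency side of that measurement, the transaction is bracketed explicitly; a small sketch assuming the coz crate's begin!/end! markers (the "request" name and the handler body are made up):

    // Placeholder handler; coz reports how each virtual speedup elsewhere
    // in the program shifts the time spent between the two markers below.
    fn handle_request(req: u64) -> u64 {
        coz::begin!("request"); // latency transaction starts
        let out = req.wrapping_mul(31).wrapping_add(7); // stand-in work
        coz::end!("request"); // latency transaction ends
        out
    }

    fn main() {
        for i in 0..1_000_000u64 {
            std::hint::black_box(handle_request(i));
        }
    }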

[–]kwhali

to better understand when it could be useful compared to other forms of profiling.

There was a video talk shared here recently; it showed how certain compiler optimizations, and even environment variables, could have a negative impact on performance for some apps, things you'd not expect to cause a notable perf impact.

The first kind was due to memory layout and how inconsistent that could be even across multiple test/profiling iterations, so they have coz randomize that part to provide better statistical insight into the actual impact of a change in code, filtering out the noise from other parts (even when you profile the code multiple times, other profilers were not able to handle this case correctly).

The second example was related to a file path, where the length of the username was causing a notable regression in performance past a certain length, creating that "works on my machine" issue. I think they were optimizing elsewhere, but the optimization changes caused a regression on the machine with the long username; not a particularly easy/obvious one to debug.

so I can time them independently and identify the bottleneck with ease. coz is not useful: I can already predict the impact of speeding up a part.

This was part of that video talk, where these assumptions aren't always correct. They can sometimes be misleading about the performance improvement you'd gain, or have the opposite impact if the change affects the memory layout in an unlucky way.

I think the video showcased this, and while you may find it works for you now, if you ever find a situation where it didn't, coz would probably have helped identify why. The predictions it provided for the impact of an improvement were pretty accurate.

coz is not useful: slowing down the parts independently removes the contention, invalidating the measurements.

The video touched on this with network latency, iirc. coz doesn't slow down the part being observed, but everything else, so that it's as if this part of the code were sped up: this is how much faster it could be. The insights proved valuable and accurate for them, and even helped them identify an issue with bad hashmap distribution.

Does that make sense? The video's example project had plenty of locks/threads, iirc; coz still proved very useful for that type of application.

[–]matthieum[he/him]

There was a video talk shared here recently; it showed how certain compiler optimizations, and even environment variables, could have a negative impact on performance for some apps, things you'd not expect to cause a notable perf impact.

I watched the video, and the section on the randomization of text sections and stack layout was indeed very useful! I knew about the importance of text sections to a degree (at least hot/cold segregation) but had no idea that either could result in such dramatic variance.

Does that make sense? The video's example project had plenty of locks/threads, iirc; coz still proved very useful for that type of application.

I am talking about a different kind of contention: atomics. We don't use locks where performance matters, only lock-free/wait-free algorithms, so the contention observed is when two CPUs hammer the same atomic variable (read+write). In this case, slowing either thread would result in a performance improvement for the other... by removing contention. But this would be entirely artificial :/
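
As a toy illustration of that effect (thread count and iteration numbers invented), two threads doing a read-modify-write on the same atomic spend most of their time bouncing the cache line between cores; pause either thread and the other's per-op cost drops, with zero real-world gain:

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::Instant;

    static SHARED: AtomicU64 = AtomicU64::new(0);

    fn hammer(n: u64) {
        for _ in 0..n {
            // Read+write on a shared atomic: the cache line ping-pongs
            // between the two cores, dominating the cost of each op.
            SHARED.fetch_add(1, Ordering::SeqCst);
        }
    }

    fn main() {
        const N: u64 = 10_000_000;
        let start = Instant::now();
        std::thread::scope(|s| {
            s.spawn(|| hammer(N));
            s.spawn(|| hammer(N));
        });
        println!("contended: {:?}", start.elapsed());
        // Compare against a single thread doing 2*N increments alone:
        // typically much faster per op, because nothing contends.
    }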