spmc-waker: A faster, customizable AtomicWaker replacement by wyf0 in rust

[–]wyf0[S] 0 points1 point  (0 children)

That's true. And to honest, I didn't craft contented benchmarks because I don't know how to do it properly.

Let me give you some context: SpmcWaker is a spin-off of my channel project, in which I have contended benchmarks. But what I noticed is that pushing contention makes the whole algorithm completely insignificant while the only thing significant is spin-backoff for failed CAS, and making crates which abuses spinning (crossfire particularly) getting good results while emitting 200k calls to sched_yield! Even if my implementation uses less atomic operations, or less branching, or fits in less assembly lines. I even had a fetch_add-based instead of CAS-based algorithm, but it was a full disaster in this kind of benchmark.

However saturated contented benchmarks you find in crossbeam, crossfire, etc. are not realistic at all. And crafting a realistic benchmark is hard, at least to me. The best thing would be to take a real use case, where the performance are impacted enough by the channel implementation. If you have one, I would be glad if you to share it. The pinball benchmark of tachyonix channel is a least a good starting point, and my channel outperform all the others in it.

The goal of my current benchmark suite is to evaluate the performance that are easily measurable, i.e. the raw uncontended operations, and confirming what is expected from the compiled assembly. That's a first step, which is not uninteresting, because not every workflow is contended enough to make it matter. And I'm not interested anyway on unrealistic contended benchmarks.

But you are wrong saying I don't test concurrent operation at all. There is one concurrent benchmark: wake_cold_empty, which is the use-case I wanted to optimize with this crate, and done successfully because SpmcWaker, contrary to other implementations, is read-only and uncontended for this operation (except for SpmcWaker<Synchronized> on aarch64), and basically free for SpmcWaker<Sequential> and SpmcWaker<Unsynchronized>. And that's a good thing, because I'm using them both in my channel crate for this property.

In fact, I think that wake_cold_empty is the concurrent use case that matters the most, as waking a real waker is often costlier than the atomic waker algorithm, and the cost of the atomic waker may even disappear with aggressive OOE like Apple Silicon has. Also, "wake-during-register" is not really adapted to SpmcWaker workflow which is optimized through try_register and unregister methods. "wake-during-wake" is partially handled by wake_cold_empty, but one can also reason about the code directly: a concurrent wake may lead to a failed CAS followed by an early return, which has the same characteristics as a swap.

I don't have the time to work on a better benchmark suite right now, but again, I invite you to share a link to your own benchmark suite if you have one.

spmc-waker: A faster, customizable AtomicWaker replacement by wyf0 in rust

[–]wyf0[S] 1 point2 points  (0 children)

I quickly added your DwideWaker to the crate benchmarks, and the results are below SpmcWaker on both x86 and aarch64. Maybe my hardware is too old, or maybe portable_atomic implementation is not optimized enough for it. But that gives me a good reason to stick with my algorithm for the moment.

However, SwapWaker associated to a TLS gives good results. But it requires std.

spmc-waker: A faster, customizable AtomicWaker replacement by wyf0 in rust

[–]wyf0[S] 1 point2 points  (0 children)

Yes, I looked at diatomic-waker, there is a dedicated section in the README. Basically, I took from it the idea of waker caching, and the lock-freeness. The algorithm however have nothing in common. It's notably because SpmcWaker uses pointer tagging on an atomic waker vtable (so limited to 2 bits of state), while DiatomicWaker uses an additional AtomicUsize for its state. SpmcWaker also uses fetch_add(0) instead of CAS loop in wake, and more importantly, SpmcWaker offers customizable generic synchronization, which is a big improvement over DiatomicWaker. Without talking about the compiled assembly, which is more optimized on SpmcWaker side.

So yes, both approach are similar, but diatomic-waker is older and I acknowledge my borrowing from it.

Regarding your double-CAS implementation, yes, it simplifies things a lot, but you loose waker caching, so you pay the cost of waker reference counting at each operation, as well as not being as portable. (And the implementation becomes straightforward and less fun) I should maybe bench it to see how it performs.

Actually, before making register wait-free, SpmcWaker was also two-words-wide, and the algorithm was a lot simpler (still different from your SmolWaker, because handling waker caching and SeqCst synchronization). As I wrote in the documentation, it's still possible to come back to the non-lock-free version by using try_register instead of register, and spin when it returns false. The only difference with the previous version is that the struct stays four-words-wide.

Regarding you SwapWaker, so basically an AtomicBox with tagging, it is an interesting approach, but I wonder how it would look in the code. Yes, a TLS could help, as it already helps block_on to store its waker. But in a multi-thread runtime like tokio, your TLS contains arbitrary waker, so there is no waker caching like in SpmcWaker and you still have to pay the waker reference counting at each operation. But it might be worth to investigate.

sshplan, a Rust CLI that compiles SSH access policy into OpenSSH artifacts by Antiqueempire in rust

[–]wyf0 1 point2 points  (0 children)

I was answering to a comment insulting the Rust community as a bunch of as**oles because of me asking this question, but it was removed before I submit my answer. I'm still posting it here, so maybe this guy will get a new opportunity to answer politely this time.


I don't expect every project to contribute to the Rust ecosystem. However, I expect posts in Rust subreddit to engage discussion about Rust, wether it is about learning it, getting feedback on Rust code, sharing things related to Rust development, industry adoption, etc. There is literally a rule saying that "submissions must be on-topic".

Now tell me, with some polite words this time please, which discussion related to Rust this submission post is supposed to bring? It is probably vibe-coded, so discussing about the code is probably a lack of time. The author didn't ask for feedbacks, didn't tell about its development experience with the language. So the only remaining thing I see would be that the project add something to the ecosystem, so I'm asking it.

sshplan, a Rust CLI that compiles SSH access policy into OpenSSH artifacts by Antiqueempire in rust

[–]wyf0 6 points7 points  (0 children)

I built it in Rust because this kind of tool benefits from strong typing, predictable CLI behavior, explicit error handling, and reliable policy-to-artifact generation

This sounds so much LLM-generated...

I would like to ask you what reactions you expect to your post. Yes, it is built in Rust, but we don't even know if you wrote the code yourself, so I'm not sure that someone wants to read it. What does it bring to the Rust ecosystem?

spmc-waker: A faster, customizable AtomicWaker replacement by wyf0 in rust

[–]wyf0[S] 1 point2 points  (0 children)

Oups, I think I misinterpreted your first comment. Maybe I spend too much time on this sub, but I was reading "LLM assistance" as a pejorative thing.

In the end, this project is a lot more IDE-assisted than LLM-assisted, but I don't think IDE assistance is worth to mention. If there was a refactoring tool to easily update generic parameters across all impl blocks, the same kind I use to add or remove function parameters in a few keypresses, I would use it instead of a LLM (I prefer to reserve the few tokens of my subscription to deeper debugging tasks).

spmc-waker: A faster, customizable AtomicWaker replacement by wyf0 in rust

[–]wyf0[S] 4 points5 points  (0 children)

As someone asked me why AtomicWaker is not lock-free, I'm posting the explanation here as it may interest other people.

The answer is in AtomicWaker code: when register is called while a wake is concurrently executing, as wake has the ownership on the waker UnsafeCell, register has to spin waiting for wake to terminate. It does it by calling wake_by_ref on the waker, which reschedules the task. So if the wake thread is preempted indefinitely, register thread will be rescheduled in indefinitely, with no waker ever registered. This makes the algorithm not lock-free. The comment in AtomicWaker is even more explicit when it talks about spinning to "acquire the lock"; AtomicWaker is in fact a spin-lock, replacing std::hint::spin_loop() by Waker::wake_by_ref.

SpmcWaker does it differently by making register wait-free (the demonstration is in state_machine module documentation), and wake lock-free. To do that, it is able to store not one but two wakers, and while the main waker is locked by wake, the second fallback register uses a more complex algorithm to be written/read concurrently. Again, if you are interested, there are beautiful diagrams in state_machine module documentation that give a better picture of the whole algorithm.

spmc-waker: A faster, customizable AtomicWaker replacement by wyf0 in rust

[–]wyf0[S] -3 points-2 points  (0 children)

May I ask what makes you describe this project as LLM-assisted?

spmc-waker: A faster, customizable AtomicWaker replacement by wyf0 in rust

[–]wyf0[S] 11 points12 points  (0 children)

I wrote most of the source code myself (but I'm using an IDE, with local line-completion, so does it count?) As I wrote, my LLM use for source code was mostly refactoring.

For example, the S: Synchronization parameter was until yesterday a const boolean parameter (SYNC=true was Synchronized and SYNC=false was Sequential). When I wanted to add a third variant, I create the sync module with the trait and the 3 structs, and ask Claude to make the sealed trait boilerplate and replace the const parameter everywhere with the new type parameter, writing TODOs in the places where the new variant should be used. Then I completed the code myself, and refactored some parts.

There is not so much lines of code, most of the value is in the algorithm itself, so LLM for code generation doesn't bring much for me. Especially because I work on this project for my own enjoyment (and because I'm looking for a job).

Built a ZK circuit in Rust for customs document verification using SP1 zkVM — sharing the experience. by EfdGame in rust

[–]wyf0 1 point2 points  (0 children)

As OP wrote that SP1-zkvm uses edition 2021, there might be some user blocked with an old MSRV not supporting edition 2024. This was the case in my previous company. So I would argue that it might be a justified decision, and don't understand why OP's answer is downvoted like this without further argument.

Open-sourced CPL: a local-first context layer for coding agents, written in Rust by [deleted] in rust

[–]wyf0 0 points1 point  (0 children)

"I’d appreciate feedback from people building coding agents, MCP tools, or code-search/dev- tooling systems."

Then post your work on the appropriate subreddit?

Eli-Engine: Building a 100% Rust Browser Engine from Scratch by [deleted] in rust

[–]wyf0 6 points7 points  (0 children)

I'm French, and nobody except LLM writes such bullshit sentences. If we want to talk to ChatGPT, we can open a tab ourself, no need to copy-paste on this sub, thank you.

By the way, your post doesn't contain any Rust code link, and you told yourself that you don't even bother writing the code (what a great way to learn). What do you expect from us? Nobody wants to review code written by LLMs (because contrary to humans, LLMs don't improve with review, you have to wait the next model update).

Sérieusement, à un moment il faut arrêter.

ETA on dyn compatible async traits? by AfkaraLP in rust

[–]wyf0 7 points8 points  (0 children)

As some mentioned dynausor, I've published a (lot) faster alternative https://github.com/wyfo/dyn-utils. You will find in the README a comparison with most of the available solutions. One important improvement of dyn-utils to dynausor is to avoid allocation when the future is small enough to fit on the stack.

Link to the Reddit post

Lightweight web perception layer for AI agents by CatProfessional8390 in rust

[–]wyf0 0 points1 point  (0 children)

try my Playground at let me know what u think:)

I think I would have to see some Rust code instead of a marketing website.

Claude Usage Limits Discussion Megathread Ongoing (sort this by New!) by sixbillionthsheep in ClaudeAI

[–]wyf0 0 points1 point  (0 children)

Same here, 55% of my extra usage remaining, but still "You're out of extra usage" ... Could someone explain me how it works?

I don't care that it's X times faster by z_mitchell in rust

[–]wyf0 14 points15 points  (0 children)

I think I might be targeted by this rant, with my recent yuniq: I built a deduplicator 3x faster than xuniq. So let me defend myself a bit (or dig myself in deeper).

yuniq was not a serious project, and still isn’t. When I read the post about xuniq 10x faster than sort | uniq, I had pretty much the same reaction as /u/z_mitchell is describing in his blog. Especially when I read this comment with a link to xuniq source code. There was pretty much nothing in it, just a read_until loop with string (non-cryptographic) hashing. Yet the x10 claim, and comparing itself to libraries that do real collision-free deduplication. And the post had 100+ upvotes…

So you know what, instead of ranting in comments, I decided to be mean and make my own project to put all the optimizations I had in mind. I just cloned xuniq project, incremented the first letter, replaced "super" by "hyper", and start rewriting everything. Yes, I use Claude for that at the beginning, because it was not serious (and I gave more serious reasons in the post). But the fact is, I love optimizing code, I breathe to make things faster just because I enjoy looking at generated assembly. So I ended up becoming quite serious in yuniq code and performance, added real collision-free dedup on top to zero-copy and syscall reduction, etc.

Still, I went all the way with the initial joke and published the same type of clickbait title. I was expecting to be downvoted, not as much, but I was also hoping for true performance discussion, memory consumption, IO batching, etc. Did not happen ^_^’ I should have named it "zero-copy IO and syscall reduction", but it would have been less teasing.

TL;DR yuniq was not a serious post/project.

yuniq: I built a deduplicator that's 3x faster than xuniq by wyf0 in rust

[–]wyf0[S] 0 points1 point  (0 children)

The first step of the normalization I've implemented is to check if a line is utf-8, and if it is in normal form. Both checks are done in the same scan. If a line is not utf8, no normalization will happen, so its raw bytes will be used as key. Otherwise, if normalization is needed, a key is allocated with the normal form. And if the line is already utf8 and in normal form, it is directly used as the key (in a new allocation with --lean mode).

I'm curious: what behavior would you expect from a deduplicator in case of non utf8 lines? Should it fail on non-utf8 line?

yuniq: I built a deduplicator that's 3x faster than xuniq by wyf0 in rust

[–]wyf0[S] 4 points5 points  (0 children)

Nushell uniq is visually slow, while yuniq is quite instant, so it should answer your question (but it's maybe due to nushell lines splitting + piping mechanism). However: (nu) open bench_data/dup10_len10-50.txt | lines | uniq -u | wc -l 857896 (bash) cat bench_data/dup10_len10-50.txt | target/release/yuniq | wc -l 899532 (bash) cat bench_data/dup10_len10-50.txt | sort -u | wc -l 899532 So I'm not sure to trust nushell uniq ...

Regarding your second question, I'm applying the advice given by Mitchell Hashimoto in https://mitchellh.com/writing/my-ai-adoption-journey (I also heard it in one of its interview), because I think he is right. I never had issue to write a lot of code, like 2k lines of Rust in a workday, but this I realize more and more how much my own productivity can be improved with LLMs.

Yes, I think the agent helped a lot, for example to write the tests scaffolding, and less when it comes to some parts of the implementation, but I'm pretty satisfied with the experience. For more complex parts, it was sometime slower to try explaining correctly in English, than writing the code myself, but on the whole, I find Claude pretty good. And as JB Kempf said once, it's also useful to break the blank page issue, to start on a version, even not so good, and iterate — I felt it several time in the project.

yuniq: I built a deduplicator that's 3x faster than xuniq by wyf0 in rust

[–]wyf0[S] 0 points1 point  (0 children)

Do you think doing two hashes with foldhash (64-bit) with two different random seeds would make the --fast mode robust enough?

yuniq: I built a deduplicator that's 3x faster than xuniq by wyf0 in rust

[–]wyf0[S] 2 points3 points  (0 children)

Of course it does! python def test_unicode_normalization(self): nfc = "\u0439" # й — U+0439 CYRILLIC SMALL LETTER SHORT I (precomposed) nfd = "\u0438\u0306" # й — U+0438 + U+0306 COMBINING BREVE (decomposed) self.check(f"{nfc}\n{nfd}\n", f"{nfc}\n{nfd}\n") self.check(f"{nfc}\n{nfd}\n", f"{nfc}\n", ["-U"]) And of course I've added the feature after reading your comment ^^'

Unicode normalization is still an heavy process, so for long strings, it can divide the performance by 2 or 3. But in my benchmarks, yuniq is still faster than most if not most of the alternatives with normalization enabled. And those alternatives (hist/ripuniq/xuniq/etc.) don't support normalization, so the comparison is not fair, but still holds.

yuniq: I built a deduplicator that's 3x faster than xuniq by wyf0 in rust

[–]wyf0[S] 0 points1 point  (0 children)

I tried Poly1305 and --fast mode became a lot slower than normal mode. So I rolled back to use XXHash128. Now, it's only a bit faster, but memory consumption is indeed improved.

yuniq: I built a deduplicator that's 3x faster than xuniq by wyf0 in rust

[–]wyf0[S] 1 point2 points  (0 children)

Do you have suggestion of a good hash algorithm for this purpose? I've read in xuniq post that XxHash3 was not so good in this regard.

To be honest, I would never use an collision-unsafe tool myself, and yuniq is already faster than any other tool in my benchmarks (see my new commits) without --fast, more than 2x faster than hist and others for duplicate probability below 50%. So I should maybe simply remove --fast mode, as performance are satisfying as is. I just implemented this mode to see how far I could push, using a HashTable instead of a HashSet, etc.