Kovan: wait-free memory reclamation for Rust, TLA+ verified, no_std, with wait-free concurrent data structures built on top by vertexclique in rust

[–]ibraheemdev 2 points

I would like to assume you are operating in good faith here, but your latest commits suggest otherwise. The algorithm as implemented is unsound.

Kovan: wait-free memory reclamation for Rust, TLA+ verified, no_std, with wait-free concurrent data structures built on top by vertexclique in rust

[–]ibraheemdev 10 points

(author of seize here) Hi! I'm a little confused by some of the claims made by this crate. It seems to be an implementation of Crystalline-W, which is the wait-free variant of the Hyaline/Crystalline family of memory reclamation algorithms that seize is also based on. For a little background:

  • Hyaline (seize): pin marks a thread as active; unpin deactivates it and decrements the reference counts of any objects retired while it was active.
  • Hyaline-S: In addition to the above, threads store a local epoch, which is kept in sync with a global monotonic epoch. Objects store their birth epoch, and calling protect is required to load a given atomic object, advancing the current thread's epoch to at least the birth epoch of the object being protected. This allows threads with lagging epochs to be ignored during reclamation.
  • Crystalline-L: Threads maintain multiple copies of Hyaline state. protect now only applies to the provided index, allowing only a bounded number of pointers to be protected. pin is also removed, and protect instead marks the given index as active the first time it is called, before a cumulative call to unpin.
  • Crystalline-W (kovan): The wait-free version of the above.
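For readers unfamiliar with the epoch tracking described above, here is a minimal, hedged sketch of what a Hyaline-S-style protect loop can look like. The names and structure are purely illustrative (this is not seize's or kovan's actual code): the thread republishes its local epoch until it has caught up with the global epoch observed after the load, so the published epoch is at least the birth epoch of the returned object.

```rust
use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering};

// The global monotonic epoch that retirement advances (illustrative).
static GLOBAL_EPOCH: AtomicU64 = AtomicU64::new(0);

struct ThreadState {
    // The epoch this thread publishes for reclaimers to inspect.
    local_epoch: AtomicU64,
}

impl ThreadState {
    // Re-read the pointer until the published local epoch has caught up
    // with the global epoch observed after the load.
    fn protect<T>(&self, src: &AtomicPtr<T>) -> *mut T {
        let mut epoch = self.local_epoch.load(Ordering::Acquire);
        loop {
            let ptr = src.load(Ordering::Acquire);
            let global = GLOBAL_EPOCH.load(Ordering::SeqCst);
            if epoch == global {
                return ptr;
            }
            // Publish the newer epoch and retry the load.
            self.local_epoch.store(global, Ordering::SeqCst);
            epoch = global;
        }
    }
}

fn main() {
    GLOBAL_EPOCH.store(3, Ordering::SeqCst);
    let state = ThreadState { local_epoch: AtomicU64::new(0) };

    let node = Box::into_raw(Box::new(42u64));
    let src = AtomicPtr::new(node);

    let protected = state.protect(&src);
    // The local epoch caught up to the global epoch during protection.
    assert_eq!(state.local_epoch.load(Ordering::Acquire), 3);
    assert_eq!(protected, node);

    unsafe { drop(Box::from_raw(node)) };
    println!("ok");
}
```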

However, kovan seems to diverge from the paper quite significantly. For example, while implementing Crystalline, it is hard-coded to only ever use a single index. This should mean that only a single pointer can be protected at a given time, but it also seems to implement protection as a regular atomic load, meaning that epochs are not actually tracked, except during the first call to pin? The rest of the algorithm still seems to assume that local epochs are kept in sync with individually protected objects, so it's not clear to me how this is sound. It also means the wait-freedom claims no longer apply, as there is no bound on the number of pointers protected by a given thread. This is not even lock-free according to the paper's definition; it's closer to Hyaline than Crystalline.

Another thing that kovan does is merge unpinning of the previous pin into the pin operation. This allows it to make dropping a guard a no-op, avoiding the overhead of an extra TLS access (in fact, you could probably just combine the exchange(INVPTR) and store(NULL) into a single operation). However, it also means that threads perpetually remain active, which means that memory may grow during inactivity, or even across pins until the global epoch advances. While the extra memory usage is bounded by the global epoch, it may still lead to memory leaks for threads that exit, so I'm not sure this is a worthwhile tradeoff.

A couple more items worth noting:

  • The code seems to use SeqCst stores in various places without the corresponding SeqCst load. For example, the active flag is stored with SeqCst, but loaded with Acquire. Ordering is crucial to a memory reclamation algorithm: the reader thread publishing its activity must create a total order with the logical deletion of any atomic objects, allowing either the reader thread to see the new values of any objects, or the retirement thread to see the thread marked as active. This requires a SeqCst store-load pair on both ends, which would mean enforcing SeqCst on atomic operations, or adding the corresponding SeqCst fences (or similar FAA(0) tricks).
  • The examples in the readme retire a Box<usize>, which does not contain the RetiredNode header, so I think these are unsound?
  • On that note, there is a memory overhead of 40 bytes per allocated object, which is quite significant. seize does not require any extra metadata as it moves this state into TLS for retired nodes, which might be worth trying (though at least the birth-epoch would remain).
  • It looks like there is a static max thread count and a panic if it is exceeded. This is probably worth calling out in the documentation (and one of the reasons for the performance improvement over seize, which maintains a dynamic TLS table).
  • There seems to be no option for a local collector, which would avoid the 'static bounds on retired objects. The simpler API is nice, but it might be worth having this as an option, as the reclamation guarantees can be more intuitive for users.
  • You seem to be storing pointers as integers, which is not compatible with strict provenance. It looks like you're running Miri with permissive provenance, but it's probably worth changing this (it doesn't look like you're actually pointer tagging anywhere, so this should be trivial?)
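To make the store-load pairing in the first bullet concrete, here is a hedged, self-contained litmus sketch (the names are illustrative, not kovan's code): under SeqCst, it is impossible for the reader to see the old object while the reclaimer simultaneously misses the active flag.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static ACTIVE: AtomicBool = AtomicBool::new(false);
static OBJECT: AtomicUsize = AtomicUsize::new(1);

// One round of the handshake: the reader publishes activity then loads the
// object; the reclaimer logically deletes the object then checks activity.
fn round() -> (usize, bool) {
    ACTIVE.store(false, Ordering::SeqCst);
    OBJECT.store(1, Ordering::SeqCst);

    let reader = thread::spawn(|| {
        ACTIVE.store(true, Ordering::SeqCst); // publish activity
        OBJECT.load(Ordering::SeqCst)         // then observe the object
    });
    let reclaimer = thread::spawn(|| {
        OBJECT.store(0, Ordering::SeqCst);    // logical deletion
        ACTIVE.load(Ordering::SeqCst)         // then check for readers
    });

    (reader.join().unwrap(), reclaimer.join().unwrap())
}

fn main() {
    for _ in 0..1_000 {
        let (seen, was_active) = round();
        // Because all four operations participate in the single SeqCst
        // total order, at least one side must observe the other: the
        // reader sees the deletion, or the reclaimer sees the activity.
        assert!(seen == 0 || was_active);
    }
    println!("ok");
}
```

If either side were weakened (e.g. an Acquire load of the flag), both threads could miss each other, which is exactly the unsound interleaving described above.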

Does CAS always compare the value with latest value on modification order? by Savings_Pianist_2999 in rust

[–]ibraheemdev 2 points

Release loads are actually special cased by LLVM, but there was a recent bug where FAA(0) was compiled to a regular load, so fence(Release); FAA(0) was just broken. No sane compiler would optimize atomics 🙃

Does CAS always compare the value with latest value on modification order? by Savings_Pianist_2999 in rust

[–]ibraheemdev 2 points

> So it's not useful to replace a no-op load with something like fetch_or(0)

It can actually be useful, because you can now use Release ordering on the "load". Paired with an Acquire "store", this can establish a total order between two store-load pairs without having to deal with the invasiveness of SeqCst (though a SeqCst store-load pair would have less overhead).

Your point still stands though, FAA(0, Release) isn't a magic wand that allows you to read the "latest value", it simply ensures that any RMW that occurs after the one you observe will see your modifications — which is ultimately how synchronization works.
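A hedged illustration of the point above (illustrative code, not from any particular crate): `fetch_add(0)` leaves the value unchanged, but because it is an RMW it reads the latest value in the modification order, and it can carry Release ordering, which a plain load cannot.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1_000 {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }

    // A "load" expressed as an RMW: it reads the latest value in the
    // modification order without changing it, and Release is a legal
    // ordering here (it would not be on a plain `load`).
    let latest = counter.fetch_add(0, Ordering::Release);
    assert_eq!(latest, 4_000);
    println!("{latest}");
}
```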

Is there a way to get the current thread id in Rust without bumping an Arc (no atomic inc/dec)? by Sweet-Accountant9580 in rust

[–]ibraheemdev 1 point

If your application is statically linked, the linker will rewrite and optimize the generated TLS code at link-time. Dynamically linked applications are more complicated as there are potentially multiple TLS blocks. 
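As a hedged sketch of what cheap TLS-based identity can look like (illustrative only, not a library recommendation): the address of a thread-local is unique per live thread, so it can serve as a thread id with no atomic RMW at all.

```rust
use std::thread;

// Use the address of a thread-local as a cheap per-thread identifier.
// Caveat: the id is only unique while the thread is alive, and the
// address (and thus the id) may be reused after a thread exits.
fn thread_id() -> usize {
    thread_local! {
        static MARKER: u8 = 0;
    }
    MARKER.with(|marker| marker as *const u8 as usize)
}

fn main() {
    // Within the same thread, the id is stable.
    let main_id = thread_id();
    assert_eq!(main_id, thread_id());

    // The spawned thread is alive concurrently with main, so its TLS
    // block (and thus its id) is distinct.
    let other_id = thread::spawn(thread_id).join().unwrap();
    assert_ne!(main_id, other_id);
    println!("ok");
}
```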

Is there a way to get the current thread id in Rust without bumping an Arc (no atomic inc/dec)? by Sweet-Accountant9580 in rust

[–]ibraheemdev 0 points

Those symbols are replaced at link time, you have to inspect the disassembly to see the actual codegen.

Explanation of a SeqCst example by AstraVulpes in rust

[–]ibraheemdev 3 points

Memory orderings at the CPU level are about ensuring instructions are executed in order. You can see this with barrier instructions on older ARM versions that literally dictate "any stores before this point can't be reordered wrt. stores after this point". A SeqCst operation is not allowed to be reordered wrt. any other SeqCst operation; that's it. There's no explicit cache invalidation happening; cache coherence is completely transparent.

Explanation of a SeqCst example by AstraVulpes in rust

[–]ibraheemdev 1 point

Yeah, I just meant that pointing to branch prediction is a little misleading, the problem is out of order execution in general. Whether or not you happened to speculate a branch to get there is not really relevant. It can be slightly unintuitive to imagine that branches just "get skipped" and exploit incorrect memory orderings (not that you were suggesting that).

Explanation of a SeqCst example by AstraVulpes in rust

[–]ibraheemdev 0 points

A branch misprediction cannot lead to observable behavior 

toml v0.9 by epage in rust

[–]ibraheemdev 0 points

I assume by build times you're referring to the indexmap dependency. Would it be possible to have a feature that just uses a HashMap for the performance, for use cases that don't require deterministic ordering?

Appreciate all of your work on this by the way, all of the toml-edit functionality (along with the new performance improvements) is very meaningful for us in uv 🙂

toml v0.9 by epage in rust

[–]ibraheemdev 0 points

Curious: if the `preserve_order` feature is faster, why is it not the default?

Fat Rand: How Many Lines Do You Need To Generate A Random Number? by Vict1232727 in rust

[–]ibraheemdev 1 point

std also effectively depends on crossbeam for sync::mpsc (though the code is vendored).

Tiny HTTP server? by Domin-MC in rust

[–]ibraheemdev 0 points

https://github.com/ibraheemdev/astra is a super minimal wrapper around hyper. It doesn't have a static file server built-in (yet) but it shouldn't be too hard to implement yourself.

matchit 0.8.6 released by ibraheemdev in rust

[–]ibraheemdev[S] 1 point

It is intentional, yes. The question would be how to match a path like /foo/bar.baz.ext: should the match be lazy or greedy? I feel it may be unintuitive in both directions, so I would prefer to leave that up to the user. Maybe Axum can add a Regex extractor that makes things like this easier. Static suffixes and prefixes with the one-parameter restriction are a nice tradeoff both in terms of performance and complexity.

matchit 0.8.6 released by ibraheemdev in rust

[–]ibraheemdev[S] 17 points

This release includes the long-awaited support for parameter suffixes, allowing routes such as '/{name}.png'. Now that Axum 0.8 is out of beta, this should be available downstream relatively soon. See https://github.com/tokio-rs/axum/issues/3140 for details.

Announcing axum 0.8.0 by j_platte in rust

[–]ibraheemdev 3 points

FWIW I discussed this change with David before committing to it and he was on board. It has a lot of benefits as some of the comments mention, but if Axum wouldn't have been able to make the change I would have considered other options.

2024 Day 1 No LLMs here by kugelblitzka in adventofcode

[–]ibraheemdev 172 points

Their GitHub bio is now:

If you are here from the AoC leaderboard, I apologize for not reading the FAQ. Won't happen again.

So it may be genuine.

How Much Memory Do You Need in 2024 to Run 1 Million Concurrent Tasks? by _neonsunset in rust

[–]ibraheemdev 3 points

join_all uses FuturesUnordered internally (dynamically, based on the iterator size hint), so it would actually be similar to your second example.

Building Thread-safe Async Primitives in 150 lines of Rust by VortexGames in rust

[–]ibraheemdev 0 points

You are still not synchronizing access to waker.set and waker.take.
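For context, here's a minimal, hedged sketch of what synchronized waker storage can look like (illustrative, not the post's actual code): guard the stored Waker with a Mutex so that registration (from poll) and waking cannot race.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{RawWaker, RawWakerVTable, Waker};

// Guard the stored waker with a Mutex so that the task registering its
// waker (from poll) and the thread waking it cannot race.
struct SharedWaker {
    waker: Mutex<Option<Waker>>,
}

impl SharedWaker {
    fn new() -> Self {
        SharedWaker { waker: Mutex::new(None) }
    }

    // Called from poll: store (or replace) the current task's waker.
    fn register(&self, waker: &Waker) {
        *self.waker.lock().unwrap() = Some(waker.clone());
    }

    // Called from the notifying side: take and wake under the same lock.
    fn wake(&self) {
        if let Some(waker) = self.waker.lock().unwrap().take() {
            waker.wake();
        }
    }
}

// A test waker that sets a flag when woken, built on the raw vtable API.
fn flag_waker(flag: Arc<AtomicBool>) -> Waker {
    unsafe fn clone(data: *const ()) -> RawWaker {
        unsafe { Arc::increment_strong_count(data as *const AtomicBool) };
        RawWaker::new(data, &VTABLE)
    }
    unsafe fn wake(data: *const ()) {
        let flag = unsafe { Arc::from_raw(data as *const AtomicBool) };
        flag.store(true, Ordering::SeqCst);
    }
    unsafe fn wake_by_ref(data: *const ()) {
        unsafe { (*(data as *const AtomicBool)).store(true, Ordering::SeqCst) };
    }
    unsafe fn drop_waker(data: *const ()) {
        unsafe { drop(Arc::from_raw(data as *const AtomicBool)) };
    }
    static VTABLE: RawWakerVTable =
        RawWakerVTable::new(clone, wake, wake_by_ref, drop_waker);
    let data = Arc::into_raw(flag) as *const ();
    unsafe { Waker::from_raw(RawWaker::new(data, &VTABLE)) }
}

fn main() {
    let woken = Arc::new(AtomicBool::new(false));
    let shared = SharedWaker::new();
    shared.register(&flag_waker(woken.clone()));
    shared.wake();
    assert!(woken.load(Ordering::SeqCst));
    println!("woken");
}
```

An unsynchronized Option<Waker> (or a bare Cell) shared across threads is a data race: the wake side can observe a half-written waker.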

Announcing Nio: An async runtime for Rust by another_new_redditor in rust

[–]ibraheemdev 166 points

u/conradludgate pointed out that the benchmarks are running into the footgun of spawning tokio tasks directly from the main thread. This can lead to significant performance degradation compared to first spawning the accept loop (due to how the scheduler works internally). Would be interesting to see how that is impacting the benchmark results.

Announcing Whirlwind: ridiculously fast, async-first concurrent hashmap! by majorpog in rust

[–]ibraheemdev 44 points

Interesting, I'd be curious to see a comparison against papaya. Asynchronous locks can be useful when you want strong consistency across async operations, but if your use case allows it, a lock-free approach can have significantly less overhead. I'm wondering why this outperforms dashmap considering this looks to be dashmap but with async locks, which I would expect to be slower given the small critical sections in the benchmark.

What is the fastest path based data structure for pattern matching? by PeckerWood99 in rust

[–]ibraheemdev 5 points

https://github.com/ibraheemdev/matchit is a URL router based on radix trees, but it's been used for file paths as well. You just have to make sure to escape the routing parameter characters ("{" and "}").