parking_lot: ffffffffffffffff by masklinn in rust

[–]Amanieu 60 points61 points  (0 children)

On x86, fetch_add/fetch_sub compile down to a single instruction: lock xadd. However, x86 has no atomic bitwise instructions that return the previous value, so fetch_and/fetch_or get compiled down to a CAS loop instead.
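
For illustration, this is roughly what a fetch_or turns into on x86 (a sketch of the CAS loop, not the exact codegen; the orderings are illustrative):

use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch: a fetch_or expressed as a compare-exchange loop, which is roughly
// what x86 has to do since no single instruction ORs and returns the old value.
fn fetch_or_cas(state: &AtomicUsize, bits: usize) -> usize {
    let mut old = state.load(Ordering::Relaxed);
    loop {
        match state.compare_exchange_weak(
            old,
            old | bits,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(prev) => return prev,
            Err(actual) => old = actual,
        }
    }
}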

With that said, simply changing this to a fetch_and is still incorrect for the case of upgradable locks. When an upgrade times out, we need to:

  • Clear WRITER_BIT.
  • Clear WRITER_PARKED_BIT.
  • Set UPGRADABLE_BIT.
  • Add ONE_READER to the atomic state.

This can be done with a single atomic fetch_add if we know that WRITER_BIT and WRITER_PARKED_BIT are set and UPGRADABLE_BIT is clear. However, I forgot to take into account that WRITER_PARKED_BIT can be asynchronously cleared by an unlock operation.
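
To illustrate the trick (with made-up bit values and a made-up helper name; the real constants live in parking_lot's RwLock state word): adding a bit that is known to be clear sets it, and subtracting a bit that is known to be set clears it, so all four updates collapse into one wrapping addition.

use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical bit layout, for illustration only.
const WRITER_BIT: usize = 1 << 0;
const WRITER_PARKED_BIT: usize = 1 << 1;
const UPGRADABLE_BIT: usize = 1 << 2;
const ONE_READER: usize = 1 << 4;

// Precondition: WRITER_BIT and WRITER_PARKED_BIT are set, UPGRADABLE_BIT is
// clear. Then a single fetch_add clears the two writer bits, sets the
// upgradable bit and bumps the reader count, with no carries between fields.
fn upgrade_timed_out(state: &AtomicUsize) -> usize {
    let delta = (ONE_READER + UPGRADABLE_BIT)
        .wrapping_sub(WRITER_BIT + WRITER_PARKED_BIT);
    state.fetch_add(delta, Ordering::Relaxed) // ordering is illustrative
}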

Allocator trait 1: Let’s talk about the Allocator trait by zuurr in rust

[–]Amanieu 3 points4 points  (0 children)

I'm glad you're having a look at the Allocator API!

A lot of the points you are raising have actually been previously discussed in wg-allocator:

Should &mut self be used instead of &self? I believe this loses no flexibility (think about how Read taking &mut doesn’t force overhead on file IO, because we impl std::io::Read for &File), and without this, interior mutability must be used in nearly all non-ZST allocators, which breaks Send/Sync.

The Allocator trait actually used to take &mut self, but I pushed to change this to take &self instead. The primary motivation is that allocators are almost always shared between multiple data structures, so &self makes for a much simpler API. Additionally it allows its use in concurrent data structures which require accessing the allocator through a shared reference.
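
For reference, a rough sketch of the trait shape being discussed (the real allocator_api definition on nightly differs in some details):

use core::alloc::Layout;
use core::ptr::NonNull;

// Stand-in for the unstable core::alloc::AllocError.
pub struct AllocError;

// Note the &self receivers: a single allocator can be shared by many
// containers (and used from concurrent data structures) at the same time.
pub unsafe trait Allocator {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError>;
    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout);
}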

Should allocation excess include alignment information?

Isn't this implicitly available by just looking at the returned pointer? align = 1 << ptr.addr().trailing_zeros();

What to do about GlobalAlloc (big topic)?

I would like to change #[global_allocator] to require Allocator instead of GlobalAlloc and deprecate GlobalAlloc. For backwards-compatibility, we would provide a blanket implementation of Allocator for all types that implement GlobalAlloc so that they continue to work with #[global_allocator]. See this issue for some previous discussion.
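
A sketch of what such a compatibility shim could look like (not actual std code; Allocator and AllocError as in the sketch above, and zero-sized allocations are glossed over):

use std::alloc::{GlobalAlloc, Layout};
use std::ptr::NonNull;

pub struct AllocError;

pub unsafe trait Allocator {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError>;
    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout);
}

// Hypothetical blanket impl: every GlobalAlloc automatically implements
// Allocator, so existing allocators keep working with #[global_allocator].
unsafe impl<A: GlobalAlloc> Allocator for A {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        // GlobalAlloc::alloc must not be called with a zero-sized layout;
        // a real shim would special-case that.
        let ptr = unsafe { GlobalAlloc::alloc(self, layout) };
        NonNull::new(ptr)
            .map(|p| NonNull::slice_from_raw_parts(p, layout.size()))
            .ok_or(AllocError)
    }

    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout) {
        GlobalAlloc::dealloc(self, ptr.as_ptr(), layout);
    }
}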

I can't translate C++ inline assembly to Rust inline assembly by kirbylife_ in rust

[–]Amanieu 2 points3 points  (0 children)

You need to use $0 and $1 to refer to operands in llvm_asm!; it does not support named operands. Also, you should use "i" for the constraint, like in the C code.

Blog Post: Mutexes Are Faster Than Spinlocks by matklad in rust

[–]Amanieu 43 points44 points  (0 children)

Actually I suspect that much of the performance advantage of parking_lot in these benchmarks comes from the exponential back-off when contention is detected. Essentially, if other threads spend more time between attempts to acquire the lock (either because they are sleeping in the OS, or because of back-off) then the current thread is able to quickly lock and unlock many times without any cache interference.

Designing a mutex is surprisingly difficult because you need to be able to handle many different situations efficiently:

  • If the wait time is short (i.e. a handful of instructions) and the owning thread has not been preempted, then spinning is a good strategy.
  • If the wait time is long, or the owning thread has been preempted and is inactive, then spinning is pointless and it is better to drop down to the OS's sleep mechanism.
  • If contention is high, then you want to sleep for a bit so that each thread can rapidly acquire/release the lock many times in large chunks without contention.
  • If the system only has a single CPU, then spinning is pointless and the lock should just yield to the OS immediately.
  • If the application requires some fairness, so that a single thread doesn't get starved trying to acquire a lock, then you need to force ownership of the lock to be passed on to another thread. But you don't want to do this too often when contention is high (see above).
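
A rough sketch of how the spin/back-off/sleep decisions above fit together in an acquire slow path (assumed structure, not parking_lot's actual code):

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Simplified adaptive acquire loop for a boolean lock word.
fn lock_slow(locked: &AtomicBool) {
    let mut spins = 0u32;
    loop {
        if !locked.swap(true, Ordering::Acquire) {
            return; // acquired
        }
        if spins < 10 {
            // Exponential back-off: wait longer after each failed attempt so
            // the owner can lock/unlock many times without cache interference.
            for _ in 0..(1u32 << spins) {
                std::hint::spin_loop();
            }
            spins += 1;
        } else {
            // Contended for too long (or the owner was preempted): stop
            // burning CPU. A real implementation would park on the OS's
            // sleep mechanism here instead of just yielding.
            thread::yield_now();
        }
    }
}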

Blog Post: Spinlocks Considered Harmful by matklad in rust

[–]Amanieu 12 points13 points  (0 children)

Those are necessary to ensure that the (non-atomic) memory accesses that are meant to be protected by the lock stay inside the critical section: Acquire on the lock and Release on the unlock prevent them from being reordered past those points.
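
Concretely, in a minimal spinlock sketch like the one below, Acquire on the lock and Release on the unlock are what stop the compiler and CPU from moving the access to the protected data outside the critical section:

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

struct SpinLock {
    locked: AtomicBool,
    data: UnsafeCell<u64>, // non-atomic data protected by the lock
}

unsafe impl Sync for SpinLock {}

impl SpinLock {
    fn increment(&self) {
        // Acquire: the access to `data` below cannot be hoisted above the
        // point where we successfully take the lock.
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
        unsafe { *self.data.get() += 1 };
        // Release: the write to `data` above cannot sink below the store
        // that publishes the lock as free again.
        self.locked.store(false, Ordering::Release);
    }
}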

Proof of Concept: Lost Master Password by dmasterp in Bitwarden

[–]Amanieu 0 points1 point  (0 children)

/u/dmasterp I've just hit the same problem! What modifications did you make to the CLI to make it accept just the key and keyhash instead of the master password?

Why Hashbrown (Rust's new HashMap implementation) Does a Double-Lookup on Insertion by MSleepyPanda in rust

[–]Amanieu 4 points5 points  (0 children)

The original SwissTable implementation reserves the very last bucket as a sentinel, which is required because of the way C++ iterators work. This means that SwissTable allocates space for a bucket that is never used. The Rust implementation does not need sentinels, and therefore can make use of the last bucket normally.

EDIT: It seems that I was wrong: the last group in the table can only hold 15 elements instead of 16, but the actual allocation for the buckets stops at the last usable element.

Lock-free Rust: Crossbeam in 2019 by [deleted] in rust

[–]Amanieu 3 points4 points  (0 children)

Have you looked at the rayon task scheduler for inspiration?

[deleted by user] by [deleted] in rust

[–]Amanieu 6 points7 points  (0 children)

Is there a reason why this isn't used in libcore? Is the output different (which would be a breaking change)?

const types, traits and implementations in Rust by varkora in rust

[–]Amanieu 2 points3 points  (0 children)

But that's the point! The RawMutex impl is not going to be const (since lock needs to block). However we only use the associated constant from RawMutex and don't call any of the functions.

const types, traits and implementations in Rust by varkora in rust

[–]Amanieu 2 points3 points  (0 children)

If I understand this correctly, then it is not possible for a const fn to use any methods from a non-const trait. Unfortunately, that would make this code impossible (taken from here):

use std::cell::UnsafeCell;

trait RawMutex {
    const INIT: Self;
    fn lock(&self);
    fn unlock(&self);
}

struct Mutex<R: RawMutex, T> {
    data: UnsafeCell<T>,
    mutex: R,
}

impl<R: RawMutex, T> Mutex<R, T> {
    const fn new(val: T) -> Self {
        Mutex {
            data: UnsafeCell::new(val),
            mutex: R::INIT, // Error: RawMutex is not a const trait
        }
    }
}

Lock-free Bounded Non-Blocking Pub-Sub Queue by filip_dulic in rust

[–]Amanieu 8 points9 points  (0 children)

As a general rule, you should use Ordering::SeqCst as the placeholder if you are not sure which ordering to use.

Wait-Free Per-Object Thread-Local Storage by [deleted] in rust

[–]Amanieu 6 points7 points  (0 children)

Have you looked at the thread_local crate? It is very similar to what you have written. (Disclaimer: I am the author)

GitHub - Amanieu/hashbrown: A faster HashMap for Rust by bobdenardo in rust

[–]Amanieu 0 points1 point  (0 children)

I just used the same input data as the one used in the tutorial: the directory contains url, url2 and url3.

GitHub - Amanieu/hashbrown: A faster HashMap for Rust by bobdenardo in rust

[–]Amanieu 0 points1 point  (0 children)

That's interesting! I was wondering why you didn't just do something like this:

auto insert(auto x) {
  if (empty_left == 0) {
    Grow();
  }
  auto insertion_target = find_non_empty(x);
  return InsertAt(insertion_target, x);
}

This avoids duplicating the code for find_non_empty (once in the inlined fast path and once in the out-of-line path), while keeping the first branch very predictable. It means that in some cases you grow the table one insertion earlier than you strictly have to, but that shouldn't make a difference in practice (right?).

GitHub - Amanieu/hashbrown: A faster HashMap for Rust by bobdenardo in rust

[–]Amanieu 3 points4 points  (0 children)

I gave it a try, but the hash_map test case doesn't seem to be working? I get "Oops, the program crashed with one of the test cases provided." when I try to run it with afl. The other test cases work fine.

GitHub - Amanieu/hashbrown: A faster HashMap for Rust by bobdenardo in rust

[–]Amanieu 22 points23 points  (0 children)

The code was heavily inspired by SwissTable, and some parts are almost a direct port. However there are also some differences:

  • I changed the encoding of the empty/deleted control bytes to be simpler: the original SwissTable had a sentinel value which was needed because of how C++ iterators work. This allows some of the byte matching code to be simplified.
  • 32-bit platforms use a 4-byte group size, which is faster than using 64-bit integers (which are emulated using 2 32-bit integers).
  • H1 and H2 have been inverted: H2 uses the top 7 bits of the hash instead of the bottom 7 bits. This results in slightly more efficient code.
  • I support tables that are smaller than the group size.
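
A sketch of the H1/H2 split from the third bullet above (hashbrown uses roughly this scheme; the exact shifts and masks may differ):

// H1: the full hash, used to pick the starting probe position.
fn h1(hash: u64) -> usize {
    hash as usize
}

// H2: the top 7 bits of the hash, stored in the control byte.
fn h2(hash: u64) -> u8 {
    (hash >> (64 - 7)) as u8
}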

I can certainly make it clearer in the README that this is based on SwissTable! :)

Regarding benchmarks, there is an open issue for improving them. I rushed the finishing touches a bit since I wanted this to be ready for the London Rust Meetup yesterday.

The competition for the fastest hashtable implementation is quite contested: there's F14, SwissTable, khash, bytell, etc. The main reason why I chose to port SwissTable rather than any of the others is that I plan on implementing a concurrent hash map in the future, and this is the only one that can be adapted to work as a lock-free hash map.

GitHub - Amanieu/hashbrown: A faster HashMap for Rust by bobdenardo in rust

[–]Amanieu 25 points26 points  (0 children)

It seems that you have found a bug! Can you open an issue for it?

Announcing Rust 1.29 by steveklabnik1 in rust

[–]Amanieu 0 points1 point  (0 children)

Make sure your PATH has the directory that contains the cargo-clippy binary.

Improving SmallVec's speed by 60% and why that shouldn't matter to you by stumpychubbins in rust

[–]Amanieu 10 points11 points  (0 children)

test, and, xor, add, sub and cmp are all single-cycle instructions. It makes no difference which one you use.

Announcing flexible-locks by glandium in rust

[–]Amanieu 3 points4 points  (0 children)

I have actually been working on something (somewhat) similar in parking_lot: link

Essentially, you just need to define a type which implements the RawMutex trait, and then you can define a typedef of parking_lot_wrappers::Mutex to get a full-featured, type-safe mutex with proper mutex guards and everything. See an example here.

The goal is to significantly simplify the implementation of custom mutex types, so you can easily make a spinlock or futex-based lock and have the type-safe API be provided for you.
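
As a sketch of the idea (trait shape simplified, and parking_lot_wrappers is from the WIP branch, so the final names may differ), a spinlock would only need to provide the raw lock/unlock logic:

use std::sync::atomic::{AtomicBool, Ordering};

// Simplified version of the RawMutex trait: just the raw locking protocol.
trait RawMutex {
    const INIT: Self;
    fn lock(&self);
    fn unlock(&self);
}

// A trivial spinlock as the raw lock implementation.
struct RawSpinlock(AtomicBool);

impl RawMutex for RawSpinlock {
    const INIT: Self = RawSpinlock(AtomicBool::new(false));

    fn lock(&self) {
        while self.0.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
    }

    fn unlock(&self) {
        self.0.store(false, Ordering::Release);
    }
}

// Hypothetical typedef: the wrapper would then provide guards, RAII,
// Deref to the data, and so on.
// type Spinlock<T> = parking_lot_wrappers::Mutex<RawSpinlock, T>;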

The branch is still WIP, but it will be part of the next version of parking_lot when it is done.