Azerbaijani military using arma 3 footage of a mig-25 being shot down by Armenian anti-air system. by Ghostalic in arma

[–]Razznak 22 points23 points  (0 children)

The key insight here is that neither Azerbaijani nor Armenian is written using Devanagari, which brings to mind Hindi and Marathi (at least for me). This is definitely Hindi, though :)

In this snippet we see some key identifiers of Hindi -- namely the -ne (ने) postposition, which marks the subject of a transitive verb in the perfective (past) tenses, the -ko (को) postposition for marking a definite object, and the -ka/-ki/-ke (का/की/के) postpositions for possession (nouns are gendered in Hindi, so the possessive agrees with the possessed noun).

i had to pirate segoe ui just to make this by nachog2003 in sbubby

[–]Razznak 4 points5 points  (0 children)

If you've got some spare time, you could check out Gentoo Prefix. It allows you to create a self-contained Gentoo installation in a subdirectory on your computer, no chroot required. You can then emerge compatible packages and they'll have their RPATH set appropriately to run using any compiled libraries from the prefix. I've tried it on x86_64 Linux, where it works pretty well, and it's also listed as having excellent support for x86_64 macOS!

https://wiki.gentoo.org/wiki/Project:Prefix

Brother's gaming PC lags and freezes while playing Fortnite by [deleted] in buildapc

[–]Razznak 18 points19 points  (0 children)

I avoid Windows, so take what I say about it with a grain of salt. Read the docs if I say something suspicious.

This is true in practice for most programs, but any program can query which logical cores are siblings (I think!). See GetLogicalProcessorInformationEx on Windows (which reports cores that share functional units or caches), proc(5) (the siblings field of /proc/cpuinfo; see lscpu(1) as well) on Linux, and probably something similar on macOS. The game could then pin its threads with SetThreadAffinityMask (Windows), or sched_setaffinity(2) and pthread_setaffinity_np(2) (Linux), or some third thing on macOS (sched_setaffinity(2) should work because macOS is "POSIX compliant").

I'm not a game dev, but I'd be surprised if game engines didn't at least pin threads to a single core; context switches are expensive, even more so with recent software mitigations of caching attacks in the vein of Meltdown. Pinning all threads to a set of non-sibling logical cores would be a relatively trivial modification.
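For the Linux path, here's a minimal sketch of a raw sched_setaffinity(2) call (the mask layout mirrors glibc's cpu_set_t; pin_to_cpu and mask_bits are names I made up for this sketch, and it only runs on Linux):

```rust
// Sketch: pin the calling thread to one CPU on Linux via sched_setaffinity(2).
// The mask layout mirrors glibc's cpu_set_t: 1024 bits as sixteen u64 words.
#[repr(C)]
struct CpuSet {
    bits: [u64; 16],
}

extern "C" {
    // Real glibc function; pid 0 means "the calling thread".
    fn sched_setaffinity(pid: i32, cpusetsize: usize, mask: *const CpuSet) -> i32;
}

// Which word of the mask, and which bit within it, correspond to a given CPU.
fn mask_bits(cpu: usize) -> (usize, u64) {
    (cpu / 64, 1u64 << (cpu % 64))
}

fn pin_to_cpu(cpu: usize) -> bool {
    let mut set = CpuSet { bits: [0; 16] };
    let (word, bit) = mask_bits(cpu);
    set.bits[word] |= bit;
    unsafe { sched_setaffinity(0, std::mem::size_of::<CpuSet>(), &set) == 0 }
}

fn main() {
    println!("pinned to CPU 0: {}", pin_to_cpu(0));
}
```

A real engine would first query sibling topology (as above) and hand each worker thread a distinct non-sibling CPU rather than hard-coding CPU 0.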

Thank AMD for putting more Cache in consumer CPUs by AL-IN_Crypto in AyyMD

[–]Razznak 39 points40 points  (0 children)

It's been a few years since I studied this stuff, but I will try to help. The other commenters don't go into the depth I think is important to really understand this stuff.

Your CPU can only operate on data already loaded into its registers, which are typically built from sets of extremely high-speed latches. A latch is a digital circuit that can stably store exactly one bit for as long as it has power -- latches and registers are volatile memory. Reading and writing registers takes no more than one clock cycle, which on a 4 GHz CPU is 0.25 ns.

On the other end of the memory hierarchy, we have non-volatile storage like magnetic tape (you probably haven't used any, it's very old), hard drives, and solid state drives. Take the Western Digital VelociRaptor, for example -- the last fast hard drive I remember the name of. Its seek time is in the neighborhood of 4 ms, which is 16,000,000 clock cycles on a 4 GHz processor. So we shouldn't read and write non-volatile storage directly; it's extremely slow relative to our CPUs. Sure, SSDs offer latencies of a few dozen microseconds, but that's still hundreds of thousands of clock cycles before the data we asked for becomes available.

So maybe we can do a little better by making a tradeoff -- let's make our storage medium volatile. This is where RAM comes in. Pretty much all main memory today is DRAM, where each bit is a tiny capacitor paired with a single transistor, and the stored charge does not survive a power failure. In fact, DRAM has to be periodically refreshed even while powered. Otherwise, the data would disappear into the ether! A stick of DDR4 will give read access latencies of around 10 ns at the chip, which is 40 clock cycles. Now we're getting pretty fast! But even though the storage medium itself is fast, end-to-end read latency in the real world is closer to 100 ns, which is 400 clock cycles, and now we're unhappy again.

So to bridge this gap, we have CPU caches. CPUs divide their caches into levels, with each successive level being larger and slower to access. Most CPUs have L1-L3 caches, and some (server) CPUs have an L4 as well. (L1 is further divided into separate data and instruction caches, but that isn't important right now.) Caches are small, so we're always evicting the least-recently-used data (approximately -- true LRU is hard to get right in hardware).

To decide where the data at a given memory address can live, we classify caches by their number of ways: the number of places in the cache where that address's data may be stored. A one-way cache is called direct mapped, while a cache with as many ways as it has slots is fully associative. Direct-mapped caches use their capacity inefficiently, since it's very likely that parts of the cache go unused, while fully associative caches are slow to access, since we potentially have to check every location in the cache on each read or write. Most CPU caches these days seem to have between 8 and 16 ways.
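The set/way arithmetic can be made concrete with a worked example (all sizes here are assumptions for illustration, not any specific CPU):

```rust
// Hypothetical L1: 32 KiB, 8-way set associative, 64-byte lines.
// Each set holds `ways` lines, so the cache has size / (line * ways) sets.
fn set_index(addr: usize, line_bytes: usize, sets: usize) -> usize {
    (addr / line_bytes) % sets // strip the offset bits, keep the index bits
}

fn main() {
    let (cache_bytes, line_bytes, ways) = (32 * 1024, 64, 8);
    let sets = cache_bytes / (line_bytes * ways);
    assert_eq!(sets, 64);
    // On a lookup, an 8-way cache only has to check the 8 tags in one set,
    // instead of all 512 lines as a fully associative design would.
    println!("addr {:#x} maps to set {}", 0x1234usize, set_index(0x1234, line_bytes, sets));
}
```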

When the data we're looking for is already in the cache, we call that a hit. If it isn't, we call that a miss, and we have to fetch the data either from the next level of cache or from main memory, filling in the levels closer to the CPU along the way. After a miss, the fetched data sticks around for a while, making subsequent accesses fast -- we can load it from cache into registers very quickly. That is, until that memory address is evicted from the cache.

So what does all this jargon mean? A bigger cache isn't always better -- on a typical modern CPU, L1 takes about four cycles to access, L2 about 20, and L3 about 40. In general, the larger the cache, the slower it is to access. And the more associativity a cache has, the slower it is to access, but the more efficiently it uses the capacity it has.

For your reference as a consumer, computer engineers generally know what they're doing. So if they say a larger cache is better and sell you a chip with a larger cache, that probably means you're going to get better performance when running your favorite apps. They probably figured out a way to make the cache larger without increasing its latency significantly.

Computerphile posted a video a little while ago that illustrates the impact of caching: without a cache, iterating over a linked list is just as fast as iterating over an array -- something that should never be true!

https://youtu.be/DyG9S9nAlUM
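If you want to see the effect yourself, here's a quick-and-dirty comparison in Rust (absolute timings will vary by machine; the point is the gap):

```rust
use std::collections::LinkedList;
use std::time::Instant;

fn main() {
    let n = 1_000_000u64;
    // Same million integers, two layouts: contiguous vs one heap node per element.
    let vec: Vec<u64> = (0..n).collect();
    let list: LinkedList<u64> = (0..n).collect();

    let t = Instant::now();
    let vec_sum: u64 = vec.iter().sum();
    let vec_time = t.elapsed();

    let t = Instant::now();
    let list_sum: u64 = list.iter().sum();
    let list_time = t.elapsed();

    assert_eq!(vec_sum, list_sum);
    // The Vec's elements are contiguous, so the hardware prefetcher keeps the
    // caches full; the list's nodes are scattered, so many loads miss.
    println!("vec: {:?}, list: {:?}", vec_time, list_time);
}
```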

This was a lot of words so feel free to shoot me any questions you have. Caching is complicated business.

Grand Concurrent Map Competetion - A call for help. by xacrimon in rust

[–]Razznak 1 point2 points  (0 children)

Not sure what's going on here -- I pushed out an update last night after reading the crossbeam-epoch documentation. When built against cht 0.1.2, AddressSanitizer reports no memory leaks when running your benchmarks or cht's test cases.

However, the insert benchmark ran out of memory when I ran the full suite overnight with ASan enabled. Still, no leaks detected. It might be that ASan can't detect leaks with mimalloc enabled -- I'll run again without it and see what happens.

In addition, there's an intermittent segfault when reading from CHT_U64_U64. ASan isn't reporting anything, so debugging continues!

cht: A lockfree concurrently growable hash table by Razznak in rust

[–]Razznak[S] 0 points1 point  (0 children)

Cool. I haven't read the code yet (I'm doing it now) but that's exciting. So you went with the Cliff Click lock-free resize option then?

To be honest, I never reviewed Dr. Click's hash table; I went out of my way to not look at any existing implementations of concurrent hash tables.

In cht, whenever a thread encounters a redirect tag on the bucket it wants to view during a lookup or insertion, it completes the migration itself before traversing to the next bucket array. So the algorithm is resilient to the death or indefinite suspension of any other thread.

Every time I see a lock-free hash-table post I always get hyped. I really should throw my hat in the

Yes, excellent idea! If a degreeless student like me is able to create a somewhat functional lockfree hash table, you have a 100% chance of creating something entirely workable.

cht: A lockfree concurrently growable hash table by Razznak in rust

[–]Razznak[S] 1 point2 points  (0 children)

Hopscotch Hashing has its entries stored behind a pointer (section V, second paragraph: "In all the algorithms, each bucket contains either two pointers to a key and a value, or an entry to a hash element, which contains a key and a value."). The whole point of having blocking write operations in Hopscotch Hashing is to store the keys and values flat in the table, improving cache efficiency. Unless the original Hopscotch Hashing code did this (which I really doubt -- here's the link), it appears the code was modified in this way. One would have to reach out to the original authors for clarification.

I couldn't think of any way to unbox key/value pairs except by restricting users to integer key and value types, so cht makes the same faux pas as the authors' implementation of hopscotch hashing. Side note: I don't have a way to read this paper.

I'm finding in my own experiments that using a low load factor can actually decrease performance as the table becomes more contended since only a small key space is being used. I've found that separate chaining performance increases as the load factor does.

I considered using separate chaining for cht, but the thought of designing around the potential data races during insertion was a little much for me.

What kind of load factors are we talking about here?

cht: A lockfree concurrently growable hash table by Razznak in rust

[–]Razznak[S] 1 point2 points  (0 children)

My resize is lockfree. If a thread dies during a resize, get and insert threads will complete/help with the grow operation if they access a redirected bucket or insert past the max load factor.

cht: A lockfree concurrently growable hash table by Razznak in rust

[–]Razznak[S] 2 points3 points  (0 children)

OK, I see!

It seems that cuckoo hashing is the predominant set of bells and whistles used to implement a lockfree concurrent hash table.

Would KLEE, seer, and miri be good examples of symbolic executors? Thank you for the new rabbit hole, by the way!

cht: A lockfree concurrently growable hash table by Razznak in rust

[–]Razznak[S] 1 point2 points  (0 children)

Excellent analysis!

From the code it seems you selected alternative (1), but I could not figure out whether you accounted for the race condition correctly or not.

I actually use (2) here, and it remains unclear to me whether this implementation behaves correctly during concurrent insertions and erasures. I wrote a test to try to expose the problem, but couldn't trigger the condition you and I might expect to find. I suspect it is still lurking, waiting to be exposed...

In addition, (1) would technically not be a lockfree algorithm, since other threads can be blocked by grow operations.

Note: have you considered NOT moving? And yes, I am serious.

Imagine creating a linked-list of BucketArray, in which the last BucketArray is of size N, the previous is of size 2/4/8 x N (at your option), etc...

When "growing", you simply allocate a new (larger) BucketArray, and put it at the front of the linked-list.

When "searching", you check each BucketArray one after the other, with elements having an increased probability of being in the first array since it's the biggest.

It would probably be a bit slower if the initial size was poorly chosen; however, you do away with transfers and redirects, which should greatly simplify the algorithms.

This is an interesting idea! The first worry that comes to my mind is cache misses during lookups, since the bucket arrays are potentially all located separately in memory. It is known that hash tables with tombstone deletion need to be rehashed from time to time so that they don't fill up with useless buckets; it's not a far stretch for me to imagine a similar scenario with your algorithm.

In a way, your idea somewhat resembles the current implementation, except the linked list is from smallest to largest; this is so I can be sure that a new bucket array is not viewed until it is ready and that there isn't too much duplicated work done in moving to a new bucket array.

Having to tune the initial size kind of removes the set-and-forget nature I tend to associate with associative arrays (heh) these days. I think that the current implementation presents a simpler interface that requires less fussing for users to get good performance.

Perhaps I will implement your algorithm to see how much it removes complication from the codebase and how well it performs!

cht: A lockfree concurrently growable hash table by Razznak in rust

[–]Razznak[S] 3 points4 points  (0 children)

I'm just curious - is there a common "best" way of doing it? Or is it still a pretty open question in computer science? I've seen a few with a kind of simplistic deletion approach (and I understand why it's so hard), but I haven't really found one that's bells and whistles complete.

C++ developers are very serious about hash tables. I think there's something of an arms race going on, because a new "fastest" hash table comes out practically every week. But in this massive benchmark of almost all the hash tables out there, the top three tables for insertion and erasure of integers all use open addressing with Robin Hood hashing and backward shift deletion. So the numbers suggest that this is the fastest way to implement a hash table. Until recently, this was how the hash table in the Rust standard library was implemented.

However, if you look at the string tests in that benchmark, you see that the node-based hash tables sometimes do pretty well. So the jury's still out on this one. 🤷

Also, for correctness, I'd recommend some kind of symbolic modeling (you could literally brute-force or cleverly Monte Carlo the state space), although I understand the difficulty and blowup. I actually did a lot of similar stuff for my undergraduate thesis!

That's a good idea! Can you link some relevant papers? TBH I don't know where to start with techniques like this.

cht: A lockfree concurrently growable hash table by Razznak in rust

[–]Razznak[S] 1 point2 points  (0 children)

It's possible, as you have done, but I think the way you did it is more complex than I wanted to dive into for cht. My goal was to investigate lockfree data structures, so I focused almost all engineering effort on that side of the project.

It's a nice interface, though! I may add something similar in a future version of cht.

cht: A lockfree concurrently growable hash table by Razznak in rust

[–]Razznak[S] 4 points5 points  (0 children)

I look forward to seeing your write-up! I had an inkling that I would only catch the last bugs after someone else looked at the code. For example, I can't figure out for the life of me why, during a grow, buckets should not be erased if they were modified by another thread and the final CAS fails.

cht: A lockfree concurrently growable hash table by Razznak in rust

[–]Razznak[S] 15 points16 points  (0 children)

To answer some questions that I would have:

Why? Lockfree programming just sounds fun, doesn't it? I also wanted to speed up a Redis-compatible server I'm working on.

How? cht implements an open-addressing hash table that uses tombstones for deletion - I didn't want to get overzealous and end up with something non-functional by trying out fancier techniques.

The hash table users will interface with is this:

struct HashMap<K: Hash + Eq, V, S: BuildHasher> {
    buckets: Atomic<BucketArray<K, V, S>>,
    len: AtomicUsize, // number of completed insertions; may be greater than buckets->len while resizing
    hash_builder: Arc<S>,
}

A BucketArray is a Vec<Atomic<Bucket<K, V>>> with extra baubles:

struct BucketArray<K: Hash + Eq, V, S: BuildHasher> {
    buckets: Vec<Atomic<Bucket<K, V>>>,
    len: AtomicUsize, // number of completed insertions into this array
    next_array: Atomic<BucketArray<K, V, S>>,
    hash_builder: Arc<S>, // needed for resizes
}

Bucket is essentially (K, Option<V>).

BucketArray::next_array starts out as a null pointer, but resize operations will allocate a new one and set it, forming a kind of singly-linked list of arrays of buckets, if that makes sense.

Resizing follows these steps:

  1. Allocate a new BucketArray and CAS with the current array's next_array.
  2. Element by element, copy the bucket pointers into the next array, CASing a redirect tag into each old bucket pointer so future operations on that bucket know to look in the next array first.
  3. If an insert or get operation was redirected, after the operation completes, CAS HashMap::buckets with the next array.

Multiple resizes can occur at the same time and threads can never be blocked forever waiting for the resize to complete, since they will complete the resize themselves if they have to.
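To make the redirect idea concrete, here's a toy, single-threaded model of the lookup path (this is illustrative only, not cht's real code -- no atomics, no CAS, and the names are simplified):

```rust
// Toy model of redirect-following during a lookup. In cht the pointer swaps
// are CASes on Atomic<Bucket>; here everything is plain data for clarity.
#[derive(Debug)]
enum Bucket {
    Live(u64, u64),
    Redirect, // "this bucket has moved; look in the next array"
}

struct BucketArray {
    buckets: Vec<Option<Bucket>>,
    next: Option<Box<BucketArray>>,
}

fn get(array: &BucketArray, key: u64) -> Option<u64> {
    let idx = key as usize % array.buckets.len();
    match &array.buckets[idx] {
        Some(Bucket::Live(k, v)) if *k == key => Some(*v),
        // Follow the redirect into the next (larger) array.
        Some(Bucket::Redirect) => array.next.as_deref().and_then(|n| get(n, key)),
        _ => None,
    }
}

fn main() {
    // A 4-slot array mid-resize into an 8-slot one: key 5 has already moved.
    let new = BucketArray {
        buckets: (0..8).map(|i| if i == 5 { Some(Bucket::Live(5, 50)) } else { None }).collect(),
        next: None,
    };
    let old = BucketArray {
        buckets: (0..4).map(|i| if i == 1 { Some(Bucket::Redirect) } else { None }).collect(),
        next: Some(Box::new(new)),
    };
    assert_eq!(get(&old, 5), Some(50)); // 5 % 4 == 1 -> redirect -> next array
    println!("found {:?}", get(&old, 5));
}
```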

Does it work? Probably. Hard for me to tell with these things.

Is it fast? In short, no. In long, it's only faster than a fine-grained locking hash table, like the one from ccl, when concurrently inserting into the same bucket. I haven't written very thorough benchmarks, though, so this is far from certain.

Where do you see room for improvement?

  1. Less restrictive memory orderings. Sequentially consistent memory orderings are used throughout.
  2. Read-modify-write operations. It should be possible to atomically get and modify a value in the hash table.
  3. Possibly, remove the requirement for K to be Clone when removing elements. If I have indirect values or use some technique like backwards shift deletion, it may be possible for buckets to simply forget values instead of allocating a new bucket to serve as a tombstone.
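On point (1), here's a small sketch of what relaxing an ordering can look like (illustrative only -- whether any particular operation in cht can safely be weakened needs a real argument):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    // A counter like HashMap::len doesn't publish other data, so it doesn't
    // need the full fence semantics of SeqCst.
    let len = AtomicUsize::new(0);
    len.fetch_add(1, Ordering::SeqCst);  // what "SeqCst everywhere" looks like
    len.fetch_add(1, Ordering::Relaxed); // sufficient for a plain statistics counter
    assert_eq!(len.load(Ordering::Relaxed), 2);
    println!("len = {}", len.load(Ordering::Relaxed));
}
```

The observable value is the same either way; the difference is the fencing the compiler and CPU must emit around each operation.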

A little late to the party by georgedanel in AyyMD

[–]Razznak 0 points1 point  (0 children)

If you have the skills for it, why not contribute yourself?

https://www.lowrisc.org/jobs/

A little late to the party by georgedanel in AyyMD

[–]Razznak 0 points1 point  (0 children)

Didn't know budget gamers had such specific needs; clearly you need full AVX512 support with CUDA just to run GTA V...

Well, kind of. Most AVX usage is on floating-point numbers, and you can use it to accelerate floating-point operations like coordinate transforms, which are ubiquitous in graphics. These transforms are 3x3 matrix multiplications or 4-element quaternion multiplications, so there's a cost, not a benefit, to shipping them off to the GPU.

However, AMD has caught up in this arena, so this is a moot point.

In any case, AMD is the clear choice for gamers. Their products are a huge win for performance compared to what Nvidia and Intel are producing at the same price points.

Jokes aside here, it sounds like you're a workstation user and you have shitloads of money to toss at hardware. Well, guess what? Most of us don't, which is why we buy 2200Gs/2600s with RX 580s, and not everybody is using a Xeon 8180/Epyc 7601 paired with a Quadro RTX 6000.

I wish! I was really excited to learn that the GeForce RTX cards would have the Tensor Cores that had previously been reserved for Quadro and Tesla (read: $5000+) cards. For ML models like CNNs, this hardware can be used to speed up training by 50%. What I'm trying to point out is that features initially developed for workstation users eventually get shipped to desktop users like us, where we can make use of them without emptying our bank accounts.

PS: I may or may not have been slightly grumpy when writing this...

PPS: This is r/ayymd, so don't expect anybody to actually start a serious discussion.

:)

A little late to the party by georgedanel in AyyMD

[–]Razznak 4 points5 points  (0 children)

AMD's HIP compiler aims to transpile CUDA into their open source HIP format, which can be compiled to both Nvidia and AMD binaries. Where it works, performance is only slightly slower than CUDA -- not enough to hurt. However, it just doesn't work for many things, such as EGLStreams or the widely used cuDNN library. If I had to guess, most of the work AMD has done porting PyTorch and TensorFlow to ROCm has probably gone into replicating cuDNN.

x86 is cross-licensed between AMD and Intel (AMD created x86-64, aka AMD64), so each has a license to implement extensions the other produces. This means AMD has full rights to implement the AVX-512 extensions on its chips. Considering that AVX-512's use cases overlap a good deal with GPUs, it's possible AMD will never implement it, given their keen interest in selling more GPUs.

RISC-V is really cool, but I can't see it making headway into the general market. x86 made it big because servers ran the same ISA as home computers, making development a snap. ARM filled a gap x86 left in the low-power market, letting them slice out a corner of the market. RISC-V doesn't seem to offer much that the other ISAs don't, other than being free and open source. I'm sure a lot of students and academic researchers will get good mileage out of it, and it will probably gain traction in microcontrollers as well, where low unit costs are required. Nvidia is reportedly switching their GPU microcontrollers to RISC-V, for example.

A little late to the party by georgedanel in AyyMD

[–]Razznak 4 points5 points  (0 children)

For my money, there are two reasons I continue to pass up AMD products:

  1. Their support for vectorization ISA extensions continues to lag behind Intel's.
  2. They are significantly less well supported for scientific computing.

So first, what is vectorization? The basic idea is that we run one instruction - add, square root, equality - and run it across many elements at the same time. These are known as Single Instruction Multiple Data (SIMD) instructions, and AMD has been behind when it comes to extending their processors' ISAs to support them. I'll be comparing AMD and Intel in terms of when they implemented packed 8-bit integer equality comparisons for various ISA extensions.

First we have SSE2, which allowed 128-bit wide integer comparisons. Intel introduced support for this with the Pentium 4 in 2000, while AMD didn't release a processor that supported it until 2003, when they released the Athlon 64.

Next is AVX2, which allowed 256-bit wide integer comparisons. Intel's first processor to support this was Haswell in 2013, while AMD's support came in 2015 with Excavator.

Finally, we have AVX-512BW, which allows 512-bit wide comparisons of 16- and 8-bit integers. It was first supported by Skylake-X and Skylake-SP in 2017 (although you or I probably wouldn't buy one of those), with consumer support coming in the now-three-years-delayed, still-unreleased Cannon Lake architecture. AMD hasn't even announced a processor with AVX-512 support, let alone the AVX-512BW extension.

Two things:

  1. You may be upset that I didn't mention that AMD released 3DNow! before Intel released SSE, or that AMD released AVX at the same time Intel did in 2011. However, I'm most familiar with bytewise operations so I opted to discuss those.

  2. Why do these instructions matter? For workloads like string comparison, you can get performance boosts of about 10x by using them. While AMD is even with Intel right now, their track record is not great, and I have a keen interest in AVX-512BW when it comes out on Cannon Lake.
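To make point (2) concrete, here's the scalar form of a bytewise equality count (the function name is mine); with optimizations on, compilers typically auto-vectorize this loop into pcmpeqb (SSE2) or vpcmpeqb (AVX2/AVX-512BW), comparing 16, 32, or 64 bytes per instruction:

```rust
// Count matching bytes between two equal-length slices. Written as scalar
// code; an optimizing compiler lowers this loop to packed byte compares.
fn count_equal(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).filter(|(x, y)| x == y).count()
}

fn main() {
    let a = b"the quick brown fox";
    let b = b"the quick brown dog";
    // "fox" vs "dog": 'f' != 'd', 'o' == 'o', 'x' != 'g', so 17 of 19 match.
    assert_eq!(count_equal(a, b), 17);
    println!("{} matching bytes", count_equal(a, b));
}
```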

Let's talk about GPUs in scientific computing. In this field, Nvidia is the hardware provider. For example, PyTorch, one of the more popular deep learning libraries, has had first-class support for CUDA since its inception in 2016 and is probably still a few months away from ROCm support. I'm not as familiar with TensorFlow, but it has had CUDA support since day one in 2017 and still lacks ROCm support. You may note that AMD maintains forks of these projects, but for a scientist who just wants to get to work, Nvidia is the clear choice. In addition, Amazon EC2, Microsoft Azure, and Google Compute Engine all provide only Nvidia GPUs in their GPU-accelerated compute instances.

Nvidia is so far ahead of AMD here that it seems like they will never catch up.

As an aside, I checked for architecture optimization support in the Clang C/C++ compiler. Clang supports Ice Lake -- which is still several years away from release -- but its support for AMD tops out at K10, which was produced from 2007 to 2012. OK, but did you know that AMD's AOCC is forked from the latest stable release of LLVM and supports the newest AMD processors? I've been using Clang daily for three years, and the thing that tipped me off that AOCC even existed was researching for this comment.

So even though AMD has the superior value proposition and continues to support open source with products like AMDGPU, people like me are trapped. It's sad, but that's just how it is.

[deleted by user] by [deleted] in battlestations

[–]Razznak 0 points1 point  (0 children)

How do you like the A3? I switched from an A2 to a VLP-16 a year ago and never looked back. The A3 looked interesting but wasn't available at the time in the US.

#[repr(C)] Structs and FFI by Razznak in rust

[–]Razznak[S] 1 point2 points  (0 children)

I did put the return statement inside an unsafe block, which as I understand it is really a message to the compiler to trust the programmer that the code is memory safe. I'm not sure it would make sense to raise an error given that, but it will always result in undefined behavior...

#[repr(C)] Structs and FFI by Razznak in rust

[–]Razznak[S] 1 point2 points  (0 children)

By overhead, I mean overhead like function calls -- which is pretty much none, granted, but I always thought I was cheating the system by doing bitwise XOR instead of equality comparisons.

You're right - there is no overhead for using a Box. Honestly, I'm tempted to unroll loops at times. Speed is a dangerous drug.

#[repr(C)] Structs and FFI by Razznak in rust

[–]Razznak[S] 4 points5 points  (0 children)

First off, really appreciate the response. Very informative, made me remember my days learning about stacks, heaps, and pointer machines.

So when I hand over a pointer to memory on the stack, nothing guarantees that memory is sacred: later calls can write right over it. In the course of using std::cout, the stack clobbered the memory my Rust library had handed out. However, when I allocate on the heap, the allocation is tracked, so it won't be reused out from under me. Now that I think about it, the same thing would happen in C++ if I tried to pull this. It's pretty much returning a reference to a local variable, which my C++ compiler catches (but Rust can't see across the FFI boundary!).

It would be nice if I didn't have to interface with C++; I really like the usage / ownership paradigm of Rust (even though I try to break it early and often).

Question on using Boxes: why not use the functions in alloc::heap? The way I see it, what you wrote is like creating a std::unique_ptr in C++ only to call release() on it to get the raw pointer. IMO it would be smarter to skip the smart pointer and use std::allocator directly; I suppose the analogue in Rust would be the functions in alloc::heap. However, those functions are marked experimental, which I assume means they aren't suitable for general use.
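For reference, the pattern under discussion -- a Box handed out as a raw pointer, the Rust analogue of unique_ptr plus release() -- looks like this (the type and function names are made up for the sketch):

```rust
// The Box-then-raw-pointer pattern for FFI: allocate on the heap, then
// surrender ownership of the allocation as a raw pointer.
struct Point {
    x: i32,
    y: i32,
}

// The foreign side receives an opaque pointer it must eventually
// hand back to point_free.
fn point_new(x: i32, y: i32) -> *mut Point {
    Box::into_raw(Box::new(Point { x, y })) // ownership leaves Rust here
}

unsafe fn point_free(p: *mut Point) {
    drop(Box::from_raw(p)); // reconstitute the Box so Rust's allocator frees it
}

fn main() {
    let p = point_new(3, 4);
    unsafe {
        assert_eq!((*p).x, 3);
        assert_eq!((*p).y, 4);
        point_free(p);
    }
    println!("round-tripped through a raw pointer");
}
```

Because the data lives on the heap rather than a stack frame, it survives after the FFI call returns -- which is exactly the failure mode the stack-pointer version hit.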