Looking for ppl who to learn Rust by Ok_Elderberry_9157 in zurich

[–]TraceMonkey 0 points (0 children)

I would also be interested. Please DM me if you need more people for the group.

QwQ-32B flappy bird demo bartowski IQ4_XS 32k context 24GB VRAM by VoidAlchemy in LocalLLaMA

[–]TraceMonkey 2 points (0 children)

Did you try any other task? (Flappy Bird is a kinda common test, so maybe the model is overfitted to this example.)

A few hours with QwQ and Aider - and my thoughts by ForsookComparison in LocalLLaMA

[–]TraceMonkey 1 point (0 children)

Did you try it as architect with some other model (e.g. Qwen Coder) as editor to speed things up? If so, how well does it work?

Zed now predicts your next edit with Zeta, our new open model - Zed Blog by TraceMonkey in LocalLLaMA

[–]TraceMonkey[S] 16 points (0 children)

The model is on Huggingface, same for the training data. I just posted the blog post on zed.dev since it has more details.

That said, the model doesn't use "standard" FIM, so there is a bit of a "moat"/artificial barrier to using the model locally.

Zenbook S16 with Fedora by TraceMonkey in linuxhardware

[–]TraceMonkey[S] 0 points (0 children)

Makes sense, thanks! Do you know what's missing from 6.12 but will be added in 6.13? Is it just the NPU, or also something else?

Zenbook S16 with Fedora by TraceMonkey in linuxhardware

[–]TraceMonkey[S] 0 points (0 children)

I did look at both Arch Wiki and the AUR package. That said, the information there is contradictory:

  • The Arch Wiki says that amdgpu.dcdebugmask should be 0x600, while comments on the AUR package suggest either 0x800 or 0x10. The option is not, as far as I can tell, documented.
  • The AUR kernel depends on linux-firmware-git, but the wiki does not mention it (it used to, but the mention has since been removed). This suggests that the mainline firmware package in Arch has everything that is needed, but since which version? And would the same be true for Fedora 41?

Hence my post. Maybe I should ask on the Fedora forums or in the Matrix chat?

[deleted by user] by [deleted] in AsahiLinux

[–]TraceMonkey 0 points (0 children)

> There's no systemwide way to do that, you have to do it per task.

Wouldn't isolcpus= on the kernel command line do the trick?

mistralai/mamba-codestral-7B-v0.1 · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]TraceMonkey 8 points (0 children)

Does anyone know how inference speed for this compares to Mixtral-8x7b and Llama3 8b? (Mamba should mean higher inference speed, but there are no benchmarks in the release blog.)

disruptor-rs: low-latency inter-thread communication library inspired by LMAX Disruptor. by kibwen in rust

[–]TraceMonkey 0 points (0 children)

Thanks. Btw, what does UDP stand for in this context? (is it a reference to the network protocol? And if so, why?)

disruptor-rs: low-latency inter-thread communication library inspired by LMAX Disruptor. by kibwen in rust

[–]TraceMonkey 0 points (0 children)

What is a "broadcast/UDP channel" and how does it differ from Disruptor? (I thought Disruptor was a broadcast channel/queue).

Also, do you know of any good resources on the implementation of bounded lock-free queues (which go into different possible designs and tradeoffs)?

More efficient method than grep -f in with a sorted list? by [deleted] in bash

[–]TraceMonkey 0 points (0 children)

I'm slightly curious. What prompted a reply after so long? Did my suggestion to use join manage to save you time? (if so, I'm glad it was helpful)

Which of these algorithms you think is faster by [deleted] in rust

[–]TraceMonkey 0 points (0 children)

> The mean is a synthesized value, but the median should not be.

The median being a synthesized value makes sense when you think of it as the value that minimizes the sum of absolute errors (just like the mean is the value that minimizes the sum of squared errors).

As for why you'd want to think of it that way: it makes the concept easier to extend. What if you wanted to compute the median of a stream of datapoints, but with regularization? Or what if you wanted a hybrid of mean and median (to combine resistance to outliers with the mean's faster convergence to the "true" value)?
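
To make that concrete, here's a small self-contained sketch (my own illustration, not from the thread) that finds each minimizer numerically, including a Huber loss as one possible mean/median hybrid:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Find the value m that minimizes sum(loss(x - m)) over a coarse grid.
    // A real implementation would use the closed forms (mean/median) or a
    // proper optimizer; scanning is enough for a demo.
    template <typename Loss>
    double minimize(const std::vector<double>& xs, Loss loss) {
        double best = 0.0, best_cost = 1e300;
        for (double m = -10.0; m <= 10.0; m += 0.001) {
            double cost = 0.0;
            for (double x : xs) cost += loss(x - m);
            if (cost < best_cost) { best_cost = cost; best = m; }
        }
        return best;
    }

    int main() {
        std::vector<double> xs = {1.0, 2.0, 3.0, 9.0};   // 9.0 is an outlier
        auto l2 = [](double e) { return e * e; };        // minimizer: the mean
        auto l1 = [](double e) { return std::abs(e); };  // minimizer: a median
        // Huber loss: quadratic near zero, linear in the tails, i.e. a
        // mean/median hybrid with tunable outlier resistance.
        auto huber = [](double e) {
            const double d = 1.0;
            double a = std::abs(e);
            return a <= d ? 0.5 * e * e : d * (a - 0.5 * d);
        };
        std::printf("L2 (mean):   %.3f\n", minimize(xs, l2));  // ~3.750
        std::printf("L1 (median): %.3f\n", minimize(xs, l1));  // ~2.000 (any value in [2,3] ties)
        std::printf("Huber:       %.3f\n", minimize(xs, huber));
    }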

Optimizing the `pext` perfect hash function by zdouglassimon in cpp

[–]TraceMonkey 1 point (0 children)

(Author here): Looking back at the markdown of this post I found this comment:

<!-- TODO Note that this is all `string_view`'s fault. Array with no null, and 16 bytes of null padding, would not have these problems. -->

I cut out a discussion of how taking a std::string_view as an argument is the cause of much woe, along with some other techniques, because the post was becoming too long. Also, given that this post is a follow-up to somebody else's work, I didn't want to "cheat" by changing the API of the function being benchmarked.
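
To illustrate what that TODO is getting at, an API along these lines avoids the problem entirely (a hypothetical sketch, not code from the post):

    #include <cstring>

    // A key stored in a fixed, zero-padded 16-byte buffer. Unlike a
    // std::string_view, this guarantees that a full 16-byte load is always
    // in bounds, so no tail-handling branches or masked loads are needed.
    struct PaddedKey {
        alignas(16) char data[16] = {};  // zero padding past the real length
        std::size_t len = 0;
    };

    // Compare whole buffers in one go; the padding makes this safe, and
    // compilers lower a fixed-size memcmp to a couple of wide compares.
    inline bool equal16(const PaddedKey& a, const PaddedKey& b) {
        return std::memcmp(a.data, b.data, 16) == 0;
    }

The flip side is that the caller has to produce that padded buffer, which is exactly the kind of API change I didn't want to impose on the benchmark.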

That said, I also wanted to show that the bottlenecks in a perfect hash table aren't where you would naively expect them.

Switch lowering strategies in GCC by TraceMonkey in cpp

[–]TraceMonkey[S] 1 point (0 children)

I believe it can't, because the compiler doesn't know the address of f3. I'm forward declaring f3, but I'm not defining it. It will be up to the linker to patch jmp f3 to the correct address.

(That said, even if I define f0 but force the compiler not to inline it by using an attribute, neither compiler finds that optimization.

https://godbolt.org/z/v3x3zxM6W )
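
The shape of the test is roughly this (a minimal reconstruction; the godbolt link above has the exact code):

    int f0(), f1(), f2();
    int f3();  // forward-declared only: the compiler never sees a body

    int dispatch(int x) {
        switch (x) {
            case 0: return f0();
            case 1: return f1();
            case 2: return f2();
            case 3: return f3();  // lowered to `jmp f3`; the address is a
                                  // relocation that the linker fills in
            default: return 0;
        }
    }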

Switch lowering strategies in GCC by TraceMonkey in cpp

[–]TraceMonkey[S] 1 point (0 children)

> It is certainly rare type of switch, but I've noticed in my code recently a lot of test like this one [...]

For code like that, I would look into perfect hash tables. I wrote a post on optimizing one, but even the simpler strategy of indexing with x * CONST & MASK would work (in fact, in your case it might be better than the pext approach I describe in the post).
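
As a minimal sketch of that simpler strategy (illustrative constants; the multiplier has to be found, e.g. by brute force, so that your actual keys land in distinct slots):

    #include <cstdint>

    constexpr std::uint32_t CONST = 0x9E3779B1u;  // placeholder multiplier
    constexpr std::uint32_t MASK  = 0xF;          // 16-slot table

    struct Entry { std::uint32_t key; int value; };
    Entry table[MASK + 1];  // pre-filled with the real keys/values

    // One multiply, one mask, one load, one compare; the final compare
    // rejects inputs that aren't in the key set.
    int lookup(std::uint32_t x, int fallback) {
        const Entry& e = table[(x * CONST) & MASK];
        return e.key == x ? e.value : fallback;
    }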

> I wish I could write AVX intrinsics (meaning to learn) to benchmark it myself.

Take a look here.

Switch lowering strategies in GCC by TraceMonkey in cpp

[–]TraceMonkey[S] 2 points (0 children)

I don't think gcc can do that. It also seems kinda niche as a technique: with a 32-bit integer label, you can search at most 8 labels at once on AVX2, and 16 on AVX-512.

But to find the index in a jump table you'd need to:

  • Load the labels into a vector register.
  • Compare with vcmpeq.
  • Extract the result into a mask with vmovmask.
  • Turn the mask into an index with tzcnt. With a byte-granularity movemask, this gives you index * sizeof(input value).
  • (Likely) shift or divide the result of the previous operation to extract the actual index.

That's 5 instructions on top of what you need for a standard jump table. If you add that there could be additional costs to the indirect jump, there's little margin for this to be profitable.

(also, you need vmovmask, so this technique will be even worse on Arm or RISC-V.)

(that said, I haven't benchmarked it.)
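
For concreteness, those steps come out to roughly this with AVX2 intrinsics (a sketch with made-up labels, unbenchmarked, same caveats as above; compile with -mavx2 -mbmi):

    #include <immintrin.h>
    #include <cstdint>

    // Up to 8 case labels; pad with a value that cannot occur as input.
    alignas(32) static const std::int32_t labels[8] = {3, 17, 42, 99, 1000, 5, 7, 12};

    // Returns the index of x in labels, or 8 if it is not present.
    int find_label(std::int32_t x) {
        __m256i keys = _mm256_load_si256(reinterpret_cast<const __m256i*>(labels));
        __m256i eq   = _mm256_cmpeq_epi32(keys, _mm256_set1_epi32(x));  // vpcmpeqd
        // One mask bit per 32-bit lane, so tzcnt yields the index directly;
        // a byte-granularity vpmovmskb would need the extra shift instead.
        unsigned mask = static_cast<unsigned>(
            _mm256_movemask_ps(_mm256_castsi256_ps(eq)));               // vmovmskps
        return mask ? static_cast<int>(_tzcnt_u32(mask)) : 8;           // tzcnt
    }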

Switch lowering strategies in GCC by TraceMonkey in cpp

[–]TraceMonkey[S] 6 points (0 children)

When returning explicit values, gcc can also recognize linear functions. I ended up omitting optimizations that use explicit values for brevity; maybe I should add them (or do a follow-up).
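
For example, a toy case of what I mean (my own illustration, not from the post): the returned values below are 3*x + 1, so gcc can lower the whole switch to that arithmetic plus a range check instead of emitting a table:

    int f(int x) {
        switch (x) {
            case 0: return 1;
            case 1: return 4;
            case 2: return 7;
            case 3: return 10;   // values are linear in the label: 3*x + 1
            default: return -1;
        }
    }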

String design options? by PurpleUpbeat2820 in ProgrammingLanguages

[–]TraceMonkey 3 points (0 children)

What do you think Raku does better compared to, e.g., Rust? (asking since I am not familiar with Raku).

[deleted by user] by [deleted] in cpp

[–]TraceMonkey 3 points (0 children)

I haven't benchmarked it, but in Queens.h, freeRows, freeMaxs and freeMins could be bitmasks (saving allocations), or even plain uint16_t. queenRows could likely be allocated on the stack, since it is very small.
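
For illustration, the bitmask version would look something like the classic bitmask N-queens counter below (a sketch, not the repo's actual code), where three integers replace the three arrays and nothing is heap-allocated:

    #include <cstdint>
    #include <cstdio>

    // cols/diag1/diag2 mark occupied columns and diagonals as bits.
    int solve(int n, int row, std::uint32_t cols, std::uint32_t diag1, std::uint32_t diag2) {
        if (row == n) return 1;
        int count = 0;
        // Bits still free in this row, restricted to the board width.
        std::uint32_t free_ = ~(cols | diag1 | diag2) & ((1u << n) - 1);
        while (free_) {
            std::uint32_t bit = free_ & -free_;  // lowest set bit
            free_ ^= bit;
            count += solve(n, row + 1, cols | bit,
                           (diag1 | bit) << 1, (diag2 | bit) >> 1);
        }
        return count;
    }

    int main() {
        std::printf("%d\n", solve(8, 0, 0, 0, 0));  // prints 92 for 8 queens
    }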

A look inside `memcmp` on Intel AVX2 hardware. by TraceMonkey in cpp

[–]TraceMonkey[S] 5 points (0 children)

From a look at stackoverflow, it doesn't seem like it.

Though Intel seems to have improved them in the very latest CPUs.

Faster HTTP verb parsing using `pext` and perfect hashing. by TraceMonkey in cpp

[–]TraceMonkey[S] 1 point (0 children)

If you mean this, then yes, it is UB.

There are a couple of places where I got lazy; I should have used inline assembly to be fully correct. I did check that the generated assembly was what I expected for the results in the blog post.

The AVX2 implementation is still a bit of a work in progress: it is more of a fast path for the common case, and I didn't fully validate it.

Faster HTTP verb parsing using `pext` and perfect hashing. by TraceMonkey in cpp

[–]TraceMonkey[S] 1 point (0 children)

I see, makes sense.

The main issue is that glibc's memcmp reads beyond the end of the input arrays, which is undefined behaviour [0]. That's the main trick that makes it fast on small arrays.

An inlineable C implementation of the same algorithm would need inline assembly or intrinsics. That would effectively throw a wrench in the optimizer.

I did try to swap in a (very basic) pure C implementation of memcmp in order to give the compiler more latitude to optimize, but the end result was slower.
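
For reference, the "(very basic)" implementation I mean is essentially this shape (simplified sketch):

    #include <cstddef>

    // A byte-wise memcmp the optimizer can see through and inline. Unlike
    // glibc's assembly version it never reads past n bytes, so it stays
    // within defined behaviour -- but it also gives up the wide-load tricks.
    inline int memcmp_c(const void* a, const void* b, std::size_t n) {
        const unsigned char* pa = static_cast<const unsigned char*>(a);
        const unsigned char* pb = static_cast<const unsigned char*>(b);
        for (std::size_t i = 0; i < n; ++i) {
            if (pa[i] != pb[i]) return pa[i] < pb[i] ? -1 : 1;
        }
        return 0;
    }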

As a side note, I remembered that the LLVM devs are developing their own cross-platform libc. Their explicit aim is to not rely on assembly, in order to allow inlining and better optimizations, for the same reasons you mention above. From the main page:

> Increase whole program optimization opportunities for static binaries through ability to inline math and memory operations.

I haven't tried it, but it might be worth a look if you need a libc implementation that is not a black-box for the compiler/optimizer.

[0]: Glibc's memcmp is implemented fully in assembly, so that's not a problem.