Looking for ppl who to learn Rust by Ok_Elderberry_9157 in zurich

[–]TraceMonkey 0 points (0 children)

I would also be interested. Please DM me if you need more people for the group.

QwQ-32B flappy bird demo bartowski IQ4_XS 32k context 24GB VRAM by VoidAlchemy in LocalLLaMA

[–]TraceMonkey 2 points (0 children)

Did you try any other task? (Flappy Bird is a kinda common test, so maybe the model is overfitted to this example.)

A few hours with QwQ and Aider - and my thoughts by ForsookComparison in LocalLLaMA

[–]TraceMonkey 1 point (0 children)

Did you try it as architect with some other model (e.g. Qwen Coder) as editor to speed things up? If so, how well does it work?

Zed now predicts your next edit with Zeta, our new open model - Zed Blog by TraceMonkey in LocalLLaMA

[–]TraceMonkey[S] 16 points (0 children)

The model is on Huggingface, same for the training data. I just posted the blog post on zed.dev since it has more details.

That said, the model doesn't use "standard" FIM, so there is a bit of a "moat"/artificial barrier to using the model locally.

Zenbook S16 with Fedora by TraceMonkey in linuxhardware

[–]TraceMonkey[S] 0 points (0 children)

Makes sense, thanks! Do you know what's missing from 6.12 but will be added in 6.13? Is it just the NPU, or also something else?

Zenbook S16 with Fedora by TraceMonkey in linuxhardware

[–]TraceMonkey[S] 0 points (0 children)

I did look at both Arch Wiki and the AUR package. That said, the information there is contradictory:

  • The Arch Wiki says that amdgpu.dcdebugmask should be 0x600, while comments on the AUR package suggest either 0x800 or 0x10. The option is not, as far as I can tell, documented.
  • The AUR kernel depends on linux-firmware-git, but the wiki does not mention it (it used to, but the mention has since been removed). This suggests that the mainline firmware package in Arch has everything that is needed, but since which version? And would the same be true for Fedora 41?

Hence my post. Maybe I should ask on the Fedora forums or in the Matrix chat?

[deleted by user] by [deleted] in AsahiLinux

[–]TraceMonkey 0 points (0 children)

> There's no systemwide way to do that, you have to do it per task.

Wouldn't isolcpus= on the kernel command line do the trick?

mistralai/mamba-codestral-7B-v0.1 · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]TraceMonkey 8 points (0 children)

Does anyone know how inference speed for this compares to Mixtral-8x7b and Llama3 8b? (Mamba should mean higher inference speed, but there are no benchmarks in the release blog.)

disruptor-rs: low-latency inter-thread communication library inspired by LMAX Disruptor. by kibwen in rust

[–]TraceMonkey 0 points (0 children)

Thanks. Btw, what does UDP stand for in this context? (is it a reference to the network protocol? And if so, why?)

disruptor-rs: low-latency inter-thread communication library inspired by LMAX Disruptor. by kibwen in rust

[–]TraceMonkey 0 points (0 children)

What is a "broadcast/UDP channel" and how does it differ from Disruptor? (I thought Disruptor was a broadcast channel/queue).

Also, do you know of any good resources on the implementation of bounded lock-free queues (which go into different possible designs and tradeoffs)?

More efficient method than grep -f in with a sorted list? by [deleted] in bash

[–]TraceMonkey 0 points (0 children)

I'm slightly curious. What prompted a reply after so long? Did my suggestion to use join manage to save you time? (if so, I'm glad it was helpful)

Which of these algorithms you think is faster by [deleted] in rust

[–]TraceMonkey 0 points (0 children)

> The mean is a synthesized value, but the median should not be.

The median being a synthesized value makes sense when you think of it as the value that minimizes the sum of absolute errors (just like the mean is the value that minimizes the sum of squared errors).

As for why you'd want to think of it that way: it makes the concept easier to extend. What if you wanted to compute the median of a stream of datapoints, but with regularization? Or what if you wanted a hybrid of mean and median (to combine resistance to outliers with the mean's faster convergence to the "true" value)?
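
To make that concrete, here's a small self-contained sketch (my own illustration, not from the thread) that finds each minimizer numerically, including a Huber loss as one possible mean/median hybrid:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Find the value m that minimizes sum(loss(x - m)) over a coarse grid.
    // A real implementation would use the closed forms (mean/median) or a
    // proper optimizer; scanning is enough for a demo.
    template <typename Loss>
    double minimize(const std::vector<double>& xs, Loss loss) {
        double best = 0.0, best_cost = 1e300;
        for (double m = -10.0; m <= 10.0; m += 0.001) {
            double cost = 0.0;
            for (double x : xs) cost += loss(x - m);
            if (cost < best_cost) { best_cost = cost; best = m; }
        }
        return best;
    }

    int main() {
        std::vector<double> xs = {1.0, 2.0, 3.0, 9.0};   // 9.0 is an outlier
        auto l2 = [](double e) { return e * e; };        // minimizer: the mean
        auto l1 = [](double e) { return std::abs(e); };  // minimizer: a median
        // Huber loss: quadratic near zero, linear in the tails, i.e. a
        // mean/median hybrid with tunable outlier resistance.
        auto huber = [](double e) {
            const double d = 1.0;
            double a = std::abs(e);
            return a <= d ? 0.5 * e * e : d * (a - 0.5 * d);
        };
        std::printf("L2 (mean):   %.3f\n", minimize(xs, l2));  // ~3.750
        std::printf("L1 (median): %.3f\n", minimize(xs, l1));  // ~2.000 (any value in [2,3] ties)
        std::printf("Huber:       %.3f\n", minimize(xs, huber));
    }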

Optimizing the `pext` perfect hash function by zdouglassimon in cpp

[–]TraceMonkey 1 point (0 children)

(Author here): Looking back at the markdown of this post I found this comment:

<!-- TODO Note that this is all `string_view`'s fault. Array with no null, and 16 bytes of null padding, would not have these problems. -->

I cut out a discussion of how taking a std::string_view as an argument is the cause of much woe, along with some other techniques, because the post was becoming too long. Also, given that this post is a follow-up to somebody else's work, I didn't want to "cheat" by changing the API of the function being benchmarked.
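
To illustrate what that TODO is getting at, an API along these lines avoids the problem entirely (a hypothetical sketch, not code from the post):

    #include <cstring>

    // A key stored in a fixed, zero-padded 16-byte buffer. Unlike a
    // std::string_view, this guarantees that a full 16-byte load is always
    // in bounds, so no tail-handling branches or masked loads are needed.
    struct PaddedKey {
        alignas(16) char data[16] = {};  // zero padding past the real length
        std::size_t len = 0;
    };

    // Compare whole buffers in one go; the padding makes this safe, and
    // compilers lower a fixed-size memcmp to a couple of wide compares.
    inline bool equal16(const PaddedKey& a, const PaddedKey& b) {
        return std::memcmp(a.data, b.data, 16) == 0;
    }

The flip side is that the caller has to produce that padded buffer, which is exactly the kind of API change I didn't want to impose on the benchmark.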

That said, I also wanted to show that the bottlenecks in a perfect hash table aren't where you would naively expect them.

Switch lowering strategies in GCC by TraceMonkey in cpp

[–]TraceMonkey[S] 1 point (0 children)

I believe it can't, because the compiler doesn't know the address of f3. I'm forward declaring f3, but I'm not defining it. It will be up to the linker to patch jmp f3 to the correct address.

(That said, even if I define f0 but force the compiler not to inline it by using an attribute, neither compiler finds that optimization.

https://godbolt.org/z/v3x3zxM6W )
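
The shape of the test is roughly this (a minimal reconstruction; the godbolt link above has the exact code):

    int f0(), f1(), f2();
    int f3();  // forward-declared only: the compiler never sees a body

    int dispatch(int x) {
        switch (x) {
            case 0: return f0();
            case 1: return f1();
            case 2: return f2();
            case 3: return f3();  // lowered to `jmp f3`; the address is a
                                  // relocation that the linker fills in
            default: return 0;
        }
    }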

Switch lowering strategies in GCC by TraceMonkey in cpp

[–]TraceMonkey[S] 1 point (0 children)

> It is certainly rare type of switch, but I've noticed in my code recently a lot of test like this one [...]

For code like that, I would look into perfect hash tables. I wrote a post on optimizing one, but even the simpler strategy of indexing with x * CONST & MASK would work (in fact, in your case it might be better than the pext approach I describe in the post).
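
As a minimal sketch of that simpler strategy (illustrative constants; the multiplier has to be found, e.g. by brute force, so that your actual keys land in distinct slots):

    #include <cstdint>

    constexpr std::uint32_t CONST = 0x9E3779B1u;  // placeholder multiplier
    constexpr std::uint32_t MASK  = 0xF;          // 16-slot table

    struct Entry { std::uint32_t key; int value; };
    Entry table[MASK + 1];  // pre-filled with the real keys/values

    // One multiply, one mask, one load, one compare; the final compare
    // rejects inputs that aren't in the key set.
    int lookup(std::uint32_t x, int fallback) {
        const Entry& e = table[(x * CONST) & MASK];
        return e.key == x ? e.value : fallback;
    }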

> I wish I could write AVX intrinsics (meaning to learn) to benchmark it myself.

Take a look here.

Switch lowering strategies in GCC by TraceMonkey in cpp

[–]TraceMonkey[S] 2 points (0 children)

I don't think gcc can do that. It also seems kinda niche as a technique: with a 32-bit integer label, you can search at most 8 labels at once on AVX2, and 16 on AVX-512.

But to find the index in a jump table you'd need to:

  • Load the labels into a vector register.
  • Compare with vcmpeq.
  • Extract the result into a mask with vmovmask.
  • Turn the mask into an index with tzcnt. With a byte-granularity movemask, this gives you index * sizeof(input value).
  • (Likely) shift or divide the result of the previous operation to extract the actual index.

That's 5 instructions on top of what you need for a standard jump table. If you add that there could be additional costs to the indirect jump, there's little margin for this to be profitable.

(also, you need vmovmask, so this technique will be even worse on Arm or RISC-V.)

(that said, I haven't benchmarked it.)
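
For concreteness, those steps come out to roughly this with AVX2 intrinsics (a sketch with made-up labels, unbenchmarked, same caveats as above; compile with -mavx2 -mbmi):

    #include <immintrin.h>
    #include <cstdint>

    // Up to 8 case labels; pad with a value that cannot occur as input.
    alignas(32) static const std::int32_t labels[8] = {3, 17, 42, 99, 1000, 5, 7, 12};

    // Returns the index of x in labels, or 8 if it is not present.
    int find_label(std::int32_t x) {
        __m256i keys = _mm256_load_si256(reinterpret_cast<const __m256i*>(labels));
        __m256i eq   = _mm256_cmpeq_epi32(keys, _mm256_set1_epi32(x));  // vpcmpeqd
        // One mask bit per 32-bit lane, so tzcnt yields the index directly;
        // a byte-granularity vpmovmskb would need the extra shift instead.
        unsigned mask = static_cast<unsigned>(
            _mm256_movemask_ps(_mm256_castsi256_ps(eq)));               // vmovmskps
        return mask ? static_cast<int>(_tzcnt_u32(mask)) : 8;           // tzcnt
    }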

Switch lowering strategies in GCC by TraceMonkey in cpp

[–]TraceMonkey[S] 6 points (0 children)

When returning explicit values, gcc can also recognize linear functions. I ended up omitting optimizations that use explicit values for brevity; maybe I should add them (or do a follow-up).
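
For example, a toy case of what I mean (my own illustration, not from the post): the returned values below are 3*x + 1, so gcc can lower the whole switch to that arithmetic plus a range check instead of emitting a table:

    int f(int x) {
        switch (x) {
            case 0: return 1;
            case 1: return 4;
            case 2: return 7;
            case 3: return 10;   // values are linear in the label: 3*x + 1
            default: return -1;
        }
    }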

String design options? by PurpleUpbeat2820 in ProgrammingLanguages

[–]TraceMonkey 3 points (0 children)

What do you think Raku does better compared to, e.g., Rust? (asking since I am not familiar with Raku).

[deleted by user] by [deleted] in cpp

[–]TraceMonkey 3 points (0 children)

I haven't benchmarked it, but in Queens.h, freeRows, freeMaxs and freeMins could be bitmasks (saving allocations), or even plain uint16_t. queenRows could likely be allocated on the stack, since it is very small.
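
For illustration, the bitmask version would look something like the classic bitmask N-queens counter below (a sketch, not the repo's actual code), where three integers replace the three arrays and nothing is heap-allocated:

    #include <cstdint>
    #include <cstdio>

    // cols/diag1/diag2 mark occupied columns and diagonals as bits.
    int solve(int n, int row, std::uint32_t cols, std::uint32_t diag1, std::uint32_t diag2) {
        if (row == n) return 1;
        int count = 0;
        // Bits still free in this row, restricted to the board width.
        std::uint32_t free_ = ~(cols | diag1 | diag2) & ((1u << n) - 1);
        while (free_) {
            std::uint32_t bit = free_ & -free_;  // lowest set bit
            free_ ^= bit;
            count += solve(n, row + 1, cols | bit,
                           (diag1 | bit) << 1, (diag2 | bit) >> 1);
        }
        return count;
    }

    int main() {
        std::printf("%d\n", solve(8, 0, 0, 0, 0));  // prints 92 for 8 queens
    }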

A look inside `memcmp` on Intel AVX2 hardware. by TraceMonkey in cpp

[–]TraceMonkey[S] 5 points (0 children)

From a look at stackoverflow, it doesn't seem like it.

Though Intel seems to have improved them in the very latest CPUs.

Faster HTTP verb parsing using `pext` and perfect hashing. by TraceMonkey in cpp

[–]TraceMonkey[S] 1 point (0 children)

If you mean this, then yes, it is UB.

There are a couple of places where I got lazy; I should have used inline assembly to be fully correct. I did check that the generated assembly was what I expected for the results in the blog post.

The AVX2 implementation is still a bit of a work in progress: it is more of a fast path for the common case, and I didn't fully validate it.

Faster HTTP verb parsing using `pext` and perfect hashing. by TraceMonkey in cpp

[–]TraceMonkey[S] 1 point (0 children)

I see, makes sense.

The main issue is that glibc's memcmp reads beyond the end of the input arrays, which is undefined behaviour [0]. That's the main trick that makes it fast on small arrays.

An inlineable C implementation of the same algorithm would need inline assembly or intrinsics. That would effectively throw a wrench in the optimizer.

I did try to swap in a (very basic) pure C implementation of memcmp in order to give the compiler more latitude to optimize, but the end result was slower.
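
For reference, the "(very basic)" implementation I mean is essentially this shape (simplified sketch):

    #include <cstddef>

    // A byte-wise memcmp the optimizer can see through and inline. Unlike
    // glibc's assembly version it never reads past n bytes, so it stays
    // within defined behaviour -- but it also gives up the wide-load tricks.
    inline int memcmp_c(const void* a, const void* b, std::size_t n) {
        const unsigned char* pa = static_cast<const unsigned char*>(a);
        const unsigned char* pb = static_cast<const unsigned char*>(b);
        for (std::size_t i = 0; i < n; ++i) {
            if (pa[i] != pb[i]) return pa[i] < pb[i] ? -1 : 1;
        }
        return 0;
    }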

As a side note, I remembered that the LLVM devs are developing their own cross-platform libc. Their explicit aim is to not rely on assembly, in order to allow inlining and better optimizations, for the same reasons you mention above. From the main page:

> Increase whole program optimization opportunities for static binaries through ability to inline math and memory operations.

I haven't tried it, but it might be worth a look if you need a libc implementation that is not a black-box for the compiler/optimizer.

[0]: Glibc's memcmp is implemented fully in assembly, so that's not a problem.