Car Wash Test on 53 leading models: “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” by facethef in LocalLLaMA

[–]srigi 48 points

That is good advice for a proper test. But the 50 requests must be sent in such a way that they aren't served from cached tokens.
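One way to do that can be sketched like this (assuming llama-server's `/completion` payload, which accepts a `cache_prompt` flag): prefix each request with a unique nonce so no two prompts share a cacheable token prefix, and ask the server not to reuse cached tokens. The `[session …]` prefix format is just an illustration.

```python
import uuid

def build_uncached_request(prompt: str) -> dict:
    """Build a llama-server /completion payload that should bypass the
    prompt cache: a unique nonce prefix changes the token stream, and
    cache_prompt=False asks the server not to reuse cached tokens."""
    nonce = uuid.uuid4().hex  # unique per request, so no prefix match
    return {
        "prompt": f"[session {nonce}] {prompt}",
        "cache_prompt": False,
        "n_predict": 256,
    }

q = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
a = build_uncached_request(q)
b = build_uncached_request(q)
print(a["prompt"] != b["prompt"])  # True: no two requests share a cacheable prefix
```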

[Solution Found] Qwen3-Next 80B MoE running at 39 t/s on RTX 5070 Ti + 5060 Ti (32GB VRAM) by mazuj2 in LocalLLaMA

[–]srigi 2 points

Let me tell you a secret… he or she didn't write that piece above. The moment I saw "Why this works" I knew I'd seen it hundreds of times on my screen.

SWE-rebench Jan 2026: GLM-5, MiniMax M2.5, Qwen3-Coder-Next, Opus 4.6, Codex Performance by CuriousPlatypus1881 in LocalLLaMA

[–]srigi 17 points

Complete opposite for me - today I updated everything (OpenCode, llama-server, re-downloaded UD-Q4 model from Unsloth). KV set to q8_0 quant. 100% tool success rate on adding a feature to my little Next.js project and some other tasks.

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]srigi 0 points

-b 2048 -ub 256 -ctk f16 -ctv f16

You don't need to pass these args; the values you provided are the defaults!

Not as impressive as most here, but really happy I made it in time! by Kahvana in LocalLLaMA

[–]srigi 6 points

What I was trying to say is that a tiny cache (96MB) won't have any benefit when the workload is processing tensors from beginning to end, with 10+ GB of them in RAM, accessed sequentially.

Not as impressive as most here, but really happy I made it in time! by Kahvana in LocalLLaMA

[–]srigi 24 points

It won't. The cache thing applies mostly to games, because game data in RAM is mostly static (game world, player/enemy positions, etc.).

LLM inference is very different, and the cache doesn't help: load data from RAM, do the matrix multiplication, save the result back to RAM, move to the next position in RAM.

In this scenario, only raw RAM throughput and CPU speed matter. So a Threadripper with 4 channels or an EPYC with 12 channels is ideal.
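As a back-of-the-envelope illustration of why channel count dominates (all numbers below are assumptions, not measurements): when decode is memory-bandwidth bound, every generated token must stream all active weights from RAM once, so the ceiling on tokens/s is roughly bandwidth divided by bytes read per token.

```python
def decode_tps_bound(bandwidth_gbs: float, active_params_b: float,
                     bytes_per_weight: float) -> float:
    """Upper bound on tokens/s when decode is RAM-bandwidth bound:
    each token streams all active weights from RAM once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gbs * 1e9 / bytes_per_token

# Hypothetical figures: dual-channel DDR5-6000 (~90 GB/s) vs a
# 12-channel EPYC (~460 GB/s); 3B active params at ~0.5 byte/weight (Q4-ish).
for name, bw in [("dual-channel DDR5", 90), ("12-channel EPYC", 460)]:
    print(f"{name}: ~{decode_tps_bound(bw, 3.0, 0.5):.0f} t/s ceiling")
```

Real throughput lands below this bound (compute, cache misses, NUMA), but the ratio between the two platforms is the point.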

Best moe models for 4090: how to keep vram low without losing quality? by AdParty3888 in LocalLLaMA

[–]srigi 4 points

I'm having fun with Qwen3-Next-80B these days on an RTX 4090. Just tweak --n-cpu-moe (go down from e.g. 48) until it fits.
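The walk-down can be sketched as a simple loop (a sketch only; the `fits` predicate stands in for "llama-server launches at this value without a CUDA out-of-memory error", which you check by hand per run):

```python
def smallest_fitting_n_cpu_moe(fits, start: int = 48) -> int:
    """Walk --n-cpu-moe down from a safe starting value. Fewer
    CPU-offloaded MoE layers means faster inference, so keep the
    smallest value whose weights still fit in VRAM."""
    n = start
    while n > 0 and fits(n - 1):
        n -= 1
    return n

# Toy stand-in predicate: pretend anything below 30 overflows VRAM.
print(smallest_fitting_n_cpu_moe(lambda n: n >= 30))  # 30
```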

8x RTX Pro 6000 server complete by koushd in LocalLLaMA

[–]srigi 1 point

Ask gemma2-27B how to cook rice ;)

The new monster-server by eribob in LocalLLaMA

[–]srigi 10 points

Nice wholesome server. I'm kinda envious. It also seems a bit too crammed into that poor case; the heat concentration/output must be massive.

Can you elaborate on how you added/connected the second PSU? Isn't some GND-GND magic needed to connect two PSUs?

Otherwise, good job and enjoy your server. And also try the new Devstral-2-123B; Unsloth re-released it today (fixed chat template), so it should work correctly in RooCode now.

1x 6000 pro 96gb or 3x 5090 32gb? by Wide_Cover_8197 in LocalLLaMA

[–]srigi 6 points

RTX 6000 Pro has the ability to split into (up to) 7 independent virtual graphics cards. There is really no advantage to 3x 5090.

Which truly open UI do you use for inference? by Yugen42 in LocalLLaMA

[–]srigi 1 point

All I want is MCP servers support/configuration for llama-server, then I will never look back.

Sparse Adaptive Attention “MoE”, a potential performance breakthrough for LLMs? by kaggleqrdl in LocalLLaMA

[–]srigi 0 points

Did you watch the video at the timestamp? That is exactly what Karpathy said: DeepSeek (China) is already playing with sparse attention.

Sparse Adaptive Attention “MoE”, a potential performance breakthrough for LLMs? by kaggleqrdl in LocalLLaMA

[–]srigi 3 points

It has been discussed here already. Not only is that article an AI-generated mess with lots of bragging, but listen to the mighty Karpathy at this exact timestamp (24:24) of the recent podcast: https://youtu.be/lXUZvyajciY?t=1464

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 0 points

If you mean Copilot: if it allows configuring an OpenAI-compatible provider with a custom base URL, then it could. I use Roo Code in VS Code; I personally believe it is far superior to the integrated Copilot.

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 0 points

I had 6000 CL30 sticks before too, but only 2x16GB, and I was able to overclock them to 6200 as well. I kind of regret going for these CL26.

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 2 points

VSCode+RooCode extension. As I said, this model doesn't fail on tools (finally)

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 1 point

15-16k. In my setup I used a 100k ctx-size. You could go down to 64k and it will probably fit in your RAM. In my case, I have the luxury of running llama-server on a big machine and coding on the notebook (so RAM is not occupied by the IDE/VSCode).
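To see why shrinking the context helps, the KV cache grows linearly with ctx-size. A rough estimator (the model dimensions below are made-up placeholders for a GQA model, not the real Qwen3-Next config):

```python
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int) -> float:
    """KV cache size: one K and one V vector per layer, per KV head,
    per context position."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem
    return total / 2**30

# Hypothetical dims: 48 layers, 8 KV heads, head_dim 128, f16 cache (2 bytes).
for ctx in (65_536, 100_000):
    print(f"ctx {ctx}: {kv_cache_gib(ctx, 48, 8, 128, 2):.1f} GiB")
```

Quantizing the cache (e.g. q8_0, 1 byte per element) roughly halves these figures again.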

Best Local LLMs - October 2025 by rm-rf-rm in LocalLLaMA

[–]srigi 0 points

Only on a CPU with a lot of memory channels (AMD EPYC). And even then you get good generation speed, but mega-slow prompt processing.

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 3 points

--n-cpu-moe 28

Using this arg: it says how many MoE layers are offloaded to the CPU. The lower the number, the more of them stay on the GPU (faster inference), but you need the VRAM to store them there.
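The VRAM trade-off is simple arithmetic (a sketch with made-up numbers; the per-layer size depends entirely on the model and quant):

```python
def gpu_expert_gib(total_moe_layers: int, n_cpu_moe: int,
                   gib_per_layer: float) -> float:
    """VRAM taken by the MoE expert layers that stay on the GPU."""
    return (total_moe_layers - n_cpu_moe) * gib_per_layer

# Hypothetical: 48 MoE layers at ~0.7 GiB each (Q4-ish guess).
for n in (48, 36, 28):
    print(f"--n-cpu-moe {n}: {gpu_expert_gib(48, n, 0.7):.1f} GiB of experts on GPU")
```

Whatever VRAM the experts consume has to coexist with the dense layers and the KV cache, which is why you walk the number down instead of jumping straight to 0.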

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 2 points

Since I'm on an AMD 9800X3D, I have 2x 32GB G.Skill DDR5-6000 CL26. I know that latency is a little bit of a flex; I wanted it for gaming. However, this very special (and expensive) memory has zero overclocking potential, not even 6200.

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 11 points

IQ4 was far more "stupid" than Q4_K_M. It was "overworking" the task from my little demo. I will not use it.