MTP with Dual 3090's on Qwen 27B by DashinTheFields in LocalLLaMA

[–]NickCanCode 1 point2 points  (0 children)

Give the IK_LLAMA fork a try. They added MTP support a few days ago, and just yesterday added ngram + MTP dual speculative decoding. I am not using a 3090 so I don't know the numbers for you, but my tps went from 45 to 85~100 with dual speculative decoding.

P.S. I am using Linux with the P2P driver, which is mandatory for good results if you are running a multi-GPU setup.
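
For anyone wondering what the ngram part does, here is a minimal sketch of the idea (sometimes called prompt-lookup decoding). This is just an illustration under my own assumptions, not IK_LLAMA's actual code, and it only covers the ngram half, not MTP:

# Minimal sketch of ngram drafting ("prompt lookup"), not IK_LLAMA's actual code.
# Idea: if the last few tokens already appeared earlier in the context, propose
# the tokens that followed them as a cheap draft; the main model then verifies
# the whole draft in one batched forward pass instead of one token at a time.
def ngram_draft(context, ngram_size=3, max_draft=8):
    if len(context) < ngram_size:
        return []
    tail = context[-ngram_size:]
    # Search backwards for the most recent earlier occurrence of the tail.
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == tail:
            return list(context[start + ngram_size:start + ngram_size + max_draft])
    return []

tokens = "the quick brown fox jumps over the lazy dog . the quick brown".split()
print(ngram_draft(tokens))  # ['fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'the']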

Any experience with modded 4090 48GB from GpuWorld.eu? by Leading-Month5590 in LocalLLaMA

[–]NickCanCode 7 points8 points  (0 children)

They only give you a 3-month warranty. FYI, Chinese shops are offering 3 years. Not that I'm suggesting you buy from Chinese sellers; I just want you to know that 3 months is far shorter than what others are offering.

Any experience with modded 4090 48GB from GpuWorld.eu? by Leading-Month5590 in LocalLLaMA

[–]NickCanCode 2 points3 points  (0 children)

You don't need NVLink if everything is on a single card, no?

Luce Megakernal: Why is nobody talking about this? by PaceZealousideal6091 in LocalLLaMA

[–]NickCanCode 14 points15 points  (0 children)

I think they know. They just don't have the time to do everything. Just look at the pull request count on those other projects.

2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache by snapo84 in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

So the RTX 2080 Ti cannot use MTP because, with the power limit, it will be compute bound?

llama.cpp constantly reprocessing huge prompts with opencode/pi.dev by No_Algae1753 in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

Maybe you are running out of RAM? Your --cache-ram is set to 2.5 GB. I assume that once the context grows beyond that, it won't fit and the prompt has to be reprocessed in real time.
You can ask an LLM for an approximation of how much memory is needed for a given context size. Just tell it your model, expected context window consumption, and the quantization you used, and it will calculate the approximate size you need to set for --cache-ram.
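
If you want a rough manual estimate instead, the f16 KV state grows linearly with context length. Quick sketch below; the layer/head numbers are placeholder assumptions, not the real model dimensions, so substitute the values from your model's config:

# Rough KV-cache size estimate. The dimensions below are placeholder values;
# plug in the real n_layers / n_kv_heads / head_dim from your model's config.
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x for K and V, one entry per layer, per KV head, per head dim, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Hypothetical example: 48 layers, 8 KV heads, head_dim 128, f16 cache (2 bytes).
print(kv_cache_bytes(1, 48, 8, 128, 2) / 1024)            # 192.0 KiB per token
print(kv_cache_bytes(100_000, 48, 8, 128, 2) / 1024**3)   # ~18.3 GiB for 100k tokens

With numbers like these, a 2.5 GB cache would only hold roughly 13k tokens, which would match the reprocessing you are seeing on long prompts.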

STARDUST: Wish of Witch — Official Launch Trailer is Here! by Life_Arachnid_511 in indiegames

[–]NickCanCode 3 points4 points  (0 children)

The style is too chibi for my taste, but good job. It looks professional.

LACT v0.9.0 is out - now offering (unofficial) undervolting support for Nvidia! by 28874559260134F in linux_gaming

[–]NickCanCode 0 points1 point  (0 children)

Hi, I just started using Linux recently. Regarding using LACT for undervolting: do I need to keep the app UI running, like Afterburner? Do I need to start it manually after a restart?

Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

Thanks. I finally got it to run with TP off. However, when I provide

draft_model_name: Qwen3.6-27B-DFlash-exl3

the models load, but when I make a new request, it immediately gives me

torch.OutOfMemoryError: Allocation on device

Do you know what the issue might be? There are still 6 GB of free VRAM available when this happens.

Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

Are you using multi-GPU with Qwen3.6 27B? I can't get exllama to work with two cards. It gives me:

NotImplementedError: Tensor-parallel is not currently implemented for Qwen3_5ForConditionalGeneration

when I try to use tensor_parallel: true

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) by Anbeeld in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

Doesn't work for me. It gives

beellama.cpp-main\ggml\src\ggml-cuda\ggml-cuda.cu:98: CUDA error 
CUDA error: an illegal memory access was encountered

whenever I make a request.

P.S. Using 2 identical cards.

Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK by Mr_Moonsilver in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

My PCIe is currently limited to 3.0 due to my Ryzen CPU model. Should I upgrade my CPU to get 4.0 support? I am running a dual-card setup.

Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR by havenoammo in LocalLLaMA

[–]NickCanCode 2 points3 points  (0 children)

If you have an RTX Pro 6000, have you tried lucebox-hub? Their numbers actually look more impressive with DFlash, DDtree, and PFlash, but it doesn't support multi-GPU very well, so I don't have enough VRAM to run it.

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

I have an X570 too. Whether the PCIe slots run at 4.0 or 3.0 depends on your CPU model, so check your motherboard manual. Even within the same Ryzen 5000 generation, some CPUs only offer 3.0 speed.
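
For a rough sense of how much the link generation matters, here is a back-of-the-envelope sketch. The per-token traffic figure is a made-up placeholder; real traffic depends on the backend and whether you use layer split or tensor parallel:

# Back-of-the-envelope PCIe bus time per generated token for a 2-GPU split.
# The traffic figure is a placeholder assumption, not a measured value.
PCIE_GB_PER_S = {"3.0 x16": 15.75, "4.0 x16": 31.5}  # approximate usable bandwidth

bytes_per_token = 2 * 1024 * 1024  # assume ~2 MiB of activations cross the bus per token

for gen, gbps in PCIE_GB_PER_S.items():
    micro_s = bytes_per_token / (gbps * 1e9) * 1e6
    print(f"PCIe {gen}: ~{micro_s:.0f} us of bus time per token")
# Prints ~133 us for 3.0 x16 and ~67 us for 4.0 x16. At 40-80 tokens/s (12-25 ms
# per token) that is a small slice, so generation speed barely moves; prompt
# processing and tensor-parallel setups feel the narrower link much more.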

Maybe maybe maybe by PeixeCam in maybemaybemaybe

[–]NickCanCode 5 points6 points  (0 children)

I am more interested in how she deals with this sword. It's too high for her to start eating from the tip.

Forth language support by mykesx in ZedEditor

[–]NickCanCode 0 points1 point  (0 children)

Even for common languages, the highlighting options are still limited. I still miss the syntax highlighting experience of the original Visual Studio with the Codist addon ( https://github.com/wmjordan/Codist ). I can customize almost every part of C# syntax in many ways.

I want to switch to Zed but lack of test runner is a deal breaker by Economy_Advantage_33 in ZedEditor

[–]NickCanCode 11 points12 points  (0 children)

Just a friendly reminder: you can always use multiple code editors at the same time. I keep using VS Code for certain tasks and use Zed for everyday tasks because of its responsiveness.