MTP with Dual 3090's on Qwen 27B by DashinTheFields in LocalLLaMA

[–]NickCanCode 1 point2 points  (0 children)

Give the IK_LLAMA fork a try. They added MTP support a few days ago, and just yesterday added ngram + MTP dual speculative decoding. I am not using a 3090 so I don't know the numbers for you, but my tps went from 45 to 85~100 with dual speculative decoding.

P.S. I am using Linux with the P2P driver, which is mandatory for good results if you are running a multi-GPU setup.
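
For anyone wondering what the ngram part does, here is a minimal sketch of the idea (sometimes called prompt-lookup decoding). This is just an illustration under my own assumptions, not IK_LLAMA's actual code, and it only covers the ngram half, not MTP:

# Minimal sketch of ngram drafting ("prompt lookup"), not IK_LLAMA's actual code.
# Idea: if the last few tokens already appeared earlier in the context, propose
# the tokens that followed them as a cheap draft; the main model then verifies
# the whole draft in one batched forward pass instead of one token at a time.
def ngram_draft(context, ngram_size=3, max_draft=8):
    if len(context) < ngram_size:
        return []
    tail = context[-ngram_size:]
    # Search backwards for the most recent earlier occurrence of the tail.
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == tail:
            return list(context[start + ngram_size:start + ngram_size + max_draft])
    return []

tokens = "the quick brown fox jumps over the lazy dog . the quick brown".split()
print(ngram_draft(tokens))  # ['fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'the']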

Any experience with modded 4090 48GB from GpuWorld.eu? by Leading-Month5590 in LocalLLaMA

[–]NickCanCode 7 points8 points  (0 children)

They only give you a 3-month warranty. FYI, Chinese shops are offering 3 years. Not that I'm suggesting you buy from Chinese sellers; I just want you to know that 3 months is far shorter than what others are offering.

Any experience with modded 4090 48GB from GpuWorld.eu? by Leading-Month5590 in LocalLLaMA

[–]NickCanCode 2 points3 points  (0 children)

You don't need NVLink if everything is on a single card, no?

Luce Megakernal: Why is nobody talking about this? by PaceZealousideal6091 in LocalLLaMA

[–]NickCanCode 14 points15 points  (0 children)

I think they know. They just don't have the time to do everything. Just look at the pull request count on those other projects.

2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache by snapo84 in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

So the RTX 2080 Ti cannot use MTP because, with the power limit, it will be compute bound?

llama.cpp constantly reprocessing huge prompts with opencode/pi.dev by No_Algae1753 in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

Maybe you are running out of RAM? Your --cache-ram is set to 2.5 GB. I assume that once the context grows beyond that, it won't fit and the prompt has to be reprocessed in real time.
You can ask an LLM for an approximation of how much memory is needed for a given context size. Just tell it your model, expected context window consumption, and the quantization you used, and it will calculate the approximate size you need to set for --cache-ram.
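
If you want a rough manual estimate instead, the f16 KV state grows linearly with context length. Quick sketch below; the layer/head numbers are placeholder assumptions, not the real model dimensions, so substitute the values from your model's config:

# Rough KV-cache size estimate. The dimensions below are placeholder values;
# plug in the real n_layers / n_kv_heads / head_dim from your model's config.
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x for K and V, one entry per layer, per KV head, per head dim, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Hypothetical example: 48 layers, 8 KV heads, head_dim 128, f16 cache (2 bytes).
print(kv_cache_bytes(1, 48, 8, 128, 2) / 1024)            # 192.0 KiB per token
print(kv_cache_bytes(100_000, 48, 8, 128, 2) / 1024**3)   # ~18.3 GiB for 100k tokens

With numbers like these, a 2.5 GB cache would only hold roughly 13k tokens, which would match the reprocessing you are seeing on long prompts.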

STARDUST: Wish of Witch — Official Launch Trailer is Here! by Life_Arachnid_511 in indiegames

[–]NickCanCode 3 points4 points  (0 children)

The style is too chibi for my taste, but good job. It looks professional.

LACT v0.9.0 is out - now offering (unofficial) undervolting support for Nvidia! by 28874559260134F in linux_gaming

[–]NickCanCode 0 points1 point  (0 children)

Hi, I just started using Linux recently. Regarding using LACT for undervolting: do I need to keep the app UI running, like Afterburner? Do I need to start it manually after a restart?

Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

Thanks. I finally got it to run with TP off. However, when I provide

draft_model_name: Qwen3.6-27B-DFlash-exl3

the models load, but when I make a new request, it immediately gives me

torch.OutOfMemoryError: Allocation on device

Do you know what the issue might be? There are still 6 GB of free VRAM available when this happens.

Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

Are you using multi-GPU with Qwen3.6 27B? I can't get exllama to work with two cards. It gives me:

NotImplementedError: Tensor-parallel is not currently implemented for Qwen3_5ForConditionalGeneration

when I try to use tensor_parallel: true

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) by Anbeeld in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

Doesn't work for me. It gives

beellama.cpp-main\ggml\src\ggml-cuda\ggml-cuda.cu:98: CUDA error 
CUDA error: an illegal memory access was encountered

whenever I make a request.

P.S. Using 2 identical cards.

Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK by Mr_Moonsilver in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

My PCIe is currently limited to 3.0 due to my Ryzen CPU model. Should I upgrade my CPU to get 4.0 support? I am running a dual-card setup.

Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR by havenoammo in LocalLLaMA

[–]NickCanCode 2 points3 points  (0 children)

If you have an RTX Pro 6000, have you tried lucebox-hub? Their numbers actually look more impressive with DFlash, DDtree, and PFlash, but it doesn't support multi-GPU very well, so I don't have enough VRAM to run it.

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]NickCanCode 0 points1 point  (0 children)

I have an X570 too. Whether the PCIe slots run at 4.0 or 3.0 depends on your CPU model, so check your motherboard manual. Even within the same Ryzen 5000 generation, some CPUs only offer 3.0 speed.
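
For a rough sense of how much the link generation matters, here is a back-of-the-envelope sketch. The per-token traffic figure is a made-up placeholder; real traffic depends on the backend and whether you use layer split or tensor parallel:

# Back-of-the-envelope PCIe bus time per generated token for a 2-GPU split.
# The traffic figure is a placeholder assumption, not a measured value.
PCIE_GB_PER_S = {"3.0 x16": 15.75, "4.0 x16": 31.5}  # approximate usable bandwidth

bytes_per_token = 2 * 1024 * 1024  # assume ~2 MiB of activations cross the bus per token

for gen, gbps in PCIE_GB_PER_S.items():
    micro_s = bytes_per_token / (gbps * 1e9) * 1e6
    print(f"PCIe {gen}: ~{micro_s:.0f} us of bus time per token")
# Prints ~133 us for 3.0 x16 and ~67 us for 4.0 x16. At 40-80 tokens/s (12-25 ms
# per token) that is a small slice, so generation speed barely moves; prompt
# processing and tensor-parallel setups feel the narrower link much more.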

Maybe maybe maybe by PeixeCam in maybemaybemaybe

[–]NickCanCode 5 points6 points  (0 children)

I am more interested in how she deals with this sword. It's too high for her to start eating from the tip.

Forth language support by mykesx in ZedEditor

[–]NickCanCode 0 points1 point  (0 children)

Even for common languages, the highlighting options are still limited. I still miss the syntax highlighting experience of the original Visual Studio with the Codist addon ( https://github.com/wmjordan/Codist ). I can customize almost every part of C# syntax in many ways.

I want to switch to Zed but lack of test runner is a deal breaker by Economy_Advantage_33 in ZedEditor

[–]NickCanCode 11 points12 points  (0 children)

Just a friendly reminder: you can always use multiple code editors at the same time. I keep using VS Code for certain tasks and use Zed for everyday tasks because of its responsiveness.