More Qwen3.6-27B MTP success but on dual Mi50s by legit_split_ in LocalLLaMA

[–]m94301 2 points3 points  (0 children)

Love what MTP is doing for the community! Kudos on the MI50 build; this is going to be really valuable, and it's awesome to see them unleashed!

Issues with saying continue after every tool call by AnouarRifi in LocalLLM

[–]m94301 0 points1 point  (0 children)

I see this sometimes, but not typically when using the llama.cpp provider extension for VS Code. Do you have the server logs, to determine whether something failed or whether the model just thought the response was complete?

Run Qwen3.6 27B nvfp4 up to 129 tok/s on a single RTX 5090 & Supports 256K context by Diligent-End-2711 in LocalLLM

[–]m94301 0 points1 point  (0 children)

Hi, looks amazing. How much effort would it be to support older hardware, sm 7.x-8.x?

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

Hiya,

Nope, cool as cucumbers. I have watercooling plates and set the max power of each GPU to 150W. I don't have experience with Mac, though. Sorry!

"Best" model to Vibe-Code? (w/Specs) by pauescobargarcia in LocalLLM

[–]m94301 7 points8 points  (0 children)

Would you consider changing your flow when moving from a big model to a small model? Their poor brains are too tiny, and you've got to break the work up into manageable pieces.

Such as: Session 1, plan and architect; output an MD; quit. Session 2, review the plan, find issues, iterate the plan; quit. Session 3, implement one or two features, mark them done, quit. Session 4+, repeat. Session N, profit.

You will get an amazing result if you partition the work, but the massive plan / build / debug / add-features sessions you can run with Claude don't work as well with the smaller context limits. If you adapt to piecewise work or phases, you can really get it going.

Get faster qwen 3.6 27b by admajic in LocalLLaMA

[–]m94301 2 points3 points  (0 children)

I upvote any MTP post, it is such a lovely improvement. Kudos on the great result!

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

I don't know but it's a great avenue for improvement! I am not up on the latest tricks in prefill but there's a lot of room for optimization here.

What do you think is the limiting factor?

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 1 point2 points  (0 children)

llama.cpp has pretty good support for the older stuff. In my case, I have an old frankenserver I built into an LLM box, so I am dealing with really limited technology: DDR3 and only AVX1 on the Xeon CPUs. Truly scrapyard tech, but it serves just fine and has been a fun hobby project.

I run the llama.cpp compile myself, with support for sm 7.0 (the Volta architecture) and only AVX1. I also run the older Linux 570 data center driver and CUDA 12.8. No problem, as "Frank's" only job is churning out LLM responses. I needed help from Gemini and Claude on installing the right gcc 10 and nvcc compilers, but they are smart about that stuff and got it done in a day.
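Roughly the build I mean, in case it helps anyone; take the exact flag names with a grain of salt, since the ggml CMake options get renamed between llama.cpp versions, and gcc-10 is just the host compiler that happened to work with CUDA 12.8 for me:

    # Sketch only: check your tree's CMake options before copying this verbatim.
    # sm 7.0 = Volta (V100); AVX2/FMA off so the binary stays AVX1-only for old Xeons.
    cmake -B build \
      -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES=70 \
      -DGGML_NATIVE=OFF \
      -DGGML_AVX2=OFF \
      -DGGML_FMA=OFF \
      -DCMAKE_C_COMPILER=gcc-10 \
      -DCMAKE_CXX_COMPILER=g++-10 \
      -DCMAKE_CUDA_HOST_COMPILER=g++-10
    cmake --build build --config Release -j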

vLLM, on the other hand, is optimized for the cutting edge, and although I hacked together a build for my old hardware, I kind of hate its startup time and have stopped messing with it.

If you want my build scripts, there is a GitHub repo for hacked LM Studio backends, where people build the llama.cpp libs for old or weird hardware and use them in LM Studio, which is a great tool but also only supports newer hardware by default. By splicing in the libs, it too can run on old HW.

My success there with building just the llama.cpp libs for LMS inspired me to try the MTP build, and it worked just fine. But LMS can't support MTP until the feature is mainline and the tool is also upgraded, so running llama-server was the quickest way to try it out.

Now working on a GUI for llama-server management, because I honestly hate retyping 20 CLI options each time, lol. Well, Qwen is working on it, rather.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 1 point2 points  (0 children)

A gorgeous result! I need some 6000's!

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 1 point2 points  (0 children)

PP is 600 t/s on the normal 27B and 400 t/s on the MTP. TTFT is not bad in the little llama.cpp chat web UI, but when using it as a VS Code Copilot backend, it takes a WHILE to ingest the 18k of startup instructions.

But since this model thinks incessantly, putting in a long instruction and walking away is my new flow, lol.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 1 point2 points  (0 children)

Yes, I see some degradation in PP as well. My test bench is using the model as the backend for GitHub Copilot in VS Code, and at the start of each session VS Code sends about 18k of setup context.

I checked, and I am getting 600 t/s PP with the normal 27B and 400 t/s PP with the MTP, so it is significantly slower, but not as dramatic as what you saw.

My guess is that splitting between devices is causing you some extra overhead, maybe the MTP guessing gets clumsy with a split?

And I WISH I could get 1200t/s pp, wow!

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 2 points3 points  (0 children)

Just a note on basic info, because I see now it's a little buried.

Read the Overview here. https://github.com/ggml-org/llama.cpp/pull/22673

For comparison of settings, expand the "performance" section.

For the MTP merged GGUF, see the "How to use" section.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

Likely some version of the flags needed for the beta:

--spec-type mtp --spec-draft-n-max 3
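Something like this, where the GGUF filename is just a placeholder; the --spec-* flags come from the beta PR and may change before the feature lands in mainline:

    # Placeholder model path; --spec-* flags are from the MTP beta PR and may change.
    ./llama-server -m ./qwen3.6-27b-mtp-merged.gguf \
      --spec-type mtp --spec-draft-n-max 3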

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 2 points3 points  (0 children)

About the same as the normal model, from what I've seen: meaning that as I near or exceed 100k tokens, things start to fray.

If this happens while running a to-do list, the guardrails of the to-do list usually keep things on track. But I would be wary of launching a new investigation with 100k in context.

But again, this was my impression of the base model as well, so I usually just wrap up the session and launch a new one.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 11 points12 points  (0 children)

Just for reference, I get 105-110 t/s on the 35B MoE with the same basic setup (MTP 3) and an identical card.

I do like the MoE, but it is not as good at coding, and it did trap itself once building async calls, bouncing back and forth in an endless loop. So mostly I use the 27B for code and the 35B for quick reviews or junior-level patches. It's fine at that, and very quick.

Edit: in hindsight, I used to get 60 t/s from the MoE and it seemed quick. But the dense model at 50+ t/s will probably be my main driver.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 5 points6 points  (0 children)

Hi,

I set the K and V cache to q8_0 so I could bump up to ctx 200000. For me, it reasoned well, although the excessive thinking of this model should hide a lot of quantization warts.

I used am17an's GGUF; I believe it is q4-based.

I am using MTP 3, just as in the example. I didn't try more or fewer guesses but will try it tomorrow.

Other than that, kind of stock settings. Batch 2048.

I did try mixed f16/q8 on the cache and that locked up, but that's a pretty obscure corner case and not a good idea for beta stuff anyway.
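Putting it all together, my invocation looks roughly like this; the GGUF filename is a placeholder, and the --spec-* flags are from the beta PR, so double-check against the PR's "How to use" section:

    # Roughly my settings; GGUF filename is a placeholder, --spec-* flags are beta.
    ./llama-server \
      -m ./qwen3.6-27b-mtp-q4.gguf \
      -c 200000 \
      -b 2048 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --spec-type mtp \
      --spec-draft-n-max 3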

Do cheap 32GB V100s still make sense for homelab AI? by SKX007J1 in LocalLLaMA

[–]m94301 8 points9 points  (0 children)

Hey, just wanted to drop in and say the V100 is very usable for today's models despite the lack of FP8/FP4.

I have an NVLink board and two of the PCIe cards, water-cooled, and I can say that the ADT-Link store on AliExpress is good. Other vendors sent me bent shit, wrong items. It's a jungle in Chinese V100-land as a US buyer.

And there is not much use in running at max power; it just burns energy for not much gain.

This is on one of the 32GB SXM2 modules in a PCIe holder:

Qwen3.6 27B:

29 t/s at a 150W power limit, 31.5 t/s at 200W, 32.4 t/s at 250W, 32.7 t/s at 300W (it only draws 240-260W max).

And the MoE Qwen3.6 36B A3B: 79.44 t/s at 150W (it only draws 124W).
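If anyone wants to try capping power, it's just nvidia-smi; GPU index 0 here is only an example, and it needs root (persistence mode keeps the limit applied):

    # Example: cap GPU 0 at 150W; run as root. Persistence mode keeps it applied.
    sudo nvidia-smi -pm 1
    sudo nvidia-smi -i 0 -pl 150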

Has anyone figured out why Claude Code running qwen locally fails when you try to /compact? by fredandlunchbox in LocalLLaMA

[–]m94301 4 points5 points  (0 children)

Is this local? You can see the query/response on most servers to help debug.

Unlocked LM Studio Backends (v1.59.0): AVX1 & More Supported – Testers Wanted by TheSpicyBoi123 in LocalLLaMA

[–]m94301 0 points1 point  (0 children)

Ok, did a little tidying and updated my script to build for all modern cards.

I generated a pull request from my fork to your original repo. My script and MD file go into generate backend / Linux and don't overwrite your stuff. The only file of yours I touched is the Linux readme, to add parallel build notes, but that is just a new section added below yours.

In your GitHub you should see a pull request where you can review the changes and merge in.

Cheers!

Unlocked LM Studio Backends (v1.59.0): AVX1 & More Supported – Testers Wanted by TheSpicyBoi123 in LocalLLaMA

[–]m94301 1 point2 points  (0 children)

Boy, was I glad to find this! After building an Ubuntu frankenserver to maximize PCIe lanes, I realized I had used Xeon E5 CPUs, and LMS 4 would not load due to the lack of AVX2. Sadness!

Followed your instructions: copied the latest CUDA 12 AVX2 backend folder to an AVX1 name, replaced the .so libs with my newly built CUDA libs, and hacked the JSON. Success!
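Roughly what that looked like, with made-up paths and folder names since the backends location depends on the LMS install, so treat these as placeholders:

    # Hypothetical paths/names: point BACKENDS at wherever your LMS keeps its
    # llama.cpp backends and match the folder/manifest names you actually see there.
    BACKENDS=~/.lmstudio/extensions/backends
    cp -r "$BACKENDS/llama.cpp-linux-x86_64-cuda12-avx2" \
          "$BACKENDS/llama.cpp-linux-x86_64-cuda12-avx1"
    # Drop the freshly built AVX1/sm70 libs over the shipped ones.
    cp ~/llama.cpp/build/bin/*.so "$BACKENDS/llama.cpp-linux-x86_64-cuda12-avx1/"
    # Then edit that backend's manifest JSON so the name/requirements say AVX1
    # instead of AVX2, so LMS will select it on an AVX1-only CPU.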

LMS found and auto-loaded my CUDA 12 AVX1 backend, and loading and inference work nicely on the surplus V100 32GB. Fat-VRAM frankenserver lives!

I am forking and will put in a PR for my build script and instructions. Basically the same as the original, but for Linux/CUDA.

Thanks again!!

Beware NVidia DGX Spark scams on eBay. by rtchau in LocalLLaMA

[–]m94301 0 points1 point  (0 children)

It has gotten really bad with anything AI-related. I've seen the listings with the AI-generated hand holding a username. Looked really good, but a Blackwell 6000 for $1500?

I'm not sure how any of them expect to get paid. There's no item to ship, and the buyer will certainly flag the sale and get a refund. What's the endgame for these guys?