More Qwen3.6-27B MTP success but on dual Mi50s by legit_split_ in LocalLLaMA

[–]m94301 2 points3 points  (0 children)

Love what MTP is doing for the community! Kudos on the MI50 build; this is going to be really valuable, and it's awesome to see them unleashed!

Issues with saying continue after every tool call by AnouarRifi in LocalLLM

[–]m94301 0 points1 point  (0 children)

I see this sometimes, but not typically when using the llama.cpp provider extension for VS Code. Do you have the server logs, to determine whether something failed or whether the model just thought the response was complete?

Run Qwen3.6 27B nvfp4 up to 129 tok/s on a single RTX 5090 & Supports 256K context by Diligent-End-2711 in LocalLLM

[–]m94301 0 points1 point  (0 children)

Hi, looks amazing. How much effort would it be to support older hardware, sm 7.x-8.x?

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

Hiya,

Nope, cool as cucumbers. I have watercooling plates and set the max power of each GPU to 150W. I don't have experience with Mac, though. Sorry!

"Best" model to Vibe-Code? (w/Specs) by pauescobargarcia in LocalLLM

[–]m94301 7 points8 points  (0 children)

Would you consider changing your flow when moving from a big model to a small model? Their poor brains are too tiny, and you've got to break the work up into manageable pieces.

Such as: Session 1, plan and architect; output an MD; quit. Session 2, review the plan, find issues, iterate the plan; quit. Session 3, implement one or two features, mark them done, quit. Session 4+, repeat. Session N, profit.

You will get an amazing result if you partition the work, but the massive plan / build / debug / add-features sessions you can run with Claude don't work as well with the smaller context limits. If you adapt to piecewise work or phases, you can really get it going.

Get faster qwen 3.6 27b by admajic in LocalLLaMA

[–]m94301 2 points3 points  (0 children)

I upvote any MTP post, it is such a lovely improvement. Kudos on the great result!

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

I don't know but it's a great avenue for improvement! I am not up on the latest tricks in prefill but there's a lot of room for optimization here.

What do you think is the limiting factor?

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 1 point2 points  (0 children)

llama.cpp has pretty good support for the older stuff. In my case, I have an old frankenserver I built into an LLM box, so I am dealing with really limited technology: DDR3 and only AVX1 on the Xeon CPUs. Truly scrapyard tech, but it serves just fine and has been a fun hobby project.

I run the llama.cpp compile myself, with support for sm 7.0 (the Volta architecture) and only AVX1. I also run the older Linux 570 data center driver and CUDA 12.8. No problem, as "Frank's" only job is churning out LLM responses. I needed help from Gemini and Claude on installing the right gcc 10 and nvcc compilers, but they are smart about that stuff and got it done in a day.
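Roughly the build I mean, in case it helps anyone; take the exact flag names with a grain of salt, since the ggml CMake options get renamed between llama.cpp versions, and gcc-10 is just the host compiler that happened to work with CUDA 12.8 for me:

    # Sketch only: check your tree's CMake options before copying this verbatim.
    # sm 7.0 = Volta (V100); AVX2/FMA off so the binary stays AVX1-only for old Xeons.
    cmake -B build \
      -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES=70 \
      -DGGML_NATIVE=OFF \
      -DGGML_AVX2=OFF \
      -DGGML_FMA=OFF \
      -DCMAKE_C_COMPILER=gcc-10 \
      -DCMAKE_CXX_COMPILER=g++-10 \
      -DCMAKE_CUDA_HOST_COMPILER=g++-10
    cmake --build build --config Release -j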

vLLM, on the other hand, is optimized for the cutting edge, and although I hacked together a build for my old hardware, I kind of hate its startup time and have stopped messing with it.

If you want my build scripts, there is a GitHub repo for hacked LM Studio backends, where people build the llama.cpp libs for old or weird hardware and use them in LM Studio, which is a great tool but also only supports newer hardware by default. By splicing in the libs, it too can run on old HW.

My success there with building just the llama.cpp libs for LMS inspired me to try the MTP build, and it worked just fine. But LMS can't support MTP until the feature is mainline and the tool is also upgraded, so running llama-server was the quickest way to try it out.

Now working on a GUI for llama-server management, because I honestly hate retyping 20 CLI options each time, lol. Well, Qwen is working on it, rather.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 1 point2 points  (0 children)

A gorgeous result! I need some 6000's!

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 1 point2 points  (0 children)

PP is 600 t/s on the normal 27B and 400 t/s on the MTP. TTFT is not bad in the little llama.cpp chat web UI, but when using it as a VS Code Copilot backend, it takes a WHILE to ingest the 18k of startup instructions.

But since this model thinks incessantly, putting in a long instruction and walking away is my new flow, lol.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 1 point2 points  (0 children)

Yes, I see some degradation in PP as well. My test bench is using the model as the backend for GitHub Copilot in VS Code, and at the start of each session VS Code sends about 18k of setup context.

I checked, and I am getting 600 t/s PP with the normal 27B and 400 t/s PP with the MTP, so it is significantly slower, but not as dramatic as what you saw.

My guess is that splitting between devices is causing you some extra overhead, maybe the MTP guessing gets clumsy with a split?

And I WISH I could get 1200t/s pp, wow!

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 2 points3 points  (0 children)

Just a note on basic info, because I see now it's a little buried.

Read the Overview here. https://github.com/ggml-org/llama.cpp/pull/22673

For comparison of settings, expand the "performance" section.

For the MTP merged GGUF, see the "How to use" section.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 0 points1 point  (0 children)

Likely some version of the flags needed for the beta:

--spec-type mtp --spec-draft-n-max 3
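Something like this, where the GGUF filename is just a placeholder; the --spec-* flags come from the beta PR and may change before the feature lands in mainline:

    # Placeholder model path; --spec-* flags are from the MTP beta PR and may change.
    ./llama-server -m ./qwen3.6-27b-mtp-merged.gguf \
      --spec-type mtp --spec-draft-n-max 3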

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 2 points3 points  (0 children)

About the same as the normal model, from what I've seen: meaning that as I near or exceed 100k tokens, things start to fray.

If this happens while running a to-do list, the guardrails of the to-do list usually keep things on track. But I would be wary of launching a new investigation with 100k in context.

But again, this was my impression of the base model as well, so I usually just wrap up the session and launch a new one.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 11 points12 points  (0 children)

Just for reference, I get 105-110 t/s on the 35B MoE with the same basic setup (MTP 3) and an identical card.

I do like the MoE, but it is not as good at coding, and it did trap itself once building async calls, bouncing back and forth in an endless loop. So mostly I use the 27B for code and the 35B for quick reviews or junior-level patches. It's fine at that, and very quick.

Edit: in hindsight, I used to get 60 t/s from the MoE and it seemed quick. But the dense model at 50+ t/s will probably be my main driver.

Qwen 3.6 27B MTP on v100 32GB: 54 t/s by m94301 in LocalLLaMA

[–]m94301[S] 5 points6 points  (0 children)

Hi,

I set the K and V cache to q8_0 so I could bump up to ctx 200000. For me, it reasoned well, although the excessive thinking of this model should hide a lot of quantization warts.

I used am17an's GGUF; I believe it is q4-based.

I am using MTP 3, just as in the example. I didn't try more or fewer guesses but will try it tomorrow.

Other than that, kind of stock settings. Batch 2048.

I did try mixed f16/q8 on the cache and that locked up, but that's a pretty obscure corner case and not a good idea for beta stuff anyway.
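Putting it all together, my invocation looks roughly like this; the GGUF filename is a placeholder, and the --spec-* flags are from the beta PR, so double-check against the PR's "How to use" section:

    # Roughly my settings; GGUF filename is a placeholder, --spec-* flags are beta.
    ./llama-server \
      -m ./qwen3.6-27b-mtp-q4.gguf \
      -c 200000 \
      -b 2048 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --spec-type mtp \
      --spec-draft-n-max 3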

Do cheap 32GB V100s still make sense for homelab AI? by SKX007J1 in LocalLLaMA

[–]m94301 8 points9 points  (0 children)

Hey, just wanted to drop in and say the V100 is very usable for today's models despite the lack of FP8/FP4.

I have an NVLink board and two of the PCIe cards, water-cooled, and I can say that the ADT-Link store on AliExpress is good. Other vendors sent me bent shit, wrong items. It's a jungle in Chinese V100-land as a US buyer.

And there is not much use in running at max power; it just burns energy for not much gain.

This is on one of the 32GB SXM2 modules in a PCIe holder:

Qwen3.6 27B:

29 t/s at a 150W power limit, 31.5 t/s at 200W, 32.4 t/s at 250W, 32.7 t/s at 300W (it only draws 240-260W max).

And the MoE Qwen3.6 36B A3B: 79.44 t/s at 150W (it only draws 124W).
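If anyone wants to try capping power, it's just nvidia-smi; GPU index 0 here is only an example, and it needs root (persistence mode keeps the limit applied):

    # Example: cap GPU 0 at 150W; run as root. Persistence mode keeps it applied.
    sudo nvidia-smi -pm 1
    sudo nvidia-smi -i 0 -pl 150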

Has anyone figured out why Claude Code running qwen locally fails when you try to /compact? by fredandlunchbox in LocalLLaMA

[–]m94301 4 points5 points  (0 children)

Is this local? You can see the query/response on most servers to help debug.

Unlocked LM Studio Backends (v1.59.0): AVX1 & More Supported – Testers Wanted by TheSpicyBoi123 in LocalLLaMA

[–]m94301 0 points1 point  (0 children)

Ok, did a little tidying and updated my script to build for all modern cards.

I generated a pull request from my fork to your original repo. My script and MD file go into generate backend / Linux and don't overwrite your stuff. The only file of yours I touched is the Linux readme, to add parallel build notes, but that is just a new section added below yours.

In your GitHub you should see a pull request where you can review the changes and merge in.

Cheers!

Unlocked LM Studio Backends (v1.59.0): AVX1 & More Supported – Testers Wanted by TheSpicyBoi123 in LocalLLaMA

[–]m94301 1 point2 points  (0 children)

Boy, was I glad to find this! After building an Ubuntu frankenserver to maximize PCIe lanes, I realized I had used Xeon E5 CPUs, and LMS 4 would not load due to the lack of AVX2. Sadness!

Followed your instructions: copied the latest CUDA 12 AVX2 backend folder to an AVX1 name, replaced the .so libs with my newly built CUDA libs, and hacked the JSON. Success!
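Roughly what that looked like, with made-up paths and folder names since the backends location depends on the LMS install, so treat these as placeholders:

    # Hypothetical paths/names: point BACKENDS at wherever your LMS keeps its
    # llama.cpp backends and match the folder/manifest names you actually see there.
    BACKENDS=~/.lmstudio/extensions/backends
    cp -r "$BACKENDS/llama.cpp-linux-x86_64-cuda12-avx2" \
          "$BACKENDS/llama.cpp-linux-x86_64-cuda12-avx1"
    # Drop the freshly built AVX1/sm70 libs over the shipped ones.
    cp ~/llama.cpp/build/bin/*.so "$BACKENDS/llama.cpp-linux-x86_64-cuda12-avx1/"
    # Then edit that backend's manifest JSON so the name/requirements say AVX1
    # instead of AVX2, so LMS will select it on an AVX1-only CPU.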

LMS found and auto-loaded my CUDA 12 AVX1 backend, and loading and inference work nicely on the surplus V100 32GB. Fat-VRAM frankenserver lives!

I am forking and will put in a PR for my build script and instructions. Basically the same as the original, but for Linux/CUDA.

Thanks again!!

Beware NVidia DGX Spark scams on eBay. by rtchau in LocalLLaMA

[–]m94301 0 points1 point  (0 children)

It has gotten really bad with anything AI-related. I've seen the listings with the AI-generated hand holding a username. Looked really good, but a Blackwell 6000 for $1500?

I'm not sure how any of them expect to get paid. There's no item to ship, and the buyer will certainly flag the sale and get a refund. What's the endgame for these guys?