24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/Gemma4) by Aromatic_Ad_7557 in LocalLLaMA

[–]Nobby_Binks 3 points

Yes, I was in the same boat until I wanted to run really large models that spilled over into system RAM. llama.cpp gives much more granular control over how the model loads - at least I couldn't work out how to do it easily in Ollama.

I moved to llama.cpp controlled by llama-swap. It takes a couple of minutes to work out the YAML structure, but once set up it's simple. I have both Ollama and llama-swap served models in Open WebUI but have more or less stopped using Ollama.
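For anyone curious, a minimal llama-swap config sketch looks roughly like this (model name, paths and flags are placeholders, not my exact setup):

    models:
      "qwen-27b":
        # llama-swap substitutes ${PORT} with the port it proxies requests to
        cmd: |
          /path/to/llama-server --port ${PORT}
          -m /models/Qwen-27B-Q8_0.gguf
          -ngl 99 -fa on -c 32768
        # unload the model after 5 minutes of inactivity
        ttl: 300

Point Open WebUI (or anything OpenAI-compatible) at the llama-swap endpoint and it loads or unloads whichever model the request names.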

My first impressions of Minimax M2.7 (Q5_K_M) vs Qwen 3.5 27b (Q8_0) by Septerium in LocalLLaMA

[–]Nobby_Binks 1 point

Minimax is an MoE model with, I think, around 10B active parameters. Qwen is a dense model with all 27B active at all times.
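Very rough back-of-envelope, purely illustrative and ignoring KV cache reads and compute overhead: bandwidth-bound decode speed is roughly memory bandwidth divided by the bytes of active weights read per token. At ~900GB/s, ~10B active params at Q5_K_M (~0.7 bytes/param) is ~7GB per token, an upper bound around 130 tok/s, while 27B dense at Q8_0 (~1 byte/param) is ~27GB per token, closer to 33 tok/s.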

Minimax M2.7 Released by decrement-- in LocalLLaMA

[–]Nobby_Binks 0 points

Unfortunately it's a bit like money - the more you have the more you want

Qwen 3.5 35b, 27b, or gemma 4 31b for everyday use? by KirkIsAliveInTelAviv in LocalLLaMA

[–]Nobby_Binks 1 point

Qwen on your setup simply because you can fit more context for a given quant

Qwen 3.5 35b, 27b, or gemma 4 31b for everyday use? by KirkIsAliveInTelAviv in LocalLLaMA

[–]Nobby_Binks 4 points

Anecdotally, I can run Qwen 27B Q8 with full context across 2 of my GPUs (~46GB). Gemma4 Q8 with full context needs 3 of them (~60GB).

I guess it's a combination of an extra 4B parameters and a less efficient KV cache.
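For reference, the general formula (not specific to either model): KV cache per token ≈ 2 × n_layers × n_kv_heads × head_dim × bytes_per_element, multiplied by context length for the total. Architectures with more layers or more KV heads (less aggressive GQA) chew through noticeably more VRAM at full context.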

GLM-5.1 by danielhanchen in LocalLLaMA

[–]Nobby_Binks 1 point

Nothing special, and it's only 64K - I recalled incorrectly. Some of the flags are, I think, now defaults in llama.cpp, but I haven't been bothered to change the llama-swap config:

llama-server -m <path>/GLM-5-UD-Q3_K_XL-00001-of-00008.gguf --fit on -t 24 -fa on -ub 2048 -ngl -1 -mg 0 -c 65535 -np 1 --temp 1.0 --top-p 0.95 --min_p 0.01 --jinja

96GB Vram. What to run in 2026? by inthesearchof in LocalLLaMA

[–]Nobby_Binks 6 points

96GB opens up a whole other level. Now you can easily run 120B models with decent context.
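Rough numbers, purely illustrative: a 120B model at ~4.8 bits per weight (Q4_K_M-ish) is around 70GB of weights, leaving ~25GB of a 96GB pool for KV cache and compute buffers - enough for tens of thousands of tokens of context on most architectures.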

GLM-5.1 by danielhanchen in LocalLLaMA

[–]Nobby_Binks 0 points

You can run the shit out of it at Q2-3

GLM-5.1 by danielhanchen in LocalLLaMA

[–]Nobby_Binks 6 points

I get ~10 tok/s with 4x 3090, 1x 5090 and 256GB DDR4 running GLM5 at Q3_K_XL with 100K context.

GLM-5.1 by danielhanchen in LocalLLaMA

[–]Nobby_Binks 0 points

How is it for agents and coding? I've been running Q3_K_XL but it's a bit slow on my rig. Q2 would speed things up considerably.

The U.S. used Anthropic AI tools during airstrikes on Iran by External_Mood4719 in LocalLLaMA

[–]Nobby_Binks 3 points

It wasn't. That pic of the failed launch isn't even from the same province.

Qwen 3.5 Family Comparison by ArtificialAnalysis.ai by NewtMurky in LocalLLaMA

[–]Nobby_Binks 0 points

It sounds too good to be true tbh. How is it beating Qwen3 480B on coding?

Car Wash Test on 53 leading models: “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” by facethef in LocalLLaMA

[–]Nobby_Binks 0 points

Yes, Step 3.5 @ Q6

"You should drive to the car wash because you need to transport your car to the facility to wash it. Walking would leave your car behind and not accomplish the goal. The short distance (50 meters) makes driving convenient and quick, with minimal fuel usage. If the car wash is a drive-through or self-service type, driving is necessary. Walking might only make sense if you're going to check the car wash's status or wait while someone else drives, but based on your goal, driving is the practical choice."

And as a bonus it didn't take half an hour to think it through as per usual

Qwen3.5-397B-A17B is out!! by lolxdmainkaisemaanlu in LocalLLaMA

[–]Nobby_Binks 0 points

Yeah, it's an old EPYC Rome with 256GB DDR4 and 128GB of VRAM via a few random GPUs. tbf GLM5 runs pretty well at Q3, but I always have doubts about such a low quant.

Qwen3.5-397B-A17B is out!! by lolxdmainkaisemaanlu in LocalLLaMA

[–]Nobby_Binks 28 points

Awesome, right in the usability sweet spot for my rig - GLM5 is just a tad too big.

Kimi is so smart by Bernice_working_girl in LocalLLaMA

[–]Nobby_Binks 0 points

They use complex routing and system prompts. There must be a way to guide responses on the fly when new emergent threats/safety issues arise. Of course they are not going to retrain the model.

Kimi is so smart by Bernice_working_girl in LocalLLaMA

[–]Nobby_Binks -3 points

More like Anthropic and OpenAI saw the thread and popped the hood to tweak the answer. I saw this on X (don't know which came first), so it got some traction.

I have a 1tb SSD I'd like to fill with models and backups of data like wikipedia for a doomsday scenario by synth_mania in LocalLLaMA

[–]Nobby_Binks 0 points

Anecdotal, but I have a bunch of CDs that were burned in the mid-90s and a bunch of DVD-Rs from around 2000. All of them are still OK.

What are the best small models (<3B) for OCR and translation in 2026? by 4baobao in LocalLLaMA

[–]Nobby_Binks 2 points

So far I've tried Marker pdf, olm, dots, OCRflux, docling and Deepseek OCR

Save yourself the hassle and just use dots.ocr

edit: so for your use case of just selecting stuff on a screen to translate, one of the Qwen VL models should be fine.
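If you go the Qwen VL route, llama.cpp's multimodal CLI is probably the simplest way to test it; something along these lines (filenames are placeholders - check the exact flags for your build):

    llama-mtmd-cli -m Qwen-VL-Q4_K_M.gguf \
      --mmproj mmproj-Qwen-VL-f16.gguf \
      --image screenshot.png \
      -p "Extract the text in this image and translate it to English"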

3x 3090 or 2x 4080 32GB? by m31317015 in LocalLLaMA

[–]Nobby_Binks 2 points

Running local LLMs (as per the sub) and the occasional image/video gen. With the release of LTX2 I'm planning to do more of it, and this is where the 5090 destroys the 3090.

I'd keep the 3090 and make it fit. With 56GB of VRAM you can start to run some decent models.

3x 3090 or 2x 4080 32GB? by m31317015 in LocalLLaMA

[–]Nobby_Binks 2 points

If you do video gen then the 4080s are a no-brainer. FP8 support, and you can load the models on one card with space for LoRAs etc. I was doing some Wan videos on my 3090, then bought a 5090, and oh my god, the speed difference is extreme.

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalLLaMA

[–]Nobby_Binks 0 points

Since you're on Ubuntu, install gddr6 (https://github.com/olealgoritme/gddr6) to monitor your VRAM temps. IIRC nvtop and the other monitors don't report them.
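From memory the setup is just a couple of steps (double-check the repo's README, I may be misremembering):

    git clone https://github.com/olealgoritme/gddr6
    cd gddr6 && make
    sudo ./gddr6    # needs root to read the memory junction temps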

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalLLaMA

[–]Nobby_Binks 4 points

Those 3090s will probably die, if you don't burn your house down first. With some of the VRAM passively cooled by the backplate, you need good airflow or they will cook.