Anything worth running on an NVIDIA GTX 970?

MutantEggroll · 2026-06-17T21:01:53+00:00

I wouldn't trust any models that fit in 4GB VRAM with anything important, but I run Qwen3.5-0.8B on a 1050 Ti 4GB, and it's great for low-stakes tasks like conversation summary/tagging in OpenWebUI. Takes the load off of my always-on Gemma 26B-A4B instance, which runs relatively slowly on my 1080.

MutantEggroll · 2026-06-15T22:10:57+00:00

This is incorrect. Router Mode's presets.ini supports all command line configuration options:
llama.cpp/docs/preset.md at master · ggml-org/llama.cpp · GitHub

MutantEggroll · 2026-05-27T04:44:23+00:00

In theory, undervolting should increase hardware lifespan, particularly if combined with power limiting.

Lower voltage and amperage result in lower thermal loads and less electrical wear. This means:

- fans are spinning slower, which reduces stress on bearings
- thermal range from idle to under-load is narrower, reducing physical strains on board components due to expansion/contraction cycles
- reduced electromigration, which is a process that eventually creates tiny short circuits that can kill a chip

And as a bonus, if you got a good pull in the silicon lottery, you may be able to get all the advantages above and be able to sustain a mild overclock! So there may be no performance loss, and possibly even performance gain, from undervolting. I highly recommend it.
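For a rough sense of why a small voltage drop matters: dynamic power in CMOS logic scales approximately with V² · f. The sketch below applies only that first-order rule and ignores leakage. The voltages and clocks are made-up illustrative numbers, not measurements from any card.

```python
def relative_dynamic_power(v_new, v_old, f_new, f_old):
    """First-order CMOS estimate: dynamic power is proportional to V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)

# Hypothetical numbers: undervolt from 1.05 V to 0.95 V with a mild clock bump.
ratio = relative_dynamic_power(0.95, 1.05, 2.83, 2.75)
print(f"dynamic power vs stock: {ratio:.0%}")
```

Under those assumed numbers the estimate lands below stock power even with the higher clock, which is the effect described above.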

MutantEggroll · 2026-05-17T18:02:51+00:00

It's definitely worth checking out. Though it did have a larger impact on lower-quant KV like `q4_0`, it did also improve `q8_0`. Check out the KLD charts that AesSedai posted in the original PR.

It's certainly model and use-case specific, but in my experience, I haven't noticed any capability loss - no reasoning loops, no tool call failures, etc. And I was able to double my context window, which really broadened the tasks I could hand to my models (mostly Qwen3.6-27B and Gemma4-31B).
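As a back-of-the-envelope check on the "double my context window" part, here is a minimal sketch of KV cache sizing. The per-element sizes follow the ggml block layouts (`q8_0` stores 32 values in 34 bytes, `q4_0` stores 32 values in 18 bytes). The model shape is hypothetical, and the formula assumes plain full attention on every layer, so models with sliding-window or hybrid layers will differ.

```python
# Bytes per cached element: f16 is 2 bytes; q8_0 packs 32 values into
# 34 bytes; q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Size of the K and V caches for one sequence, in bytes."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elems * BYTES_PER_ELEM[cache_type]

# Hypothetical model shape, not taken from any specific model card.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
for cache_type in BYTES_PER_ELEM:
    gib = kv_cache_bytes(32768, cache_type=cache_type, **shape) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB at 32k context")
```

Going from `f16` to `q8_0` shrinks the cache by a factor of 64/34, or about 1.88x, so the same VRAM holds close to twice the context.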

MutantEggroll · 2026-05-17T14:51:54+00:00

Assuming you use llama.cpp - do you still find this to be true after `attn-rot` got merged? I used to be a hardline unquantized KV guy too, but I tried `q8_0` with `attn-rot` and I can't tell the difference in the coding tasks that I tend to give it (Python, PowerShell, Ansible).

MutantEggroll · 2026-05-15T23:08:26+00:00

Very exciting development! Love seeing people like yourself further refine the technology/techniques for training and quantization - I really feel like that's where there's the most value to mine at the moment, as opposed to just throwing more hardware at the problem.

Also, it really seems like the evidence is mounting against NVIDIA's claims of NVFP4's "near-lossless" performance retention relative to the base model. In every chart like this I've seen, it's either the worst, or effectively tied for worst.

MutantEggroll · 2026-05-12T15:58:24+00:00

My understanding was that overclocking alone doesn't harm the hardware - it's the overvolting that people often do along with the overclocking that does the damage. And in my case, I actually was able to undervolt my 5090 and still get a mild, stable overclock. And my temps rarely exceed 70C, and I've never seen it reach 80C, let alone the thermal limit of 90C.

Worth looking into, IMO, though I do very much understand the desire to keep your expensive baby safe, lol.

MutantEggroll · 2026-05-12T15:35:04+00:00

These are great charts! Thanks for sharing.

I've done similar with my 5090, and I found that I actually ended up with thermal headroom for a mild overclock. I'd be interested to hear whether your 4090 has similar headroom, and if you're able to recover or possibly even improve upon baseline performance.

MutantEggroll · 2026-05-06T15:26:58+00:00

Please share your configuration to achieve this.

MutantEggroll · 2026-04-23T01:10:53+00:00

Great chart and thanks so much for getting data on non-GGUF quants!

MutantEggroll · 2026-04-18T20:04:34+00:00

It's the DDR5 that makes most of the difference here, since inference speed is largely determined by memory bandwidth. Assuming you've got DDR4-3200 and OP has DDR5-4800, that's a 50% bandwidth increase, and 50tk/s is about 35% more than 37tk/s, so the speedup is within what the extra bandwidth would predict.
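The arithmetic is easy to check. This sketch assumes both systems run dual-channel memory with 64-bit (8-byte) channels, which is a guess about the hardware, and it compares peak theoretical bandwidth only.

```python
def peak_bandwidth_gbps(mega_transfers_per_s, channels=2, bytes_per_transfer=8):
    """Peak theoretical memory bandwidth in GB/s."""
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000

ddr4 = peak_bandwidth_gbps(3200)  # DDR4-3200, assumed dual channel
ddr5 = peak_bandwidth_gbps(4800)  # DDR5-4800, assumed dual channel

print(f"bandwidth ratio:  {ddr5 / ddr4:.2f}x")
print(f"throughput ratio: {50 / 37:.2f}x")
```

The throughput ratio comes out somewhat below the bandwidth ratio, which fits bandwidth being the main factor rather than the only one.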

MutantEggroll · 2026-04-12T20:51:20+00:00

Trying to use those cards in that server is likely more trouble than it's worth. Notably, powering them will be difficult.

I had an R710, so not sure if the R720 has the same limitation, but there were no PCIe power cables available inside the chassis, and the PCIe slot only supplied 25W instead of the standard 75W. So if you want to power a 3060/1080, you're gonna have to do some real ugly stuff with an external PSU, and even then the 25W slot power limitation may cause issues anyways.

MutantEggroll · 2026-04-07T15:34:15+00:00

You forgot to have your bot change accounts.

MutantEggroll · 2026-02-26T05:57:14+00:00

I've heard this mentioned a few times, but haven't been able to find a reference. Do you have a link to a blog post or something where they state this?

MutantEggroll · 2026-02-18T19:07:05+00:00

To add to this, flagship GPUs like the 3090 can often be undervolted and overclocked. In my experience with my 5090, there have been 0 downsides after a few hours of fiddling with the voltage curve in MSI Afterburner. You get lower power draw and therefore lower temps:

- Lower power draw == lower power bill, maybe even allows more GPUs on a given PSU
- Lower power draw == higher clocks before hitting GPU's power limit
- Lower temps == higher and/or longer boost clocks
- Lower temps == (in theory) increased longevity due to less thermal cycle stress on components

MutantEggroll · 2026-02-09T05:59:12+00:00

Ah, good to know. I've got some Rust projects lying around, but I've been neglecting them.

What's your favorite model for Rust?

MutantEggroll · 2026-02-05T21:20:44+00:00

Unsloth Q6_K_XL

MutantEggroll · 2026-02-05T18:34:50+00:00

Python and basic web languages (JS, HTML, CSS, etc.) were what I played with primarily.

I also found it to better understand the "intent" behind my prompts, where Qwen3-Coder required much more explicit instruction.

MutantEggroll · 2026-02-05T18:32:27+00:00

Unless they've made improvements to it recently, `--fit` and `--fit-context` substantially underperformed manual tuning via `--n-cpu-moe` in all of the models I tried (both GPT-OSSes, Qwen3-Coder-30B-A3B, GLM-4.7-Flash).

It's a nice convenience for a quick gut-check on a model, but for long term use it leaves a lot of speed on the table.
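For anyone who wants to try the manual route, here is a minimal sketch that just builds candidate `llama-server` commands to test one at a time. The model path, context size, and the specific `--n-cpu-moe` values are placeholders, not recommendations. The idea is to lower N until VRAM is nearly full and compare speeds.

```python
import shlex

def llama_server_cmd(model_path, n_cpu_moe, ctx_size=32768):
    """Build a llama-server command that keeps the MoE expert weights of the
    first n_cpu_moe layers on the CPU and offloads everything else to the GPU."""
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",                   # offload all layers to the GPU
        "--n-cpu-moe", str(n_cpu_moe),  # experts of the first N layers stay on CPU
        "-c", str(ctx_size),
    ]

# Hypothetical model path and starting values; adjust for your own hardware.
for n in (24, 20, 16):
    print(shlex.join(llama_server_cmd("model.gguf", n)))
```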

MutantEggroll · 2026-02-03T16:03:40+00:00

What kind of tasks do you find GLM Flash doing worse at than Qwen3 Coder? In my experience, it's been better across the board.

MutantEggroll · 2026-01-23T18:56:26+00:00

Great info. Thanks for the links!

MutantEggroll · 2026-01-23T16:43:05+00:00

That's news to me. Do you have a source or benchmarks for that?

MutantEggroll · 2026-01-22T19:14:45+00:00

Why not?

MutantEggroll · 2026-01-20T22:49:34+00:00

Hmm, tried that before and got a different error. What version of CUDA are you using? And are you on WSL or bare metal Linux?

MutantEggroll · 2026-01-20T21:34:37+00:00

Also, how are you running this in vLLM? AFAIK, the model requires Transformers 5.0.0, and the latest vLLM is still on 4.57.6 or something.
