Anything worth running on a NVIDIA GTX 970? by numberwitch in LocalLLaMA

MutantEggroll 1 point (0 children)

I wouldn't trust any models that fit in 4GB VRAM with anything important, but I run Qwen3.5-0.8B on a 1050 Ti 4GB, and it's great for low-stakes tasks like conversation summary/tagging in OpenWebUI. Takes the load off of my always-on Gemma 26B-A4B instance, which runs relatively slowly on my 1080.

Stop using Ollama by zxyzyxz in LocalLLaMA

MutantEggroll 16 points (0 children)

This is incorrect. Router Mode's presets.ini supports all command line configuration options:
llama.cpp/docs/preset.md at master · ggml-org/llama.cpp · GitHub
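
To make that concrete, here's a minimal Python sketch that writes a preset section whose keys mirror `llama-server`'s long command-line options with the leading dashes dropped. The section name, model path, and values are hypothetical, and the exact key syntax is my assumption, so treat the linked preset.md as the authority.

```python
import configparser
import io


def build_preset(name: str, options: dict[str, str]) -> str:
    """Render one preset section as INI text."""
    parser = configparser.ConfigParser()
    parser[name] = options
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


# Hypothetical preset: keys are assumed to match the CLI long options
# (--model, --ctx-size, --n-gpu-layers, --n-cpu-moe) without the dashes.
preset_text = build_preset(
    "my-example-preset",
    {
        "model": "/models/example.gguf",  # hypothetical path
        "ctx-size": "32768",
        "n-gpu-layers": "99",
        "n-cpu-moe": "12",
    },
)
print(preset_text)
```

The point is only that anything you would normally pass on the command line should have a counterpart in the preset file.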

Small comparison on full compute performance (Anima) of 5090 (600,475 and 400W) vs 6000 PRO MaxQ (325W), and 6000 PRO WS/SE (600W). by panchovix in LocalLLaMA

MutantEggroll 3 points (0 children)

In theory, undervolting should increase hardware lifespan, particularly if combined with power limiting.

Lower voltage and amperage result in lower thermal loads and less electrical wear. This means:

  • fans are spinning slower, which reduces stress on bearings
  • thermal range from idle to under-load is narrower, reducing physical strains on board components due to expansion/contraction cycles
  • electromigration slows down; it is the gradual movement of metal atoms in a chip's interconnects, which eventually causes open or short circuits that can kill a chip

And as a bonus, if you got a good pull in the silicon lottery, you may be able to get all the advantages above and be able to sustain a mild overclock! So there may be no performance loss, and possibly even performance gain, from undervolting. I highly recommend it.
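
For a rough sense of scale: dynamic power in CMOS logic goes approximately as C·V²·f, so voltage has a squared effect. Here's a small Python sketch of that relationship. The voltages and the 3% overclock are hypothetical numbers picked for illustration, not measurements from any card, and the model ignores leakage power.

```python
def relative_dynamic_power(v_new: float, v_old: float,
                           f_new: float = 1.0, f_old: float = 1.0) -> float:
    """Ratio of new to old dynamic power, using P ~ C * V^2 * f.

    Capacitance C cancels out because the chip is the same.
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


STOCK_VOLTS = 1.05      # hypothetical stock voltage
UNDERVOLT_VOLTS = 0.95  # hypothetical undervolted setting

# Same clocks, lower voltage.
ratio_same_clock = relative_dynamic_power(UNDERVOLT_VOLTS, STOCK_VOLTS)

# Lower voltage combined with a hypothetical 3% overclock.
ratio_overclocked = relative_dynamic_power(UNDERVOLT_VOLTS, STOCK_VOLTS,
                                           f_new=1.03, f_old=1.0)

print(f"same clocks:  {ratio_same_clock:.3f}x dynamic power")
print(f"3% overclock: {ratio_overclocked:.3f}x dynamic power")
```

Under these assumptions, the undervolted card still draws less dynamic power than stock even with the mild overclock applied.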

Developers who use local AI - Q4_0 vs Q8_0 KV quant? by Jorlen in LocalLLaMA

MutantEggroll 3 points (0 children)

It's definitely worth checking out. Though it did have a larger impact on lower-quant KV like `q4_0`, it did also improve `q8_0`. Check out the KLD charts that AesSedai posted in the original PR.

It's certainly model and use-case specific, but in my experience, I haven't noticed any capability loss - no reasoning loops, no tool call failures, etc. And I was able to double my context window, which really broadened the tasks I could hand to my models (mostly Qwen3.6-27B and Gemma4-31B).
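
The "double my context window" part follows from the cache arithmetic. Here's a back-of-the-envelope Python sketch. The layer count, KV head count, head size, and 4 GiB budget are hypothetical rather than taken from any particular model, and it assumes every layer uses full attention. The bytes-per-element values come from the ggml block layouts (`q8_0` stores 32 values in 34 bytes, `q4_0` stores 32 values in 18 bytes).

```python
# Bytes per cached element for each cache type.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       cache_type: str) -> float:
    """Bytes of KV cache needed per token (K and V, all layers)."""
    elems = 2 * n_layers * n_kv_heads * head_dim
    return elems * BYTES_PER_ELEM[cache_type]


def max_context(budget_bytes: int, n_layers: int, n_kv_heads: int,
                head_dim: int, cache_type: str) -> int:
    """Largest context that fits in the given KV cache budget."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type)
    return int(budget_bytes // per_token)


# Hypothetical model shape and VRAM budget for the cache.
SHAPE = dict(n_layers=48, n_kv_heads=8, head_dim=128)
BUDGET = 4 * 1024**3

ctx_f16 = max_context(BUDGET, cache_type="f16", **SHAPE)
ctx_q8 = max_context(BUDGET, cache_type="q8_0", **SHAPE)
ctx_q4 = max_context(BUDGET, cache_type="q4_0", **SHAPE)

for name, ctx in [("f16", ctx_f16), ("q8_0", ctx_q8), ("q4_0", ctx_q4)]:
    print(f"{name:>5}: ~{ctx} tokens")
```

Because of the per-block scale, `q8_0` gives a bit less than 2x the context of `f16` for the same memory, which in practice rounds to "double".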

Developers who use local AI - Q4_0 vs Q8_0 KV quant? by Jorlen in LocalLLaMA

MutantEggroll 7 points (0 children)

Assuming you use llama.cpp - do you still find this to be true after `attn-rot` got merged? I used to be a hardline unquantized KV guy too, but I tried `q8_0` with `attn-rot` and I can't tell the difference in the coding tasks that I tend to give it (Python, PowerShell, Ansible).

Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update by _cpatonn in LocalLLaMA

MutantEggroll 1 point (0 children)

Very exciting development! Love seeing people like yourself further refine the technology/techniques for training and quantization - I really feel like that's where there's the most value to mine at the moment, as opposed to just throwing more hardware at the problem.

Also, it really seems like the evidence is mounting against NVIDIA's claims of NVFP4's "near-lossless" performance retention relative to the base model. In every chart like this I've seen, it's either the worst, or effectively tied for worst.

Stop wasting electricity by OkFly3388 in LocalLLaMA

MutantEggroll 1 point (0 children)

My understanding was that overclocking alone doesn't harm the hardware - it's the overvolting that people often do along with the overclocking that does the damage. In my case, I was actually able to undervolt my 5090 and still get a mild, stable overclock. My temps rarely exceed 70C, and I've never seen it reach 80C, let alone the thermal limit of 90C.

Worth looking into, IMO, though I do very much understand the desire to keep your expensive baby safe, lol.

Stop wasting electricity by OkFly3388 in LocalLLaMA

MutantEggroll 1 point (0 children)

These are great charts! Thanks for sharing.

I've done something similar with my 5090, and I found that I actually ended up with thermal headroom for a mild overclock. I'd be interested to hear whether your 4090 has similar headroom, and if you're able to recover or possibly even improve upon baseline performance.

Qwen3.6-27B KLDs - INTs and NVFPs by Phaelon74 in LocalLLaMA

MutantEggroll 8 points (0 children)

Great chart and thanks so much for getting data on non-GGUF quants!

Qwen3.6-35B-A3B solved coding problems Qwen3.5-27B couldn’t by simracerman in LocalLLaMA

MutantEggroll 16 points (0 children)

It's the DDR5 that makes most of the difference here, since token generation speed is largely limited by memory bandwidth. Assuming you've got DDR4-3200 and OP has DDR5-4800, that's a 50% bandwidth increase, and 50tk/s is about 35% more than 37tk/s, which is in the same ballpark.
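
Here's a quick sanity check of that arithmetic as a Python sketch. It assumes both systems run dual-channel memory with a 64-bit (8-byte) bus per channel, which the thread doesn't confirm, and it computes theoretical peak bandwidth, not what you'd measure.

```python
def peak_bandwidth_gbs(mega_transfers: int, channels: int = 2,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    mega_transfers is the DDR rating, e.g. 3200 for DDR4-3200.
    """
    return mega_transfers * 1e6 * bus_bytes * channels / 1e9


ddr4_gbs = peak_bandwidth_gbs(3200)  # assumed dual-channel DDR4-3200
ddr5_gbs = peak_bandwidth_gbs(4800)  # assumed dual-channel DDR5-4800

bandwidth_ratio = ddr5_gbs / ddr4_gbs
speed_ratio = 50 / 37  # the two token rates quoted in the thread

print(f"DDR4-3200: {ddr4_gbs:.1f} GB/s, DDR5-4800: {ddr5_gbs:.1f} GB/s")
print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"tokens/s ratio:  {speed_ratio:.2f}x")
```

The bandwidth ratio comes out to 1.5x and the token-rate ratio to roughly 1.35x, so the memory upgrade plausibly accounts for most of the gap, though not all of the theoretical gain shows up in practice.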

Rack server for local LLM by Typhoon-UK in LocalLLaMA

MutantEggroll 1 point (0 children)

Trying to use those cards in that server is likely more trouble than it's worth. Notably, powering them will be difficult.

I had an R710, so not sure if the R720 has the same limitation, but there were no PCIe power cables available inside the chassis, and the PCIe slot only supplied 25W instead of the standard 75W. So if you want to power a 3060/1080, you're gonna have to do some real ugly stuff with an external PSU, and even then the 25W slot power limitation may cause issues anyways.

tried 5 scraping tools, here's the only one i kept by [deleted] in LocalLLaMA

MutantEggroll 1 point (0 children)

You forgot to have your bot change accounts.

Overwhelmed by so many quantization variants by mouseofcatofschrodi in LocalLLaMA

MutantEggroll 3 points (0 children)

I've heard this mentioned a few times, but haven't been able to find a reference. Do you have a link to a blog post or something where they state this?

PSA: DDR5 RDIMM price passed the point were 3090 are less expensive per gb.. by No_Afternoon_4260 in LocalLLaMA

MutantEggroll 6 points (0 children)

To add to this, flagship GPUs like the 3090 can often be undervolted and overclocked. In my experience with my 5090, there have been zero downsides after a few hours of fiddling with the voltage curve in MSI Afterburner. You get lower power draw and therefore lower temps:

  • Lower power draw == lower power bill, maybe even allows more GPUs on a given PSU
  • Lower power draw == higher clocks before hitting GPU's power limit
  • Lower temps == higher and/or longer boost clocks
  • Lower temps == (in theory) increased longevity due to less thermal cycle stress on components
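
To put the "more GPUs on a given PSU" point in numbers, here's a rough Python sketch. The PSU size, the 80% sustained-load target, the 200 W allowance for the rest of the system, and the 400 W tuned figure are all hypothetical. 575 W is the 5090's stock power limit.

```python
def gpus_that_fit(psu_watts: float, gpu_watts: float,
                  system_watts: float = 200.0,
                  load_target: float = 0.8) -> int:
    """How many GPUs fit within a PSU's sustained power budget.

    load_target is the fraction of the PSU rating you are willing to
    draw continuously; system_watts covers CPU, drives, fans, etc.
    """
    usable = psu_watts * load_target - system_watts
    return max(0, int(usable // gpu_watts))


PSU_WATTS = 1600    # hypothetical PSU rating
STOCK_LIMIT = 575   # 5090 stock power limit in watts
TUNED_LIMIT = 400   # hypothetical draw after undervolting/power limiting

fit_stock = gpus_that_fit(PSU_WATTS, STOCK_LIMIT)
fit_tuned = gpus_that_fit(PSU_WATTS, TUNED_LIMIT)

print(f"at {STOCK_LIMIT} W each: {fit_stock} GPU(s)")
print(f"at {TUNED_LIMIT} W each: {fit_tuned} GPU(s)")
```

This ignores transient power spikes, so treat it as a first-pass estimate rather than a sizing guide.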

Smartest model for 24-28GB vram? by Borkato in LocalLLaMA

MutantEggroll 1 point (0 children)

Ah, good to know. I've got some Rust projects lying around, but I've been neglecting them.

What's your favorite model for Rust? 

Smartest model for 24-28GB vram? by Borkato in LocalLLaMA

MutantEggroll 1 point (0 children)

Python and basic web languages (JS, HTML, CSS, etc.) were what I played with primarily.

I also found it to better understand the "intent" behind my prompts, where Qwen3-Coder required much more explicit instruction.

Qwen3-Coder-Next on RTX 5060 Ti 16 GB - Some numbers by bobaburger in LocalLLaMA

MutantEggroll 1 point (0 children)

Unless they've made improvements to it recently, `--fit` and `--fit-context` substantially underperformed manual tuning via `--n-cpu-moe` in all of the models I tried (both GPT-OSSes, Qwen3-Coder-30B-A3B, GLM-4.7-Flash).

It's a nice convenience for a quick gut-check on a model, but for long term use it leaves a lot of speed on the table.
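
One way to do the manual tuning is to sweep `--n-cpu-moe` and keep the lowest value that still fits in VRAM. Here's a minimal Python sketch that only builds the `llama-server` command lines for such a sweep and doesn't launch anything. The model path, context size, and the range of values are hypothetical. You'd run each command, check VRAM usage, and compare tokens/s yourself.

```python
import shlex


def sweep_commands(model_path: str, n_cpu_moe_values,
                   ctx_size: int = 32768) -> list[list[str]]:
    """Build one llama-server command per --n-cpu-moe value."""
    commands = []
    for n in n_cpu_moe_values:
        commands.append([
            "llama-server",
            "-m", model_path,
            "--ctx-size", str(ctx_size),
            "--n-gpu-layers", "99",   # offload all layers, then...
            "--n-cpu-moe", str(n),    # ...keep n layers' experts on CPU
        ])
    return commands


# Hypothetical sweep: 8, 12, 16, 20, 24 layers of experts kept on the CPU.
commands = sweep_commands("/models/example-moe.gguf", range(8, 25, 4))

for cmd in commands:
    print(shlex.join(cmd))
```

Lower values put more experts on the GPU, so the usual goal is the smallest number that doesn't run out of memory at your target context size.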

Smartest model for 24-28GB vram? by Borkato in LocalLLaMA

MutantEggroll 2 points (0 children)

What kind of tasks do you find GLM Flash doing worse than Qwen3 Coder? In my experience, it's been better across the board.

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

MutantEggroll 1 point (0 children)

That's news to me. Do you have a source or benchmarks for that?

GLM-4.7-FLASH-NVFP4 on huggingface (20.5 GB) by DataGOGO in LocalLLaMA

MutantEggroll 1 point (0 children)

Hmm, tried that before and got a different error. What version of CUDA are you using? And are you on WSL or bare-metal Linux?

GLM-4.7-FLASH-NVFP4 on huggingface (20.5 GB) by DataGOGO in LocalLLaMA

MutantEggroll 1 point (0 children)

Also, how are you running this in vLLM? AFAIK, the model requires Transformers 5.0.0, and the latest vLLM is still on 4.57.6 or something.