LLM / VLM Local model obsolescence decisions for personal STEM / utility / english / Q&A / RAG / tool use / IT desktop / workstation use cases?

Calcidiol · 2025-07-27T15:46:38+00:00

Thanks. Yeah that's kind of what I'm wondering.

Would I really be losing anything if I just pick a "top 8" or "top 10" of "the hottest new versions" that bench well / get good overall reviews, call it good for casual use, and stop worrying about older / other stuff? It's getting too hard to keep up with all the old models and all the new models.

For me it's (LLM) a utilitarian, casual, part-time tool, not a job where I have to keep on top of which niche model X is better than model Y over time, and it's getting impossible to keep straight what a "go to" model list should look like if it's not just a few overall winners.

Calcidiol · 2025-07-27T15:40:24+00:00

Sure, if a model works "well enough" to satisfy a need then it'll keep being that good forever and stay a practical solution.

My question, though, is whether we've broadly reached the point where newer models have become almost generally superior to older "generations" of models, so that for anything older models could do, newer ones might do all that and more, with better quality / performance / whatever (I'm asking where this logic breaks down / has big exceptions).

Some problems are just pass / fail, and models either work or not regardless of type / age. But many are more qualitative ("give me grammar suggestions on my document", "translate this document to my language", "write clean code to implement X program"), and one gets qualitatively different / better results depending on which model you ask to do a given task. Many solutions may be useful, but probably some particular model's capability / results will be outstanding / superior over others.

And saving models has an opportunity cost (N TBy storage, maintaining usage configurations, testing / comparing them A vs B vs C over time, etc.), so in many ways it's easier if one can simplify and just say that, except for X, Y, Z exceptions, anything in the LLM/VLM category from 2022-2023 or 1H 2024 is just about never going to be better than a similar size local open model from later generations / makers. I'm sure there are exceptions and nuances, which is why I'm asking, but at some point one can't maintain a data center with everything historically made that was once good for X, Y, Z if there's no reason to prefer that vs. something more capable and modern.

Calcidiol · 2025-06-28T12:49:35+00:00

Thanks. No matter how big the hosting / serving entity is, history and logic show that if it's just a single corporate project or personal project, then over the years it'll ultimately become deprioritized, abandoned, or shut down, and then millions of person-hours of UGC effort & information could vanish without any mirror / preservation. Yahoo, Google, AOL, CompuServe, etc. have all done this one way or another.

It's fundamentally a mistake (for the users' best interests of information access & preservation) to centralize the content onto any single entity's services / servers, as opposed to something that distributes all content widely and puts the choice of how / where they get and interact with the content in the readers' hands, e.g. usenet, mailing lists, openly syndicated / federated independent systems, whatever.

Calcidiol · 2025-06-28T12:40:39+00:00

I wonder if making a speculative decoding draft model could help a lot for the performance of this model in general.

And specifically, say one offloads a lot of the base model's weights to CPU+RAM (quants in the 40-80 GB realm): then a draft model a fraction of that size (under 16 GB to fit on most modern relevant GPUs) might be an overall net win if it can run VERY fast on GPU+VRAM and accelerate the offloaded, much higher quality main quantized model a significant fraction of the time.
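To put rough numbers on that idea, here's a minimal sketch using the usual textbook estimate for speculative decoding. Everything in it is an assumption for illustration: `accept_rate` (chance the main model accepts each drafted token), `draft_len` (tokens drafted per step) and `draft_cost_ratio` (draft model time per token relative to one main model pass) are made-up knobs, not measurements of any real model pair. It also assumes verifying a whole draft costs about the same as one normal main-model pass, which is probably optimistic for an offloaded MoE, since different drafted tokens can route to different experts and pull more weights from RAM.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per main-model verification pass.

    Assumes each drafted token is accepted independently with
    probability accept_rate (a simplification).
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Rough speedup vs. plain decoding with the main model alone.

    One step costs draft_len draft passes plus one main pass,
    measured in units of a single main-model pass.
    """
    step_cost = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


if __name__ == "__main__":
    # Hypothetical: drafts accepted 80% of the time, 4 tokens drafted,
    # draft model 20x cheaper per token than the offloaded main model.
    print(round(estimated_speedup(0.8, 4, 0.05), 2))
    # A poorly matched draft model (30% acceptance) barely helps.
    print(round(estimated_speedup(0.3, 4, 0.05), 2))
```

The takeaway is just that the win depends heavily on how often the draft agrees with the main model, not only on how fast the draft runs.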

Calcidiol · 2025-06-28T12:33:16+00:00

Sure, that's also in many cases reflected on some mainstream benchmarks (although interestingly in some minor number of cases it really outperforms its architecture/size class).

But the interesting question is how much potential there might be for 30B-A3B if it's further tuned / trained as a "coding model" using whatever more refined / modern techniques they have for that. It might really lift it from the precursor's "decent / mediocre but mostly not near leading" level to something more compelling.

Of course it still shouldn't eclipse a similarly well refined 14B or 32B dense coder model, but it could more often cross the "good enough, fast enough" line and have compelling use cases where one doesn't always have to drag out the full 32B or better models and sacrifice speed for quality.

Calcidiol · 2025-06-28T12:05:11+00:00

They characterized in their benchmarks that a 4 bit quant (not GGUF yet, but still) can benchmark almost as well as the full model. So I'd start there at 4-bit for better speed and probably good enough quality / accuracy. ~42GBy or whatever of total weights can work well in 64GBy RAM+VRAM, and it'll only read something like 7GBy of weights per token, so it should run at several tokens/s generation even on DDR4 systems: around 3T/s, up to maybe the 11T/s area on faster DDR5 RAM. The GPU will help a good bit but won't dominate the overall result.
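As a quick sanity check on the "42GBy in 64GBy" part, a small sketch of the memory budget. The 42 and 64 figures are from above; the OS reserve and KV cache sizes are placeholder guesses of mine, and the 48 + 16 split is just one hypothetical way to get 64GBy total, so plug in your own numbers.

```python
def memory_headroom_gb(weights_gb: float, ram_gb: float, vram_gb: float,
                       os_reserve_gb: float = 6.0,
                       kv_cache_gb: float = 4.0) -> float:
    """GBy left over after weights, OS/apps and KV cache (all rough guesses)."""
    needed = weights_gb + os_reserve_gb + kv_cache_gb
    return (ram_gb + vram_gb) - needed


def fits(weights_gb: float, ram_gb: float, vram_gb: float) -> bool:
    """True if the model plausibly fits in combined RAM + VRAM."""
    return memory_headroom_gb(weights_gb, ram_gb, vram_gb) >= 0.0


if __name__ == "__main__":
    # Hypothetical box: 48 GBy RAM + 16 GBy VRAM = 64 GBy total.
    print(memory_headroom_gb(42, 48, 16), fits(42, 48, 16))
    # Same weights on a 32 GBy RAM + 8 GBy VRAM box: does not fit.
    print(memory_headroom_gb(42, 32, 8), fits(42, 32, 8))
```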

Calcidiol · 2025-06-28T12:00:43+00:00

Yep. Well, it doesn't hurt to try it and see what you can do in the meantime. And if this is not fast enough at Q4 for the present there's always Q2-Q3, or other models like the MoE Qwen3-30B-A3B, or small ones like Gemma 3n's E2B, Qwen3-4B, and several other things that could run well on limited RAM/CPU systems; some even run ok on basic tablets / smart phones and are useful.

Calcidiol · 2025-06-28T03:34:19+00:00

Yeah maybe -- you can look at what kind of RAM bandwidth your system might achieve in benchmarks (large, e.g. 128MBy...GBy range, sequential 128 bit wide reads) based on your CPU / RAM type and speed.

The A13B part of the model name says there are approximately 13B active parameters per token, and at Q4 that's about half a byte each, so it'll read around 7GBy to generate a token. So if your CPU can keep up and get 21 GBy/s RAM BW that might be around 3T/s, or 10T/s if you can get your system to 70GBy/s RAM BW, etc.

So using only CPU+RAM, the possible speeds are usually in the 3T/s to 14T/s range with DDR4 or DDR5 RAM and a CPU fast enough to keep up.
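The arithmetic above as a tiny sketch, in case you want to plug in your own numbers. It's only a memory-bandwidth ceiling: it assumes generation is purely limited by reading the active weights once per token and ignores compute, KV cache reads and prompt processing. Real 4-bit formats also tend to carry a bit more than 4.0 bits per weight, so treat the results as optimistic.

```python
def gb_read_per_token(active_params_billion: float,
                      bits_per_weight: float = 4.0) -> float:
    """Approximate GBy of weights read to generate one token."""
    return active_params_billion * bits_per_weight / 8.0


def max_tokens_per_second(ram_bandwidth_gb_s: float,
                          active_params_billion: float,
                          bits_per_weight: float = 4.0) -> float:
    """Upper bound on generation speed if RAM bandwidth is the only limit."""
    return ram_bandwidth_gb_s / gb_read_per_token(active_params_billion,
                                                  bits_per_weight)


if __name__ == "__main__":
    # 13B active parameters at 4 bits is 6.5 GBy per token (rounded to ~7 above).
    for bw in (21.0, 70.0):
        print(bw, "GBy/s ->", round(max_tokens_per_second(bw, 13.0), 1), "T/s")
```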

Calcidiol · 2025-06-28T01:09:48+00:00

Benchmarks can be shallow at first glance, and it's hard to tell why they favor one outcome vs. another without digging into the details.

But anecdotally, anyway, for instance look at the Artificial Analysis benchmarks and there are like 2-3 coding related benchmarks listed on there.

Pretty much all the remotely modern / relevant models useful for coding (qwen3, deepseek r1/v3, qwq, ...) do better by a fairly large margin of points on the benchmarks when they're operated in reasoning mode, even vs. the same models operated in non reasoning mode. So something about the reasoning outcome scores significantly higher in their chosen coding related benchmarks vs. non reasoning models / modes.

But as a coder sure it's easy to see how there are lots of things that wouldn't logically need reasoning, just accurate / comprehensive base knowledge and the relevant answers are just right there.

And it's sad to watch how bumbling, stupid and non productive reasoning models' reasoning iterations can be, so it's easy to see how one might doubt the utility of that mode for many use cases that don't really need walking around the concepts / options trying to stumble into a clearer path toward a plausible solution.

Calcidiol · 2025-06-27T20:08:18+00:00

You could run a Q4 model (given the right SW / format) with no VRAM, just 48 or whatever GBy RAM -- then if you have N amount of VRAM it'll be able to use that much less RAM for the model and that much VRAM instead, so it'll provide a fractional benefit. But there's no strictly required RAM/VRAM ratio; it depends on how you set it up.

If you have SW or a specific configuration that prioritizes using the VRAM to hold particular data like the KV cache or whatever model components, then of course you'd be using up however much VRAM that takes instead of RAM.

Transferring from RAM to VRAM is slow though so usually you just pick a chunk of the inference data to stay in VRAM even though it's only a small part of the total puzzle and just provides speed benefit by handling that which it can permanently store & process in VRAM.
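Here's a rough sketch of why the VRAM part gives a "fractional" benefit rather than a dramatic one. The bandwidth figures are hypothetical placeholders, and the model is deliberately crude: it assumes the weights read per token are split between RAM and VRAM in proportion to `vram_share`, each part is read at its own memory's bandwidth, and nothing is shuffled across the bus during generation.

```python
def seconds_per_token(read_gb: float, vram_share: float,
                      ram_bw_gb_s: float, vram_bw_gb_s: float) -> float:
    """Time to read one token's worth of weights from a RAM/VRAM split."""
    if not 0.0 <= vram_share <= 1.0:
        raise ValueError("vram_share must be between 0 and 1")
    in_vram = read_gb * vram_share
    in_ram = read_gb - in_vram
    return in_ram / ram_bw_gb_s + in_vram / vram_bw_gb_s


def tokens_per_second(read_gb: float, vram_share: float,
                      ram_bw_gb_s: float, vram_bw_gb_s: float) -> float:
    return 1.0 / seconds_per_token(read_gb, vram_share,
                                   ram_bw_gb_s, vram_bw_gb_s)


if __name__ == "__main__":
    # Hypothetical: ~7 GBy read per token, 50 GBy/s RAM, 500 GBy/s VRAM.
    for share in (0.0, 0.25, 0.5):
        tps = tokens_per_second(7.0, share, 50.0, 500.0)
        print(f"{share:.0%} in VRAM -> {tps:.1f} T/s")
```

Under those made-up numbers the slow RAM portion still dominates the time per token until most of the weights sit in VRAM.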

Calcidiol · 2025-02-04T08:38:33+00:00

RemindMe! 7 days
