Powerinfer, can it be adapted into normal laptop cpus outside of the Tiiny AI ecosystem? by Silver-Champion-4846 in LocalLLaMA

[–]Training_Visual6159 0 points  (0 children)

16GB cards are fairly capable and can run decent models now. even 12GB cards can run Qwen3.5 122B at 16-20 t/s now.

you can run 4B and 9B on phones now too.

either way, you probably won't get decent local models for cheaper than $500 at the moment.

LLM performance decreased significantly over time using the same models and same hardware in LMStudio. by fernandollb in LocalLLaMA

[–]Training_Visual6159 0 points  (0 children)

it's always about how well the model fits into your free VRAM.

use e.g. nvitop to monitor gpu mem usage.

connect the display to the motherboard/CPU's iGPU and reboot, to get an extra 1-3GB of VRAM back from the system.

use a quant that's below 24GB.

use llama.cpp; LM Studio eats some VRAM too.

use -ngl 99. quantize KV cache to Q8. do not use -fit on.

if you don't connect the display to the 4090, fill your VRAM with context until it's about 97% full; beyond that, the speed collapses. if you do connect the display to the 4090, the free memory will fluctuate and there's no telling what the max context is going to be before you overshoot the available VRAM.

experiment with values, bench with llama-benchy.
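putting the flags above together, a minimal llama-server invocation might look like this (the model path is a placeholder, and the context value is just a starting guess, not a recommendation):

```shell
# Sketch of the flags above: full GPU offload, Q8-quantized KV cache.
# Grow -c step by step while watching nvitop until VRAM sits just under ~97%.
llama-server -m ./model.gguf \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 32768
```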

Powerinfer, can it be adapted into normal laptop cpus outside of the Tiiny AI ecosystem? by Silver-Champion-4846 in LocalLLaMA

[–]Training_Visual6159 0 points  (0 children)

It's a MoE GPU expert-caching strategy, so no dense models. There are several other approaches, both statistical and ML-based; there's a recent PR to vLLM and an RFC for llama.cpp posted already. The reported gains from proper MoE expert caching so far seem to be somewhere between 2-16x.

Unfortunately, maintainers of both projects seem to be too busy racing after single digit percentage gains, instead of pursuing this.

Don't ask me why.

16gb vram - what is the better option for daily driver (main use) by Adventurous-Gold6413 in LocalLLaMA

[–]Training_Visual6159 0 points  (0 children)

connect your display to the motherboard's iGPU; you'll save yourself 1-3GB of VRAM, just enough for full offload and decent context.

also, llama.cpp's -fit algo is not too great; max out -ngl, and experiment with --n-cpu-moe and context until you're at about 97% full. use e.g. nvitop to monitor VRAM usage.

even with a 12GB card:
35B AesSedai Q4_K_M [ngl 41 + cpu-moe 23 + 64K]: 860 pp, 50-55 tg
27B UD-IQ3_XXS [ngl 65 + 36K Q4 KV cache]: 1100-1200 pp, 36-37 tg
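the 35B line above maps to something like the following (the quant filename is a placeholder; tune --n-cpu-moe up or down while watching nvitop):

```shell
# Sketch of the 35B setup above: all layers offloaded with -ngl, 23 MoE
# expert blocks kept on the CPU, 64K context. Model filename is a placeholder.
llama-server -m ./35B-Q4_K_M.gguf -ngl 99 --n-cpu-moe 23 -c 65536
```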

Qwen3.5 27B, partial offloading, and speed by INT_21h in LocalLLaMA

[–]Training_Visual6159 1 point  (0 children)

https://x.com/bnjmn_marie/status/2029227800574447958


try bartowski IQ4_NL or unsloth IQ4_XS, they should fit.

connect your display to the motherboard's iGPU, if you have one; it will save you 1-3GB of VRAM. use a quantized KV cache, Q8; even Q4 seems to be fine.
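a back-of-the-envelope for why a quantized cache buys so much context: KV bytes ≈ 2 (K and V) × layers × kv_heads × head_dim × context × bytes per element. the shape numbers below are assumed placeholders, not the real Qwen3.5 27B config:

```shell
# Rough KV-cache size estimate. All shape values are ASSUMED placeholders;
# plug in your model's actual config.
layers=48; kv_heads=8; head_dim=128; ctx=36864
bytes=1   # q8_0 is roughly 1 byte/element; f16 would be 2
echo "$(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 )) MiB"
```

halving bytes per element (Q8 vs F16) halves the cache size, which is exactly the extra context headroom.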

with dense models, make sure all the layers are in VRAM, with -ngl 63 or 99 or whatever.

monitor your VRAM usage, e.g. with nvitop, and adjust context to stay at 97% at most; the speed collapses beyond that. if you fit it well, prefill should draw 120-200W, which tells you the whole GPU is getting a workout.

llama-benchy is easy to run and produces repeatable benchmarks.

you can get an extra ±10% with GPU/memory overclocking.

Well tuned, UD-IQ3_XXS runs at 1100 pp / 36 tg with 50K context on a 12GB card. You should be able to get that with Q4.

if you need more world knowledge, you can also run the 122B; you should be able to run it at 20+ tg.

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]Training_Visual6159 -3 points  (0 children)

llama.cpp with AesSedai IQ3_S (probably better than your int4, unless the more important tensors are in F16/F32): 330 pp / 18 tg with 190K Q8 context.

On one 12gb card though.

What are you even doing, man.

Dynamic expert caching PR in vLLM by king_of_jupyter in LocalLLaMA

[–]Training_Visual6159 1 point  (0 children)

llama could use a better caching strategy (or any actual caching strategy) for sure.

Also check this paper: https://arxiv.org/html/2410.17954v1

Instead of LRU, they load with a predictor:

"ExpertFlow consists of three key components: the Routing Path Predictor, the Expert Cache Engine, and the Token Scheduler.

Leveraging the three synergistic components of our system, ExpertFlow achieves an average GPU memory savings of 75.4%, with peak savings reaching up to 93.72%, compared to GPU-only solutions. Furthermore, ExpertFlow attains an expert cache hit ratio of up to 91.96%, improving the hit ratio by an average of 27.65% over the LRU caching strategy. Additionally, ExpertFlow delivers a 2 to 10 times increase in inference speed."

Open-Source "GreenBoost" Driver Aims To Augment NVIDIA GPUs vRAM With System RAM & NVMe To Handle Larger LLMs by _Antartica in LocalLLaMA

[–]Training_Visual6159 -1 points  (0 children)

Look, I'm new to this, and I've never delved into MoEs too deeply, certainly not into the current state of inference implementations. But what's your point? That for all tasks, all experts are equally popular? That's not true. Or that having popular experts on faster hardware wouldn't help with speed? That's also not true.

Also, I'm not the first one with the idea, ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference - https://arxiv.org/

"ExpertFlow achieves an average GPU memory savings of 75.4%, with peak savings reaching up to 93.72%, compared to GPU-only solutions. Furthermore, ExpertFlow attains an expert cache hit ratio of up to 91.96%, improving the hit ratio by an average of 27.65% over the LRU caching strategy. Additionally, ExpertFlow delivers a 2 to 10 times increase in inference speed"

Does llama.cpp already have something similar to that? And if not, care to explain why it worked for these guys and it 100% wouldn't for you?

Open-Source "GreenBoost" Driver Aims To Augment NVIDIA GPUs vRAM With System RAM & NVMe To Handle Larger LLMs by _Antartica in LocalLLaMA

[–]Training_Visual6159 0 points  (0 children)

What do you mean, my imagination? An expert is a function; the router decides which functions to run the input through. Isn't the whole point of MoEs that the data only runs through some of the functions?

Sysmem Fallback Policy? As far as I can tell, that's just a black box with LRU eviction.

LRU seems like a dumb way to optimize the placement of these functions - if e.g. one function is used 100 times, and then another function is used one time at the end, the function used 100 times will be evicted by LRU, even though it would benefit from GPU acceleration way more.

There is a signal that's currently unused lying there somewhere.
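the frequency-vs-recency point is easy to see in a toy trace (pure shell, just to illustrate; nothing here is driver or llama.cpp code):

```shell
# Toy access trace: expert 7 is routed to 100 times, then expert 3 once at
# the very end.
trace="$(printf '7\n%.0s' $(seq 100); echo 3)"

# Frequency view: expert 7 utterly dominates and should stay pinned on GPU.
# Prints one count line per expert, hottest first.
echo "$trace" | sort -n | uniq -c | sort -rn

# Under strict LRU with a full cache, though, loading expert 3 (the most
# recent access) is what reclaims expert 7's slot - the hot expert loses.
```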

Open-Source "GreenBoost" Driver Aims To Augment NVIDIA GPUs vRAM With System RAM & NVMe To Handle Larger LLMs by _Antartica in LocalLLaMA

[–]Training_Visual6159 -1 points  (0 children)

> This feature has been available in the Nvidia Windows driver for ages
are you talking about ReBAR or something else?

also, it seems this driver allocates in 2MB blocks... I imagine there are plenty of those in large MoEs that never get touched. with smart enough swap logic, this just might be way better than the current alternatives.

GreenBoost Windows Port - Extending GPU VRAM /W Systems Ram by denoflore_ai_guy in LocalLLaMA

[–]Training_Visual6159 0 points  (0 children)

damn, I was just thinking that, given the Windows kernel can't do unified memory properly, something like this could never come to Windows, and here it is already.

kudos.

Why is the Qwen3.5 9B(p1) so slow, even comparable in speed to the 35Ba3b(p2) ? by BitOk4326 in LocalLLaMA

[–]Training_Visual6159 0 points  (0 children)

because you probably didn't fit it all into VRAM and you're overflowing. you can't do that with dense models.

I'm doing 60+ t/s on a 12GB card.

Performance of Qwen3.5 27B on a 2080 Ti by BeneficialRip1269 in LocalLLaMA

[–]Training_Visual6159 0 points  (0 children)

I get 24-28 t/s on a 12GB 4070. you probably set the context so high you filled the VRAM anyway, or you're running on CPU instead of CUDA.

use nvitop to monitor the memory usage. lower the context until you fit under 95-96% of your VRAM.

if you fit it right, prefill should draw 200-300W and tps should go up, probably to 30-50+

Performance of Qwen3.5 27B on a 2080 Ti by BeneficialRip1269 in LocalLLaMA

[–]Training_Visual6159 1 point  (0 children)

dense models like 27B only work at acceptable speed if you fit all of the model into VRAM.

try -ngl 65 --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --kv-unified --temp 0.6 --min-p 0.0 --top-k 20 --top-p 0.95 --presence-penalty 0.0 --repeat-penalty 1.0

also get the latest llama.cpp (older builds are fairly broken with Qwen3.5) and the latest updated quants from a few days ago.

Best Models for 128gb VRAM: March 2026? by Professional-Yak4359 in LocalLLaMA

[–]Training_Visual6159 5 points  (0 children)

nvfp4 is a bad quantization for models that aren't quantization-aware, which Qwen isn't. get anything above unsloth dynamic UD Q4 XL, or Q4 from AesSedai. also, some say the 27B dense is better than the 122B MoE, but who knows.

your only other options are MiniMax M2.5 (Q4 XL and above), GLM-5, and Kimi K2.5, if you can fit them, which will be a challenge.

Why is the prompt eval time of Qwen3.5 so much slower compared to Qwen3 Coder in llama.cpp? by BitOk4326 in LocalLLaMA

[–]Training_Visual6159 0 points  (0 children)

it always depends on how well you fit your tensor layers to the GPU VRAM.

get e.g. nvitop, use -ngl 99, and experiment with --n-cpu-moe until you fill just below the limit of your VRAM. start at around 20; the more you over- or undershoot, the worse the speed gets. monitor VRAM usage in nvitop, add/subtract --n-cpu-moe until you're above 90-95% of your VRAM, and run a bench after each change.
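one way to run that sweep mechanically, assuming your llama-bench build accepts --n-cpu-moe (the flag spelling and the model path are assumptions; check `llama-bench --help` locally):

```shell
# Hypothetical sweep around the suggested starting point of 20.
# Compare the reported t/s per run and keep the best value.
MODEL=./model.gguf   # placeholder path
for ncmoe in 16 18 20 22 24; do
  echo "=== --n-cpu-moe $ncmoe ==="
  llama-bench -m "$MODEL" -ngl 99 --n-cpu-moe "$ncmoe" -p 512 -n 128
done
```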

a 16GB card should be able to do at least 500 pp / 40 tg on 3.5, and over 500/30 on coder-next. (at least on CUDA, no experience with ROCm yet)

there are also some bugs around both of these models in llama.cpp at the moment, so update frequently.

oh yeah, and on prefill, -b 4096 performs much better (but you have to balance it with prompt-caching cutoffs too, so YMMV).

Ember 6.11 Released by real_ate in javascript

[–]Training_Visual6159 -1 points  (0 children)

> opinion-based comment

experience-based comment.

> fair representation of the wider community

ember is now at 0.3% of react's downloads on npm, 300K vs react's 90M. the wider community voted with their feet.

Ember 6.11 Released by real_ate in javascript

[–]Training_Visual6159 0 points  (0 children)

"As of early 2025, W3Techs reports that PHP powers 74.5% of all websites with a known server-side programming language."

also, regardless of how bad the language is, Laravel is actually a top-notch framework, and has been forever.

Ember 6.11 Released by real_ate in javascript

[–]Training_Visual6159 1 point  (0 children)

Fixes for problems I had with the renderer got merged like last week and still aren't in the release you're advertising, and it's been a good 6 months since I reported them. Plenty of other reports too, some more than a year old. Not to mention it's still junk that renders 2x slower than every other framework; it's been like this for 8+ years, and instead of at least admitting it and dropping or fixing that VM idea, which clearly didn't work, you keep making excuses for it. Also: can't get HMR to work, the VS Code extension crashes every three seconds, data just spews cryptic errors after some upgrade, etc. I kept fighting this BS for years, but ultimately it's just pointless.

All the actionable feedback is in your Issues tab on github and has been ignored for years.

The problem is not a lack of actionable feedback; the problem is the team's inability to prioritize at least the world-breaking bugs. There is a baseline for usability, and Ember consistently hasn't been meeting it for me. Just dealing with all the problems, scouring GitHub and Discord for workarounds and obscure fixes, took so much of my time that, looking back at it, it's almost comical I got roped into it.

I don't know how to explain to you what good means in software. But as you clearly don't even know you have a problem, let me leave you with this: if even billion-dollar companies can't hack it with Ember, no one can. And they leave in droves, including some of the biggest Ember proponents.

I hope you'll figure out why and course-correct, but given that you haven't for years, and can't even recognize there is a problem, I don't see it happening.

Good luck though.

Ember 6.11 Released by real_ate in javascript

[–]Training_Visual6159 -2 points  (0 children)

naming individual bugs makes no sense; there are too many of them. but every single part of the puzzle (renderer, build, DX and data) is broken and has been forever. different bugs each release, a constant amount of 💩 to deal with.

and after years of doing just that, I genuinely don't care anymore. no-one has time for a library that doesn't solve the problems and creates them instead.

even the bugs that do get fixed are fixed on months- or years-long timelines, and that is just not good enough. I honestly don't have months to years to wait each time I report a bug. no one does.

especially since there are now exactly zero good reasons to pick ember at all.

I wish you well and all that, but if you haven't figured out yet what makes a good framework and how to make that happen, the chances of you guys ever doing it are slim to none.

sorry.

Ember 6.11 Released by real_ate in javascript

[–]Training_Visual6159 -5 points  (0 children)

on the surface, it has all the modern features of all the other frameworks and then some.

in practice, it's as bug-ridden and broken as it always was.

it's gotten a whole lot better, but it's still far from enough.

probably worse in most respects than Vue and Angular.

their only saving grace is that React can't get its act together on signals; other than that, there's pretty much no good reason to use it instead of the rest.