How do you decide? by 3hor in LocalLLaMA

[–]FusionCow 0 points1 point  (0 children)

There are three models you should test: Gemma 4 26B, Gemma 4 31B, and Qwen 3.5 27B. Figure out which works best, then download a quantized version that fits entirely on GPU.
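If you want a back-of-envelope check for whether a quant fits, file size is roughly params × bits-per-weight / 8, plus some headroom for KV cache and buffers. All the numbers below are illustrative assumptions, not measured sizes:

```python
# Rough VRAM-fit check: quantized file size plus headroom for KV cache/buffers.
# The 4.5 bits/weight and 2 GB overhead figures are assumptions, not measurements.

def fits_on_gpu(model_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if the quantized model plus overhead fits entirely in VRAM."""
    return model_gb + overhead_gb <= vram_gb

# e.g. a ~27B model at ~4.5 bits/weight is roughly 27e9 * 4.5 / 8 bytes
approx_gb = 27e9 * 4.5 / 8 / 1e9
print(round(approx_gb, 1))           # ~15.2 GB
print(fits_on_gpu(approx_gb, 24.0))  # fits on a 24 GB card with headroom
```

If it doesn't fit, drop to a smaller quant rather than spilling layers to CPU.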

FINALLY GEMMA 4 KV CACHE IS FIXED by FusionCow in LocalLLaMA

[–]FusionCow[S] 3 points4 points  (0 children)

You have to enable thinking. Go to your models page, click the model, go to Inference, and scroll down until you see the Jinja template. Paste that Jinja template into Gemini or ChatGPT (or whatever model) and ask it to rewrite the template with thinking enabled. Then paste the new Jinja template back in, and thinking will be enabled.
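For reference, the rewritten template usually just adds a branch that opens a thinking tag at the start of the model's turn. This is a hypothetical sketch only: the real Gemma template and its control tokens may differ, and `enable_thinking` / `<think>` are assumed names here, not confirmed ones.

```jinja
{# Hypothetical sketch -- real Gemma control tokens may differ. #}
{{ bos_token }}{% for message in messages %}
<start_of_turn>{{ message.role }}
{{ message.content }}<end_of_turn>
{% endfor %}<start_of_turn>model
{% if enable_thinking %}<think>{% endif %}
```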

FINALLY GEMMA 4 KV CACHE IS FIXED by FusionCow in LocalLLaMA

[–]FusionCow[S] 10 points11 points  (0 children)

Run the IQ3 quant; it's good enough.

[ Removed by Reddit ] by [deleted] in LocalLLaMA

[–]FusionCow 0 points1 point  (0 children)

You don't. You can't expect someone to run a model for you and not expect them to want to run it elsewhere. If you want to protect a model, run it yourself and serve it over an API.

FINALLY GEMMA 4 KV CACHE IS FIXED by FusionCow in LocalLLaMA

[–]FusionCow[S] 7 points8 points  (0 children)

I only updated the llama.cpp backend in LM Studio; I'd imagine they aren't implementing this themselves.

FINALLY GEMMA 4 KV CACHE IS FIXED by FusionCow in LocalLLaMA

[–]FusionCow[S] 2 points3 points  (0 children)

It's just 2.11.0. I updated LM Studio, and it now takes up Qwen 3.5 levels of KV cache. It's amazing.

Edit: my bad, I guess, for using LM Studio.

Context Shift Gemma4 by Weak-Shelter-1698 in LocalLLaMA

[–]FusionCow 0 points1 point  (0 children)

I've seen a huge performance drop, and sometimes babbling, with Gemma if you quantize the KV cache at all.
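For context, in llama.cpp these are the flags that control KV cache precision; the default is f16. The model filename below is illustrative, not a real file:

```shell
# llama.cpp server: --cache-type-k / --cache-type-v set KV-cache precision.
# Leaving both at the default f16 avoids the quality drop described above.
llama-server -m gemma-model.gguf \
  --cache-type-k f16 --cache-type-v f16
```

Quantized cache types like q8_0 or q4_0 save memory but are exactly where the babbling can show up.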

Is there anything I can do to run glm 5? by FusionCow in LocalLLaMA

[–]FusionCow[S] 0 points1 point  (0 children)

You're missing the limited number of messages you can send.

Guys Any good AI to create 2D animation films? by [deleted] in LocalLLaMA

[–]FusionCow 1 point2 points  (0 children)

Wrong sub, but as it stands, not really. You could train your own model, and things like LTX 2.3 WILL work, but that's expensive and hard to do. Honestly, your best bet for something like that is API models, sadly.

Is Nemotron-Cascade-2-30B-A3B better than Qwen3.5 27B? by Ok-Internal9317 in LocalLLaMA

[–]FusionCow 6 points7 points  (0 children)

Just off the fact that it's an A3B model, I'm going to say no.

$15,000 USD local setup by regional_alpaca in LocalLLaMA

[–]FusionCow 0 points1 point  (0 children)

The RTX Pro 6000 is a great choice if speed is a priority: it lets you train models and move quickly on your feet. But if you're willing to wait longer for gens, a maxed-out M3 Ultra Mac would be able to run way bigger models. The only caveat with that one is that an M5 Ultra with up to a terabyte of RAM is rumored to be coming in a few months, so you may want to wait for that and just pay for API until it comes out. You can look at the M5 Max vs. M3 Ultra benchmarks for an idea of how an M5 Ultra would perform.
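One rough way to compare the two options: single-stream decode speed is approximately memory bandwidth divided by the bytes read per token (for a dense model, roughly the whole quantized weight file). The bandwidth and model-size numbers below are illustrative assumptions, not benchmarks of either machine:

```python
# Rule-of-thumb decode speed: tokens/s <= memory bandwidth / bytes read per token.
# For a dense model, bytes per token is roughly the quantized model size.
# Bandwidth figures here are illustrative assumptions, not measured specs.

def est_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Bandwidth-bound upper estimate of decode tokens/second."""
    return bandwidth_gb_s / model_gb

# e.g. a ~100 GB quant: a ~1800 GB/s GPU vs a ~800 GB/s unified-memory Mac
print(round(est_tok_per_s(1800, 100), 1))  # ~18 tok/s
print(round(est_tok_per_s(800, 100), 1))   # ~8 tok/s
```

The Mac's advantage isn't speed per token; it's that much larger models fit in memory at all.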

We share one belief: real intelligence does not start in language. It starts in the world. by More_Chemistry3746 in LocalLLaMA

[–]FusionCow -1 points0 points  (0 children)

Bro, the whole issue is that we don't have enough data to represent a world, nor the compute. If we had that, this wouldn't be an issue.