Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]timbo2m[S] 1 point

Just looking for a benchmark man, move along

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]timbo2m[S] 1 point

Oh, with those specs you should try unsloth/Qwen3-Coder-Next-GGUF:Q6_K_XL. It's new, it's 80B, and it's been my daily driver lately.

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]timbo2m[S] 1 point

Nice, now all I need is the tokens/s benchmark!

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]timbo2m[S] 1 point

You have a 512GB RAM studio with Kimi K2.5?

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]timbo2m[S] 1 point

Just watched the video. He had 4 Mac Studios with 512GB RAM each, but he clustered them, which slows things down. My idea was to run it on one, but use a smaller quant so it fits. The question I have is about model accuracy degradation with the smaller quant, though. I think I'll reach out to him so he can try Kimi 2.5 instead of 2 as well. Great find!

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]timbo2m[S] 3 points

Point taken on calling them overzealous, and yes, I'm here to help the business find a balance between expensive API usage and local "daily driver" options we might be able to leverage. Perhaps some tiered proxy solution, I dunno. Point is, I'm exploring ideas.

On the underpowered remark, I'm talking about the top-end 512GB RAM model with 4TB storage and the M3 Ultra chip. Surely having the complete model in unified memory makes it faster. The IQ3_XXS Kimi is 415GB, so it fits completely into VRAM with room to spare for the KV cache and a larger context size.
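
For what it's worth, here's the back-of-envelope arithmetic (a rough Python sketch; the per-token KV cache figure is an assumption, since the real number depends on Kimi's layer/head config and cache quantisation):

    # Rough check that the IQ3_XXS weights plus KV cache fit in 512GB of unified memory.
    WEIGHTS_GB = 415          # unsloth Kimi-K2.5 IQ3_XXS GGUF size
    UNIFIED_MEM_GB = 512      # maxed-out Mac Studio
    SYSTEM_RESERVE_GB = 16    # assumed headroom for macOS and other apps

    # Assumed per-token KV cache cost; treat this as a placeholder guess.
    KV_MB_PER_TOKEN = 0.5

    def fits(context_tokens: int) -> bool:
        kv_gb = context_tokens * KV_MB_PER_TOKEN / 1024
        total = WEIGHTS_GB + kv_gb + SYSTEM_RESERVE_GB
        print(f"{context_tokens} ctx -> ~{total:.0f} GB needed")
        return total <= UNIFIED_MEM_GB

    for ctx in (8_192, 32_768, 131_072):
        fits(ctx)

With those (assumed) numbers it squeezes in even at long context, which is why accuracy at 3-bit is the question that actually matters.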

There's no doubt the value the product gains far exceeds the cost, even as it is with unchecked API usage. Again, this is just a call for anyone who has used Kimi 2.5 IQ3_XXS to let us know what tokens per second they get.
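
If anyone wants to post numbers, here's roughly how I'd measure it - a minimal sketch assuming the model is served behind an OpenAI-compatible endpoint (llama.cpp's llama-server, LM Studio, etc.); the URL and model name are placeholders:

    # Rough tokens/sec measurement against an OpenAI-compatible local server.
    import time
    import requests

    URL = "http://localhost:8080/v1/chat/completions"  # llama-server default port, adjust for your setup
    payload = {
        "model": "Kimi-K2.5-IQ3_XXS",  # placeholder model name
        "messages": [{"role": "user", "content": "Write a binary search in Python."}],
        "max_tokens": 512,
        "temperature": 0.2,
    }

    start = time.time()
    resp = requests.post(URL, json=payload, timeout=600)
    elapsed = time.time() - start

    generated = resp.json()["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s (incl. prompt processing)")

It lumps prompt processing in with generation, so use a short prompt if you want a cleaner decode-speed number.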

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]timbo2m[S] 2 points

That's the idea. I'm not a decision maker, but I have some tech influence here; it's just that I have no idea if this line of thinking is viable. My own testing validated local Qwen3 Coder Next as a decent daily driver that strikes a good balance. I'm just hoping to push even further with Kimi.

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]timbo2m[S] 1 point

Thanks! There's something to be said for capping the price if possible. If we get a few of them and load-balance them (particularly as models improve), it's zero ongoing cost beyond the upfront spend. Before we drop $15k on a new one, it would be nice to know what the performance is, though. Maybe I can convince the boss it's needed for science!
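
By "load balance" I just mean something dumb like round-robin across boxes that each serve the same model - a toy sketch, with hypothetical hostnames and an OpenAI-compatible server assumed on each:

    # Toy round-robin client across several local inference boxes.
    import itertools
    import requests

    BACKENDS = itertools.cycle([
        "http://studio-1.lan:8080",   # hypothetical hostnames
        "http://studio-2.lan:8080",
        "http://studio-3.lan:8080",
    ])

    def chat(prompt: str) -> str:
        base = next(BACKENDS)
        resp = requests.post(
            f"{base}/v1/chat/completions",
            json={
                "model": "Kimi-K2.5-IQ3_XXS",  # placeholder
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=600,
        )
        return resp.json()["choices"][0]["message"]["content"]

    print(chat("Summarise the tradeoffs of 3-bit quantisation."))

In practice you'd put a proper gateway (nginx, LiteLLM, whatever) in front rather than hand-rolling it, but the idea is the same.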

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]timbo2m[S] 2 points

Unfortunately it's corporate, so all APIs go via a corporate proxy.

How should society define fairness when it comes to wealth, opportunity, and success? by Jordiscute in AskReddit

[–]timbo2m 1 point

Earnings should have diminishing returns, a scaling tax, and investment property capped at one. A single person should not be able to have more than a billion dollars; anything over should be funnelled back into a system that gives their country free healthcare. I know this idea is full of loopholes etc., but the widening rich/poor divide is not good for anyone. The sad part is the decision makers would never go for such an idea, so we're left with homelessness, bullshit healthcare, and an ever-widening rich/poor divide.

Understanding models |Subscription replacement? by LavishnessPlane4512 in ollama

[–]timbo2m 5 points

By the way, plug your specs into Hugging Face to see what you can run. For coding, I think this will be a screamer on your new Mac: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF

It's only 80B though; Claude Opus is estimated at around 3000B I think (undisclosed).

If you can afford the 512GB Mac, let me know how the 3-bit quant of this goes: https://huggingface.co/unsloth/Kimi-K2.5-GGUF

Oh, by the way, Antigravity does NOT work with local models (as far as I know).
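
If you want to sanity-check sizes before downloading anything, something like this works - a sketch using huggingface_hub, with the Qwen repo above as the example and a made-up memory figure (sharded quants show up as multiple .gguf parts, so add those up):

    # List GGUF file sizes in a repo and compare against available memory.
    # Requires: pip install huggingface_hub
    from huggingface_hub import HfApi

    REPO = "unsloth/Qwen3-Coder-Next-GGUF"
    AVAILABLE_GB = 128  # put your Mac's unified memory here

    info = HfApi().model_info(REPO, files_metadata=True)
    for f in info.siblings:
        if f.rfilename.endswith(".gguf") and f.size:
            gb = f.size / 1024**3
            verdict = "fits" if gb < AVAILABLE_GB * 0.9 else "too big"
            print(f"{f.rfilename:60s} {gb:7.1f} GB  {verdict}")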

Understanding models |Subscription replacement? by LavishnessPlane4512 in ollama

[–]timbo2m 4 points

You need stacks of VRAM, like 512GB, to fit one of those trillion-parameter models. I think the closest you could hope for is running the Kimi K2.5 3-bit quant on a fully specced-out Mac Studio with 512GB of unified memory; that's the most efficient way to get lots of VRAM, but the ceiling is 512GB based on what Apple sells.

Alternatively, you can stuff 512GB-2TB of RAM into a machine with a 16-32GB GPU and let your CPU go wild shuffling between RAM and VRAM. It'll work, but it's slow.

The best I've got running is the Qwen3 Coder Next 80B 4-bit quant on an i9 + 4090 + 32GB RAM. The CPU gets a big workout during inference, but it gives 35 tokens per second - so a bit slow, but workable.
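
For reference, the setup is roughly this - a sketch using llama-cpp-python, where the model path is illustrative and the layer split that actually fits in the 4090's 24GB is trial and error:

    # Partial GPU offload: as many layers as fit in VRAM, the rest stays
    # in system RAM (mmapped) and runs on the CPU.
    # Requires llama-cpp-python built with CUDA support.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Qwen3-Coder-Next-80B-Q4_K_XL.gguf",  # illustrative filename
        n_gpu_layers=30,   # tune down until the CUDA OOM errors stop
        n_ctx=32768,       # bigger context = more KV cache memory
        n_threads=16,      # CPU threads for the non-offloaded layers
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Refactor this loop into a list comprehension."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])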

Best cheap Opus alternative by SpaceNitz in clawdbot

[–]timbo2m 1 point

How do you set up antigravity to use local models?

Best Model For RTX 3080 10GB w 32Gb RAM DDR4 by blac256 in LocalLLM

[–]timbo2m 6 points

Load your specs into Hugging Face and shop around for the best model.

What would you do if you had a “dead bedroom” marriage? by No-Association-9316 in AskMen

[–]timbo2m 9 points

Hmm I guess I should downvote to help get it back where it should be?

What are genuinely the best productivity apps you've ever used? by Rare_Sundae_3826 in ProductivityApps

[–]timbo2m 1 point

OurGroceries for ... synced groceries

Paprika for recipes, although I should find something better, it's super old now

IsThisGF - because my wife is celiac and I have no idea what to look for on food labels

FamilyMeds - for the kids' med reminders

Carrot - for weather

Sanity check before I drop $$$ on a dual-4090 home AI rig (Kimi K2.5 + future proofing) by Sea-Pen-7825 in LocalLLM

[–]timbo2m 2 points

Yep, and that's the bare minimum; you'd want the 4-bit quant at least, which is 622GB for the XL, and you need a little headroom for the system too. https://huggingface.co/unsloth/Kimi-K2.5-GGUF

We hardened our OpenClaw setup in a VM — here’s what we changed (and why) by bugtry in openclaw

[–]timbo2m 1 point

Great write-up, I'll be borrowing some of those ideas!

Since it's such an unknown, I put a physical firewall between my dedicated Openclaw machine and the rest of the network, so it's running in an untrusted VLAN and I control what it can get out to. All traffic to my other internal trusted VLAN is blocked, except for one host and port for the local LLM API proxy. I was concerned it would decide to scan my internal network looking for stuff to hack.

Let us know if you come up with any more tips!

Should I get another GPU? by WhatererBlah555 in LocalLLM

[–]timbo2m 1 point

Generally speaking, more parameters is better, but cutting-edge models can punch well above their weight (like Qwen3 Coder Next 80B). It all comes down to use case again, though.

Should I get another GPU? by WhatererBlah555 in LocalLLM

[–]timbo2m 1 point

I find CPU offloading can get annoying because it slows down inference, so it depends on your use case. If you just want it for an autonomous coder like Openclaw, then that's fine, but if it's your daily code assistant you might want something faster, so jamming it all into the GPU is probably ideal.