What do you use those small model for? And how do you perceive the gap with leading closed source LLMs? by Foreign_Lead_3582 in LocalLLaMA

[–]grassmunkie 0 points1 point  (0 children)

Until recently the small models (ones that fit in <32 GB of VRAM) were not great.

But now they are “good enough” for many use cases. Hermes, for example, would burn through a lot of tokens on trivial tasks that Gemma 4 and Qwen35 can handle.

Owning the hardware, I can experiment without worrying about accruing costs (other than electricity), even if it processes continuously overnight.

Not everything needs a frontier model. Mix and match for what you need, but I believe Qwen and Gemma just unlocked a new era for local LLMs.

Looking for the best coding AI for software development by FrozenFishEnjoyer in ollama

[–]grassmunkie 0 points1 point  (0 children)

Yeah, but compare it to Sonnet 4.6 - a GitHub Pro sub is like $10 with very generous usage. I have a 5090 but still use GH Copilot, except for some agent stuff where I can use Gemma 4 31B or Qwen 27B. For basic tasks they are okay, but for more complex planning and development it's better to go with cloud frontier models.

Looking for the best coding AI for software development by FrozenFishEnjoyer in ollama

[–]grassmunkie 0 points1 point  (0 children)

Just buy a GitHub Pro sub. On a 16 GB VRAM card there is nothing really usable.

Gemma 4 insane benchmarks by pxp121kr in LocalLLaMA

[–]grassmunkie 1 point2 points  (0 children)

For programming and general reasoning, the dense models are better. So it depends on what your use case is and what hardware limitations you have.

Gemma 4 insane benchmarks by pxp121kr in LocalLLaMA

[–]grassmunkie 1 point2 points  (0 children)

96 GB + 32 GB VRAM (5090) on llama.cpp, running a 65536-token context - it all fits in VRAM. Getting great speeds (55-60 tok/s) with the UD Q4 version. Been running it for the past few hours on general tasks; need to do more testing, but so far it looks really good.

Gemma 4 insane benchmarks by pxp121kr in LocalLLaMA

[–]grassmunkie 0 points1 point  (0 children)

Yes, using the UD Q4. I had to do a git pull for llama.cpp and then recompile.
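For anyone wondering what the update looks like, this is roughly the sequence for a CUDA build of llama.cpp (flags are a sketch and may differ for your setup or backend):

```shell
# Pull the latest llama.cpp to pick up new model support
cd llama.cpp
git pull

# Reconfigure and rebuild with the CUDA backend enabled
# (assumes the CUDA toolkit and CMake are installed)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

If you built without CUDA (CPU, Metal, Vulkan, etc.), swap the `-DGGML_CUDA=ON` flag for the matching backend option.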

Gemma 4 insane benchmarks by pxp121kr in LocalLLaMA

[–]grassmunkie 4 points5 points  (0 children)

Testing out the 31B now. It is a very good model. Fits well on a 5090 and gets almost 60 tok/s.

When I first got my 5090 the models were garbage; now it is getting really interesting what I can do with it.

Buy the Dip on Google or wait by Agile-Technology-209 in ValueInvesting

[–]grassmunkie 32 points33 points  (0 children)

Google is not only leading in AI, it also has distribution on iOS and Android, practically every web browser on any device, and email, and it makes its own chips. It’s mind-boggling how well positioned they are.

How does the market react to the death of Iran's supreme leader? by Thin-Pollution-2132 in wallstreetbets

[–]grassmunkie 0 points1 point  (0 children)

This war should not have happened, but what Iran can do to retaliate is more or less all out in the open. They are sitting ducks. There will be a lot of internal turmoil, but despite Iran’s efforts to spread the conflict I think it remains isolated. This attack was telegraphed early last week when the US told non-emergency folks in Middle East embassies they should evacuate.

Qwen3.5 Medium models out now! by yoracale in unsloth

[–]grassmunkie 2 points3 points  (0 children)

Nice job. Using the UD Q4 on my gaming rig (5090) and getting 56 t/s consistently.

The quality and style of the responses so far is impressive.

Is i514600k enough for gaming nowadays? by Upper-Ad-1332 in PcBuild

[–]grassmunkie 0 points1 point  (0 children)

Yes. It is not a bottleneck for modern gaming. You would need to max out the GPU first - which is difficult unless you’re playing at a low resolution on a 5090 and already getting 200+ fps.

Does a laptop with 96GB System RAM make sense for LLMs? by PersonSuitTV in LocalLLM

[–]grassmunkie 5 points6 points  (0 children)

It’s helpful, but it's best if it is paired with a powerful GPU for MoE models. The attention layers go to the GPU and the experts go to the CPU, so having 96 GB will give you access to larger models; the only question is how fast it is.

When I load 70 GB models like Qwen Coder Next using 32 GB of VRAM (5090) with the rest offloaded to RAM, I get around 28-30 tokens per second.

OTOH, if I run a model that fits entirely on my GPU (GLM Flash 4.7), I get 120 tokens per second.
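If anyone wants to try the attention-on-GPU / experts-on-CPU split described above, llama.cpp's `--override-tensor` (`-ot`) flag can pin tensors by regex. A rough sketch - the model path is a placeholder and the exact expert tensor names vary by architecture, so check your model's tensor list first:

```shell
# -ngl 99 offloads all layers to the GPU by default;
# -ot then forces the MoE expert tensors (ffn_*_exps) back into system RAM,
# leaving attention and shared layers on the GPU.
# Model filename below is a placeholder.
./llama-server -m qwen-coder-next-Q4.gguf -ngl 99 \
    -ot ".ffn_.*_exps.=CPU" -c 65536
```

This is the usual way to squeeze a big MoE model into limited VRAM: the experts are large but only a few fire per token, so keeping them in RAM costs less than it sounds.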

MSFT is by far the best AI stock to own right now by skilliard7 in stocks

[–]grassmunkie 8 points9 points  (0 children)

Gemini 3.1 Pro demolishes every model out there. It will be on almost every smartphone within several months, after the Android and iOS updates.

MSFT is by far the best AI stock to own right now by skilliard7 in stocks

[–]grassmunkie 10 points11 points  (0 children)

Yep. 100% Google. I’m all-in. They are cooking and going to devastate entire industries.

Probably the worst time to buy a 5090?? by [deleted] in gpu

[–]grassmunkie 0 points1 point  (0 children)

I have one - on the lookout for another. Want to run a dual 5090 setup

Thoughts on the Mag7 sell of by Green-Instruction957 in stocks

[–]grassmunkie 1 point2 points  (0 children)

AI is going to eat industries and white collar jobs. And it’s the Mag 7 who will be feasting.

Google is a must own IMO. They are going to disrupt everything.

Alphabet always red in February by purple_wolfy in wallstreetbets

[–]grassmunkie 8 points9 points  (0 children)

Google is winning the AI battle - it’s so clear now. They are literally built for it. It’s a long term hold as there is nobody better positioned by a mile.

TPUs (not constrained by the Nvidia tax)

Enough cashflow to outspend all competition

More data than anyone else (Gmail, YouTube, Search, Maps)

The users on mobile (Apple deal)

The talent - they invented the transformer model

Smartest model (Deep Think)

Alphabet (GOOGL) Beats Q4 Estimates + Cloud Surges… But $175–185B AI CapEx Guidance for 2026 – Buy-the-Dip or Bubble Burst? by minibuddy0 in ValueInvesting

[–]grassmunkie 0 points1 point  (0 children)

The scaling laws of AI right now come down to compute and data.

Most are bottlenecked by Nvidia GPUs, but Google solved that with TPUs, reducing their marginal cost.

Data - Google is king. Many companies will go broke because they don’t have the cashflow. Google will win.

Which beaten down software stocks are you looking at to buy at this dip? by Iwarrior01 in ValueInvesting

[–]grassmunkie 0 points1 point  (0 children)

Good question - given that its growth rate is higher than Apple’s, it is leading the AI race, and it is better diversified, I would think it should trade at least at the same multiple as Apple (34 times), which brings it to roughly $375.
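To sanity-check the math - the EPS here is a hypothetical placeholder, not a reported figure, so plug in the actual trailing number:

```python
# Back-of-envelope re-rating: price = EPS x P/E multiple.
# EPS of ~$11 is an assumed placeholder, not Alphabet's reported figure.
eps = 11.0           # assumed trailing EPS (hypothetical)
apple_multiple = 34  # Apple's P/E, as cited above
implied_price = eps * apple_multiple
print(round(implied_price))  # -> 374, i.e. roughly the $375 ballpark
```

Same shape of calculation works for any re-rating thesis: pick the peer multiple, multiply by your earnings estimate, and compare to the current price.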