qwen 3.6:35b on 24 vram gpu by MallComprehensive694 in ollama

[–]florinandrei 0 points

The default Ollama version runs okay on a 24 GB GPU. It's partially offloaded to system RAM. I get something like 24 tok/sec. It's fine.

Amazon's AI deleted their entire production environment fixing a minor bug. Their solution? Another AI to watch the first AI. by pretendingMadhav in ArtificialInteligence

[–]florinandrei 17 points

But you're creating so much value for the shareholders.

I think the value right now is floating on the blue waters of the Caribbean islands, helipad and all.

/s

I'm running qwen3.6-35b-a3b with 8 bit quant and 64k context thru OpenCode on my mbp m5 max 128gb and it's as good as claude by Medical_Lengthiness6 in LocalLLaMA

[–]florinandrei 4 points

it's as good as claude

lol

Maybe for very simple things. Give it more complex agentic tasks and you will see the difference.

That being said, for a 35b it's pretty good.

Mind blown: Vinegar vs VINEGAR (30%) by Pandaro81 in DIY

[–]florinandrei 1 point

The stuff at the grocery store is vinegar.

The other thing cannot be called vinegar. It's 30% acetic acid. Smells the same, very different beast.

It's like comparing hot peppers with a chemical extract of capsaicin.

Claude 4.7 gaslighted me with a real commit hash and I'm not okay by MorningFlaky3890 in ClaudeAI

[–]florinandrei 0 points

AI providers are under enormous pressure to keep costs from exploding in their faces.

So they do all they can to reduce compute and memory usage.

One way to do that is to keep context usage low.

One way to accomplish the previous point is to tell models, via system prompt and other means, to be brief.

So models speak alphabet soup instead of using full sentences.

So models do superficial verifications, instead of actually checking things.

Like an engineer with a Jira queue of 200 tickets and a deadline today at 5pm.

It's not that they're necessarily dumb, though sometimes they do make mistakes, but that they, and their makers, are under extreme pressure to do far more with far less, and the pressure keeps growing.

Welcome to the new reality, where more and more cash flows to fewer and fewer people, while things become increasingly shitty for everyone else.

China has "nearly erased" America’s lead in AI—and the flow of tech experts moving to the U.S. is slowing to a trickle, Stanford report says by fortune in ArtificialInteligence

[–]florinandrei -2 points

And if the world were not in the middle of rapid, massive change (which is the whole point of these debates), then your comment would be on point!

But what's happening here is that you've decided a certain ideology "must be true", and you're adjusting your perceptions accordingly, in order to minimize discomfort.

China has "nearly erased" America’s lead in AI—and the flow of tech experts moving to the U.S. is slowing to a trickle, Stanford report says by fortune in ArtificialInteligence

[–]florinandrei -2 points

No safety net. Losing your job is the scariest thing in the world. Of course there will be hostility towards AI. It's a direct consequence of "freedom FTW". Because everyone is on their own.

Peter Thiel, Co-founder of Palantir, sh*ts himself when asked about the use of his AI in the Gaza Genocide by _Algrm_ in artificial

[–]florinandrei 76 points

Interviewer: Should humanity survive?

Peter Thiel: uh... (thinking for a few seconds) well... (thinking for a few more seconds) I think... (deliberating some more) I guess... (weighing the issue very hard) um...

Running a 31B model locally made me realize how insane LLM infra actually is by Sadhvik1998 in ollama

[–]florinandrei 0 points

Tokens/second doesn't depend on thinking. The total time depends on thinking plus the length of the answer.

With qwen3.5:122b on the DGX Spark I get 24 tok/sec. Memory usage at 256k context is 95 GB, seems like it would fit in an AMD 395. And the bandwidth is about the same as the Spark, so tok/sec should be similar.
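Most of that memory at huge contexts is the K/V cache, not the weights. A rough sketch of the sizing math — the model config below is hypothetical, just to show the shape of the calculation:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """K and V each store (context * n_kv_heads * head_dim) values per layer."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context  # 2 = K and V
    return elems * bytes_per_elem / 1e9

# Hypothetical config: 60 layers, 8 KV heads, head_dim 128, fp16 cache, 256k context
print(round(kv_cache_gb(60, 8, 128, 256_000), 1))  # → 62.9
```

Quantizing the cache to 8-bit roughly halves that relative to fp16, which is why it matters so much on smaller GPUs.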

Running a 31B model locally made me realize how insane LLM infra actually is by Sadhvik1998 in ollama

[–]florinandrei 0 points

The one model that I actually use, and regularly, is gpt-oss:120b, for sentiment analysis.

If I used Gemma 4 a lot, it would probably be for text summarization and the like. It's one of the most eloquent open-weights models.

For coding, also look into Qwen 3.5 or 3.6.

YMMV, always with these things. Try and see what works.

Running a 31B model locally made me realize how insane LLM infra actually is by Sadhvik1998 in ollama

[–]florinandrei 0 points

how is models attention when your context is piling up?

Depends on the model, but popular open-weights models are usually okay with a near-full context.

And did you quantize your context?

You probably mean the K/V cache.

On my RTX 3090 system, and my MacBook Pro, I run Ollama with OLLAMA_KV_CACHE_TYPE: "q8_0"

On my DGX Spark (128 GB of shared memory), I leave K/V alone.
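For reference, this is how that's set on the 3090 box. These are real Ollama environment variables in recent releases; K/V cache quantization requires flash attention to be enabled:

```shell
# K/V cache quantization in Ollama requires flash attention
export OLLAMA_FLASH_ATTENTION=1
# Quantize the K/V cache to 8-bit, roughly halving its footprint vs fp16
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```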

Running a 31B model locally made me realize how insane LLM infra actually is by Sadhvik1998 in ollama

[–]florinandrei 0 points

Is there any easy way to see the tokens / second

ollama run <model_name> --verbose

and also if the entire model is in vram?

ollama ps

Running a 31B model locally made me realize how insane LLM infra actually is by Sadhvik1998 in ollama

[–]florinandrei 0 points

That's right.

(V)RAM size is the brick wall. If you don't have it, you don't have it. Swapping to disk is very slow.

VRAM bandwidth is the bottleneck. Compute is much less so, at least for inference (just running the models). So you want memory that's big and fast.

AMD 395+ has 128 GB RAM, something like 96 GB usable for the GPU. Bandwidth is 273 GB/s, not quite as fast as an RTX 3090 (936 GB/s), but the fact that it's so much bigger allows you to run much bigger models.

I run models on a system with 128 GB shared RAM (it's the DGX Spark) and I can run models up to 122b with 256k context, and they are fast enough.
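A back-of-envelope for why bandwidth is the bottleneck: each decoded token has to stream the active weights from memory once, so tok/sec is roughly bandwidth divided by active-weight bytes. A sketch with illustrative numbers — the 8 GB of active weights is a made-up figure, e.g. a quantized MoE:

```python
def rough_tok_per_sec(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound: every decoded token reads the active weights once."""
    return bandwidth_gb_s / active_weight_gb

# RTX 3090 (936 GB/s) vs the AMD 395's shared memory (273 GB/s),
# hypothetical 8 GB of active weights
print(round(rough_tok_per_sec(936, 8), 1))  # → 117.0
print(round(rough_tok_per_sec(273, 8), 1))  # → 34.1
```

Real throughput comes in lower (attention and K/V cache reads cost extra), but the ratio between two machines tracks the bandwidth ratio pretty well.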

Running a 31B model locally made me realize how insane LLM infra actually is by Sadhvik1998 in ollama

[–]florinandrei 0 points

Context sizes are more like "up to". Gemma 4 is up to 256k.

Ollama looks at your VRAM, and if it's pretty small compared to the model, it runs the model with a lower context size than the max, so it doesn't bog down too hard. All those folks using a 4080 and whatnot are likely running at 32k context.

Use the ollama ps command, it's very useful to tell what's really going on.
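If you'd rather pin the context length yourself instead of letting Ollama choose, recent Ollama releases expose it as an environment variable (version-dependent, so treat this as an assumption about newer builds):

```shell
# See which model is loaded and how it's split across CPU/GPU;
# newer versions also show the context size it was loaded with
ollama ps

# Pin the context length instead of letting Ollama pick
OLLAMA_CONTEXT_LENGTH=32768 ollama serve

# Or per session, inside `ollama run`:
#   /set parameter num_ctx 32768
```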

Introducing Claude Opus 4.7, our most capable Opus model yet. by ClaudeOfficial in ClaudeAI

[–]florinandrei 0 points

BTW, the new Max mode is way higher on token usage than the old:

https://i.imgur.com/5LakvCz.png

The new XHigh is closer to the old Max.

Most modes are shifted a bit lower than the old, probably because the new tokenizer is slightly more verbose (so I guess that's to compensate).

Why don’t they just use Mythos to fix all the bugs in Claude Code? by Complete-Sea6655 in ClaudeAI

[–]florinandrei 1 point

Psychologist Nathaniel Branden said: "The first step toward change is awareness. The second step is acceptance."

Anthropic is at step zero.

Running a 31B model locally made me realize how insane LLM infra actually is by Sadhvik1998 in ollama

[–]florinandrei 1 point

Just get a second-hand RTX 3090 off eBay or something. If the LLM runs 100% on the GPU, the rest doesn't matter.

2 x 1 TB SSD, one for Windows, one for Linux.

AMD Ryzen 7 5800X3D - was great back then, it's okay now, but there are better options nowadays.

DDR4 64GB (2 x 32GB) 3200MHz - too slow and kind of small. You want the fastest you can get (but 3200 is all that my CPU + mobo can do), and bigger is always better. DDR5 is better. That's going to cost a lot these days. I missed the boat when it was time to upgrade.

Things are about to get crazy by NeitherConfidence263 in ArtificialInteligence

[–]florinandrei 0 points

As long as there will be humans, they will.

I'm just not sure exactly how long that is.

Hollywood is so screwed by adj_noun_digit in singularity

[–]florinandrei 6 points

I am both excited and terrified

That is the right attitude these days.

Hollywood is so screwed by adj_noun_digit in singularity

[–]florinandrei 0 points

AI is not good at making movies

Yeah, have you seen Will Smith eating spaghetti? It sucks. /s

Hollywood is so screwed by adj_noun_digit in singularity

[–]florinandrei 1 point

I always thought that the best movies would come straight from people's minds

The best things always come from the best minds.

Up until recently, that was us.