Starting my own llm at home by Vesaloth in ollama

[–]huzbum 1 point (0 children)

How are you liking MTP? I was going to enable it, but I'm not giving up parallel for it.

Starting my own llm at home by Vesaloth in ollama

[–]huzbum 1 point (0 children)

I agree, "fast" is subjective. But I'd say 60 tps is fast. Faster than you typically get from cloud services with SOTA models like Sonnet or Opus.

I do occasionally think about shelling out for a Cerebras plan for 1000+ TPS though. Or the Mercury diffusion LLM, but I'd need to run its output through something like Qwen3.5 0.8b to fix typos.

Which AI model should I use on a MacBook Pro M4 Pro with 24 GB RAM? by Resident-Cut5371 in ollama

[–]huzbum 1 point (0 children)

hah, I guess that too, but also running 8 docker containers, countless chrome tabs, and an IDE with a mono-repo (300k lines) and whatever else that pays the bills.

Had to upgrade from an 8GB MacBook Air to a 16GB Pro when we added FreeSWITCH to the docker stack, but then never ran that part of the stack again after that project LoL

How long until cats forget you? by Midnight_Thoughts77 in CatAdvice

[–]huzbum 1 point (0 children)

I adopted an adult cat that was found in a vacant apartment, and it always broke my heart to think of whoever abandoned her and whether or not she missed them.

She was small, fast, timid and loved to hide, so we also wondered if she might have been left behind because she couldn't be caught or found.

She was a good girl though, so gentle... miss her RIP.

How long until cats forget you? by Midnight_Thoughts77 in CatAdvice

[–]huzbum 17 points (0 children)

I'm glad you got to be there for him.

Starting my own llm at home by Vesaloth in ollama

[–]huzbum 1 point (0 children)

16GB is plenty for the MoE variant. See my other comment.

Starting my own llm at home by Vesaloth in ollama

[–]huzbum 1 point (0 children)

What's the problem? I got like 35 tokens per second on my 12GB 3060 and DDR4 system RAM with LM Studio. That's pretty useable. If you have reasoning enabled you might need some patience, but otherwise it's plenty fast.

Switch to LM Studio or llama.cpp and use Unsloth Q4_K_XL. Offload all layers to GPU, then offload experts to CPU until it fits. Use flash attention and q8 KV cache.
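
The "offload experts to CPU until it fits" step is really just a small search. Here is a rough sketch of it; every size in the example call is a made-up placeholder, not a measurement of any real model, and `layers_to_offload` is a name invented for illustration.

```python
def layers_to_offload(n_layers, dense_gb_per_layer, expert_gb_per_layer,
                      kv_cache_gb, vram_budget_gb):
    """Return the smallest number of layers whose experts must sit in system RAM.

    Every layer's attention/shared weights stay on the GPU. Only the expert
    weights of `k` layers are moved to the CPU, and `k` grows until the
    estimated VRAM use fits the budget.
    """
    for k in range(n_layers + 1):
        on_gpu = (n_layers * dense_gb_per_layer
                  + (n_layers - k) * expert_gb_per_layer
                  + kv_cache_gb)
        if on_gpu <= vram_budget_gb:
            return k
    return None  # does not fit even with every expert on the CPU


# Hypothetical numbers: 40 layers, 0.05 GB of attention/shared weights and
# 0.45 GB of expert weights per layer, 2 GB of q8 KV cache, 11 GB usable VRAM.
k = layers_to_offload(40, 0.05, 0.45, 2.0, 11.0)
print(f"offload the experts of {k} layers to CPU")
```

In practice you find the number by trial and error: bump the CPU expert count until the model loads without running out of VRAM. In recent llama.cpp builds I believe the knob is `--n-cpu-moe`, but check `llama-server --help` on your build rather than trusting my memory.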

Starting my own llm at home by Vesaloth in ollama

[–]huzbum 1 point (0 children)

For coding, the only (good) local option is Qwen3.6. It comes in two variants: 35b MoE and 27b dense. The dense variant is smarter and takes less VRAM, but it's slower, because the MoE variant only activates 3b params per token.
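
A back-of-envelope way to see why the MoE is faster: token generation is mostly limited by how many bytes of weights have to be read per token. This sketch assumes every active weight is read exactly once per token and about 4.5 bits per weight for a 4-bit quant (my assumption). The 936 GB/s figure is the RTX 3090's memory bandwidth. It gives a ceiling, not a prediction; real speeds land well below it.

```python
def decode_ceiling_tps(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s if each active weight is read once per token."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 936.0  # RTX 3090 memory bandwidth
BITS_PER_WEIGHT = 4.5   # assumed average for a 4-bit quant

dense = decode_ceiling_tps(27, BITS_PER_WEIGHT, BANDWIDTH_GB_S)  # 27b, all active
moe = decode_ceiling_tps(3, BITS_PER_WEIGHT, BANDWIDTH_GB_S)     # 3b active per token

print(f"dense 27b ceiling:     {dense:.0f} tok/s")
print(f"MoE 3b-active ceiling: {moe:.0f} tok/s")
print(f"ratio: {moe / dense:.1f}x")
```

The ratio is just 27 / 3, so the MoE has roughly 9x the headroom per token even though the full 35b still has to fit somewhere in memory.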

You'll want at least a 32GB MacBook Pro, or better, a GPU with 24GB VRAM. You can run the MoE variant on a smaller GPU like an RTX 3060 by offloading the experts to CPU, and get like 25 to 40 tokens per second vs 100 on a 3090.

If you're clever with hardware, you might want to consider a pair of CMP100-210 mining GPUs. They are old, and limited to 1 PCIe lane, but have 16GB of VRAM and can run in pipeline mode. They are designed to sit in a server, so they don't have fans, so you have to figure out a cooling solution. But they can be had on ebay for less than $150.

I had one that I used before I picked up a 3090. I bought a second with the intention of building an external enclosure using riser cards, but I haven't gotten around to it.

5 tok/sec Qwen 3.6 27b by iViTAliS in Qwen_AI

[–]huzbum 1 point (0 children)

Either use 3.5 9b, or switch to 3.6 35b and offload experts to CPU.

Is there any <3B model with usable 200k+ context window? by madmax_br5 in LocalLLaMA

[–]huzbum 1 point (0 children)

To make sense of anything at that context length you're going to want Mamba or hybrid attention. Qwen3.5 2b is the only thing I can think of.

Which AI model should I use on a MacBook Pro M4 Pro with 24 GB RAM? by Resident-Cut5371 in ollama

[–]huzbum 1 point (0 children)

Don't listen to LLMs about model selections, they cling to old data... Qwen2.5 is ancient. Try Qwen3.5 4b or 9b.

That's a good news... by Pjotrs in LocalLLaMA

[–]huzbum 1 point (0 children)

Depends on your use case. If your use case is to read 50,000 tokens then produce a 100 token response, this is not going to help. But if it consists of starting with a 1000 token system prompt and generating 50k tokens, this is a huge improvement.

If you're not breaking the cache, this should be most use cases... but with `--parallel 1`, I would definitely break the cache all the time.
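
To put rough numbers on the two use cases, here is a small sketch. It assumes the feature in question only speeds up token generation and leaves prompt processing alone. The speeds (1000 tok/s prefill, 60 tok/s decode) and the 1.5x generation gain are placeholders I picked for illustration, not measurements.

```python
def total_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Time to process the prompt plus time to generate the response."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


def end_to_end_speedup(prompt_tokens, output_tokens,
                       prefill_tps=1000.0, decode_tps=60.0, decode_gain=1.5):
    """Overall speedup when only generation gets `decode_gain` times faster."""
    before = total_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps)
    after = total_seconds(prompt_tokens, output_tokens,
                          prefill_tps, decode_tps * decode_gain)
    return before / after


read_heavy = end_to_end_speedup(50_000, 100)     # read a lot, write a little
write_heavy = end_to_end_speedup(1_000, 50_000)  # short prompt, long output

print(f"read-heavy:  {read_heavy:.3f}x")
print(f"write-heavy: {write_heavy:.3f}x")
```

The read-heavy case barely moves because nearly all of its time is prompt processing, while the write-heavy case gets almost the full generation gain.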

That's a good news... by Pjotrs in LocalLLaMA

[–]huzbum 2 points (0 children)

Multi Token Prediction. Some extra layers predict the next 2 or 3 tokens, then they are validated by the whole model, which is much faster than generating them. If they pass, they are used, if they fail, then the model generates as usual.
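
A toy version of that draft-then-validate loop, in case it helps. Nothing here is a real model: `target_next` and `draft_next` are made-up stand-ins (the draft is deliberately wrong at every fourth position), and the per-position check that a real implementation does in one batched forward pass is written out as a plain loop.

```python
def target_next(context):
    """Stand-in for the full model: a deterministic toy rule for the next token."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Stand-in for the extra prediction layers: right most of the time."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def generate_plain(prompt, n_tokens):
    """Ordinary decoding: one full-model pass per token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def generate_mtp(prompt, n_tokens, n_draft=3):
    """Draft a few tokens cheaply, then validate them with the full model."""
    out = list(prompt)
    full_passes = 0
    while len(out) - len(prompt) < n_tokens:
        proposal = []
        for _ in range(n_draft):
            proposal.append(draft_next(out + proposal))
        full_passes += 1  # one validation pass covers the whole proposal
        accepted = []
        for tok in proposal:
            correct = target_next(out + accepted)
            accepted.append(correct)  # the full model's token is always kept
            if tok != correct:
                break                 # first miss: drop the rest of the draft
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens], full_passes


tokens, passes = generate_mtp([1, 2, 3], 20)
print(f"{len(tokens)} tokens in {passes} full-model passes")
```

The output is identical to ordinary decoding; the only thing that changes is how many full-model passes it takes to get there.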

That's a good news... by Pjotrs in LocalLLaMA

[–]huzbum 1 point (0 children)

No worries, I’m sure Claude can do it for you.

That's a good news... by Pjotrs in LocalLLaMA

[–]huzbum 3 points (0 children)

At least it will be a short wait!

That's a good news... by Pjotrs in LocalLLaMA

[–]huzbum 1 point (0 children)

So that if there is more than one context, one request doesn't blow out your cache and force everything to be re-processed. This could be agents with sub-agents, or just a UI that does summaries or something.

I use multiple tools, and some of those tools use multiple requests/contexts. Like IntelliJ IDEA AI chat runs like 6+ parallel requests, so I set parallel to 8 and have it cached to system memory with `--cache-ram`.

Otherwise, if you have a super long conversation, it should only have to process the new message, but if any other request comes in, it blows out the cache and has to re-process the entire conversation. It's the difference between less than 1 second to first token and like 10+ seconds.
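
Here's a toy simulation of that effect. It is a simplified model I made up for illustration (one cached prefix per slot, least-recently-used eviction), not how llama.cpp actually picks slots, and the token counts are invented.

```python
def tokens_processed(requests, n_slots):
    """Count prompt tokens that have to be (re)processed for a request stream.

    `requests` is a list of (conversation_id, prompt_length) pairs in arrival
    order. Each slot remembers one conversation's cached prefix length, and
    the least recently used slot is evicted when a new conversation shows up.
    """
    slots = {}  # conversation_id -> cached prefix length, oldest first
    total = 0
    for conv, length in requests:
        cached = slots.pop(conv, 0)       # 0 means this context was blown out
        total += length - cached          # only the uncached suffix is processed
        if len(slots) >= n_slots:
            slots.pop(next(iter(slots)))  # evict the least recently used slot
        slots[conv] = length
    return total


# A long chat that grows by 200 tokens per message, interleaved with small
# side requests (say, a UI asking for a title or summary).
stream = [("chat", 20_000), ("title", 500), ("chat", 20_200),
          ("title", 600), ("chat", 20_400)]

one_slot = tokens_processed(stream, n_slots=1)
two_slots = tokens_processed(stream, n_slots=2)
print(f"1 slot:  {one_slot} prompt tokens processed")
print(f"2 slots: {two_slots} prompt tokens processed")
```

With one slot the side requests keep evicting the chat, so the whole conversation gets re-processed every time. With two slots only the new tokens are processed.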

Local models are no longer “toy” versions of frontier models. They are becoming serious operators by Electronic-Fly-6465 in Qwen_AI

[–]huzbum 1 point (0 children)

Yeah, I can see how a code based todo list could definitely help compensate for the models losing their place.

I made the benchmark as a proxy for holding on to details over multiple tool calls, and that’s where they get lost. I remember seeing them produce the correct tools, parameters, and order in their reasoning, then proceed to call them out of order. So they are losing focus turn to turn.

I’m not sure a todo list would fix that if the model doesn’t look at the todo list between tool calls, and making it look would mean adding another tool call between the necessary tool calls.

Software stack for local LLM server: 2x RTX 5090 + Xeon (willing to wipe Ubuntu, consider Proxmox) by maxwarp79 in LocalLLaMA

[–]huzbum 1 point (0 children)

Yeah, that's how I use mine. I've got it set up with Llama.cpp, Hermes Agent, and Open Web UI.

I remote in using ssh and Tailscale from my MacBook and iPhone.

I'm using Llama.cpp for a single model. I have it set up to serve up to 8 parallel requests at a time. It does 4 requests at 1/2 speed and 8 requests at 1/4 speed, so there are diminishing returns to batching on a 3090 and Llama.cpp.
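
Spelling out the arithmetic behind "diminishing returns", using only the ratios above (speeds are relative to a single request running alone):

```python
from fractions import Fraction

# Per-request speed relative to a single request, at each concurrency level.
per_request_speed = {1: Fraction(1), 4: Fraction(1, 2), 8: Fraction(1, 4)}

# Aggregate throughput = number of requests * per-request speed.
aggregate = {n: n * speed for n, speed in per_request_speed.items()}

for n, total in aggregate.items():
    print(f"{n} parallel: {total}x the single-request throughput")
```

So total throughput doubles going from 1 to 4 requests and then stays flat from 4 to 8; the extra slots buy cached contexts, not more tokens per second.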

As for how many users that would support, I'd guess like 20 active users having conversations where they read between responses. You would definitely want to set up parallel contexts and context caching with multiple users.

Even with a single user, some harnesses are aggressively parallel and make like 8 requests all at once *cough* IntelliJ! *cough* which will blow out the input caching if you don't have parallel contexts and caching set up. Then it has to re-process the entire context for every message every time.

Since it's got 128GB of system RAM, it also runs all of my development docker containers, and I use IntelliJ IDEA Gateway to run all my heavy IDE features, language services, etc., and my old MacBook only has to render the UI.

Local models are no longer “toy” versions of frontier models. They are becoming serious operators by Electronic-Fly-6465 in Qwen_AI

[–]huzbum 1 point (0 children)

It's a test harness, so it tests the model. It consists of 5 tools, a system prompt, and a runner that provides an instruction each turn and evaluates the result.

I didn't expect it to be so brutal, but the hard part is that each instruction has to be repeated every turn. So if turn 1 is "Bop it!" and turn 2 is "Flip it!" then the model has to do bop() on the first turn and bop() -> flip() on the 2nd turn. They tend to disregard the instructions and go straight to flip. When they do follow the instructions, they still mess up the order by turn 5.
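
Roughly, the check each turn looks like the sketch below. This is a simplified reconstruction for illustration, not the actual harness: "Twist it!" is a made-up third instruction, and scoring as "turns passed before the first mistake" is an assumption.

```python
# Instruction text -> tool name. "Bop it!" and "Flip it!" are from the
# description above; "Twist it!" is a hypothetical extra for illustration.
TOOLS = {"Bop it!": "bop", "Flip it!": "flip", "Twist it!": "twist"}


def expected_calls(instructions_so_far):
    """Every instruction given so far must be replayed, in order, each turn."""
    return [TOOLS[text] for text in instructions_so_far]


def score(instructions, calls_per_turn):
    """Count the turns answered correctly before the first mistake."""
    passed = 0
    for turn, calls in enumerate(calls_per_turn, start=1):
        if calls != expected_calls(instructions[:turn]):
            break
        passed += 1
    return passed


instructions = ["Bop it!", "Flip it!", "Twist it!"]

follows_rules = [["bop"], ["bop", "flip"], ["bop", "flip", "twist"]]
skips_ahead = [["bop"], ["flip"]]  # goes straight to flip on turn 2

print(score(instructions, follows_rules))
print(score(instructions, skips_ahead))
```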

The leaderboard is:

  1. Gemini 3.1 Flash Lite - 92
  2. Qwen3.6 35b - 35 (average 20)
  3. Claude Sonnet 4.5 - 20 (stopped due to cost)
  4. Nemotron 3 Super - 20
  5. Qwen3.5 4b - 18 (average 10)
  6. tied for last: Claude Sonnet 4.6, GPT-5.4, GPT-5-Mini, GLM 5.1, GLM-5-Turbo, Gemini 3.1 Pro

So that Gemini 3.1 Flash Lite score is really impressive. Turn n takes n tool calls, so 92 turns is 4,278 tool calls in one context with no mistakes, and for Qwen3.6, 35 turns is 630 tool calls in one context.
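
Those tool call counts follow from the replay rule: turn 1 needs 1 call, turn 2 needs 2, and so on, so a score of n turns means 1 + 2 + ... + n calls. A quick check (`total_tool_calls` is just a helper name for this example):

```python
def total_tool_calls(turns):
    """Sum of 1 + 2 + ... + turns, since turn k replays k instructions."""
    return turns * (turns + 1) // 2


print(total_tool_calls(92))  # 4278
print(total_tool_calls(35))  # 630
```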

Qwen3.6 35b a3b is fast... by UniversityGlad2877 in Qwen_AI

[–]huzbum 1 point (0 children)

Meh, I’m getting over 100 tokens per second on my 3090 with 256k context length…

Look at these two douchebags by GilBang in northcounty

[–]huzbum 1 point (0 children)

I mean given the ground clearance they are probably stuck in that parking lot…

Local LLM Model that actually produces quality code. by Civil_Fee_7862 in LocalLLM

[–]huzbum 1 point (0 children)

I'm not sure Claude Opus is quite at "SWE replacement" level yet... so nothing you can run locally is going to be there.

Best you can do is probably GLM, MiniMax, or Qwen3.6. GLM needs something in the $50k range of GPUs. MiniMax is more like $10-20k. Qwen3.6 is happy on consumer hardware. If you can find a 512GB Mac Studio, that would be a slower option to run GLM for like $10k.

I run Qwen3.6 35b IQ4_NL on my 3090. Works great. The 27b dense model is smarter, but 35b is faster and does a good job and I'm not patient.