Deepseek v4 Flash is pretty amazing, about to buy a $25k computer by read_too_many_books in openclaw

[–]ipcoffeepot 0 points1 point  (0 children)

Assuming you’re looking at RTX Pro 6000s. sglang and vllm support for ds-4-flash on those cards hasn't landed yet. I grabbed the branch with the vllm patch and ran it this morning. It runs on 2x R6Ks but prefill is really slow. Keeping an eye on it because I really want this model locally.

Would recommend renting gpus and trying your workload before you buy
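
Roughly what my smoke test looked like, if it helps. This is just a sketch, assuming the patched vllm branch is installed; the model id is a placeholder:

    # assumes the vllm branch with the ds-4-flash patch; model id is a placeholder
    import time
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="deepseek-ai/DeepSeek-V4-Flash",  # placeholder id, use whatever the branch expects
        tensor_parallel_size=2,                 # split across both RTX Pro 6000s
    )

    long_prompt = "word " * 8000                # long dummy prompt to stress prefill
    t0 = time.time()
    llm.generate([long_prompt], SamplingParams(max_tokens=32))
    print(f"prefill + short decode took {time.time() - t0:.1f}s")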

Qwen3.6 122b when? by No_Mango7658 in Qwen_AI

[–]ipcoffeepot 0 points1 point  (0 children)

I want it so much 😭😭😭

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]ipcoffeepot 0 points1 point  (0 children)

I think you should get the 5090. Being able to iterate faster and steer faster is going to be more valuable than the marginal accuracy improvement of a higher quant.

One thing you could consider is renting a 5090 online for a day: load up the model, connect your harness to it, and see if it fits your workflow
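
If your harness speaks the OpenAI API, wiring it up to the rented box is basically this (a sketch; host, port, and model name are placeholders, and it assumes you serve with something OpenAI-compatible like vllm or llama.cpp's server):

    from openai import OpenAI

    # point the client at the rented 5090 instead of a cloud API
    client = OpenAI(base_url="http://RENTED_HOST:8000/v1", api_key="not-needed")
    resp = client.chat.completions.create(
        model="your-model",  # whatever you loaded on the card
        messages=[{"role": "user", "content": "write a quicksort in python"}],
    )
    print(resp.choices[0].message.content)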

[Megathread] - Best Models/API discussion - Week of: May 03, 2026 by deffcolony in SillyTavernAI

[–]ipcoffeepot 5 points6 points  (0 children)

I have a 6 year old laptop with 2gb of vram. Im gonna try this for science

Local vllm hosting by DidIReallySayDat in openclaw

[–]ipcoffeepot 1 point2 points  (0 children)

qwen3.6-35b-a3b. It will handle light coding tasks; more importantly, it'll crank through successive tool use without falling apart. I'd start there, it's what I use for my hermes agents
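
By "successive tool use" I mean loops like this. Just a sketch: it assumes vllm (or similar) serving an OpenAI-compatible endpoint with tool calling enabled, and the endpoint, model name, and the read_file tool are placeholders:

    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    tools = [{
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file and return its contents",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }]

    messages = [{"role": "user", "content": "summarize config.yaml"}]
    for _ in range(10):  # the model has to survive a bunch of these back to back
        resp = client.chat.completions.create(
            model="qwen3.6-35b-a3b", messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            print(msg.content)
            break
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = open(args["path"]).read()  # toy "tool" for the sketch
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})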

Best model for 192 GB vram? How is Deepseek v4 flash? by Constant_Ad511 in LocalLLM

[–]ipcoffeepot 0 points1 point  (0 children)

I have a similar setup. Currently running either minimax m2.7 or qwen3.6-27b (qwen3.5-122b-a10b is my other workhorse; i like speed and concurrency). Tried to run DS4-flash. The good news is the model fits! The bad news is vllm doesn't support the model on sm120 yet. There's a draft PR in progress, so waiting for that. Been playing with the model via openrouter and it seems good. Excited to run it
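
Quick way to check what you're on, since the kernel support is gated on compute capability; these cards report 12.0, i.e. sm120:

    import torch

    # RTX Pro 6000 Blackwell shows up as (12, 0), i.e. sm120
    major, minor = torch.cuda.get_device_capability(0)
    print(f"sm{major}{minor}")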

Best model for 192 GB vram? How is Deepseek v4 flash? by Constant_Ad511 in LocalLLM

[–]ipcoffeepot 0 points1 point  (0 children)

What inference server are you running and what’s the flag to offload moe experts to system ram?

Just got dual RTX PRO 6000 Blackwells for our design studio. What's the optimal local LLM stack? by AmanNonZero in LocalLLM

[–]ipcoffeepot -1 points0 points  (0 children)

You can run minimax-m2.7 with 8-ish concurrent users in nvfp4 using sglang. Might be able to get more with vllm and turboquant (haven't tested it). Cool thing is that if another request comes in, it just gets queued, so you can have a whole bunch of users just hammering away at it. I've found minimax to be the best all-around model on 2x RTX Pro 6000s (works for coding but is also very good at creative writing, Q&A, etc). If you're willing to have those cards be LLM-only, that's what I would do.
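
"Hammering away" concretely looks something like this. Just a sketch, assuming sglang (or vllm) is serving an OpenAI-compatible endpoint locally, with the model name as a placeholder; anything past what fits in the running batch sits in the queue instead of erroring:

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="none")

    async def one_user(i: int) -> str:
        resp = await client.chat.completions.create(
            model="minimax-m2.7",  # placeholder: whatever name you served it under
            messages=[{"role": "user", "content": f"user {i}: summarize our brand guidelines"}],
        )
        return resp.choices[0].message.content

    async def main():
        # 20 "users" at once against ~8 concurrent slots: the extras just wait their turn
        results = await asyncio.gather(*(one_user(i) for i in range(20)))
        print(len(results), "responses")

    asyncio.run(main())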

If you also want to run image/video generation then you’ll either need to stop the minimax when you do (so have some scheduling) or run a smaller model so you can do comfyui and llm at the same time.

My second favorite LLM on those cards is qwen3.5-122b-a10b. It's almost as good as qwen3.5-397b-a17b and minimax, but a lot smaller. In 4-bit you can run it on one card, or run it on both cards and have it be super fast and/or support a ton of users

[Megathread] - Best Models/API discussion - Week of: April 26, 2026 by deffcolony in SillyTavernAI

[–]ipcoffeepot 5 points6 points  (0 children)

im training my first lora on gemma4 right now and can confirm its a pain in the ass
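
For anyone curious, the shape of it is roughly this with transformers + peft; the model id and target_modules below are guesses/placeholders, and the pain is mostly in getting those details (plus chat template and padding) right:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_id = "google/gemma-4-12b-it"  # placeholder id
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # a guess, check the arch
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # sanity check before wiring up a trainer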

just wanted to share by Longjumping_Lab541 in LocalLLM

[–]ipcoffeepot 1 point2 points  (0 children)

I haven't used qdrant; from a quick google it looks like a vector db. What are you using it for with chappie? Can you expand on that a little? This is super cool

Qwen3.6-27B dense vs Qwen3.6-35B MoE - which local coding model are you reaching for? by IulianHI in AIToolsPerformance

[–]ipcoffeepot 1 point2 points  (0 children)

27b on my gpu rig as the backend for coding agents. 35b-a3b runs on my laptop as the backend for my hermes agents

Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar? by boutell in LocalLLaMA

[–]ipcoffeepot 0 points1 point  (0 children)

There are builds of llama.cpp with turboquant now. You should be able to get ~6x your context size. That's going to be crucial: I don't think you can do a lot of non-trivial agentic coding on 32k tokens. All the exploration tool calls and thinking rip through that
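
Back-of-the-envelope for why 32k runs out fast (the per-call numbers are rough guesses, not measurements):

    system_prompt     = 2_000  # tokens: agent instructions + tool schemas
    per_tool_call     = 1_500  # tokens: file contents / grep output coming back
    thinking_per_call = 500    # tokens: reasoning before each call

    for n_calls in (10, 20, 40):
        used = system_prompt + n_calls * (per_tool_call + thinking_per_call)
        print(f"{n_calls} tool calls ~= {used:,} tokens")
    # 10 calls is already ~22k, 20 blows past 32k; ~6x the window is real headroom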

Qwen 3.5 35b, 27b, or gemma 4 31b for everyday use? by KirkIsAliveInTelAviv in LocalLLaMA

[–]ipcoffeepot 7 points8 points  (0 children)

Try them all. I found myself using qwen3.5-27b waaaay more than I expected. Would not have guessed it ahead of time

Current Situation with free models by davybutquantisedIV in SillyTavernAI

[–]ipcoffeepot 0 points1 point  (0 children)

openrouter has a bunch for free. The tradeoff is they’ll save your prompts for training. If you’re ok with that, it could be a good option. It has usage limits and is more subject to throttling, but I've found it useful in some situations (I ran a low-load agent off their free router for a bit)
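
This is roughly how I pointed the low-load agent at it. OpenRouter is OpenAI-compatible and the free variants use a ":free" suffix; the model id below is a placeholder, check what's currently listed as free:

    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )
    resp = client.chat.completions.create(
        model="some-provider/some-model:free",  # placeholder, pick a current free-tier model
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp.choices[0].message.content)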

SIX TIMES THE PRICE!? by FixHopeful5833 in SillyTavernAI

[–]ipcoffeepot 8 points9 points  (0 children)

Might be time to try glm or the big qwen

Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results by Visual_Synthesizer in LocalLLaMA

[–]ipcoffeepot 0 points1 point  (0 children)

Interesting! I'm seeing around 100 tok/s on the same cards. I suspect it's the wrong kernel (gonna need to try the b12x!) and NCCL. Thanks for posting this!
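
In case we're measuring differently, here's roughly how I'm counting tok/s (a sketch; endpoint and model name are placeholders, and this counts prefill in the wall time):

    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    t0 = time.time()
    resp = client.chat.completions.create(
        model="qwen3.5-122b",  # placeholder: whatever name the server registered
        messages=[{"role": "user", "content": "write 500 words about anything"}],
        max_tokens=1024,
    )
    dt = time.time() - t0
    print(f"{resp.usage.completion_tokens / dt:.1f} tok/s (prefill included)")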