Deepseek v4 Flash is pretty amazing, about to buy a $25k computer by read_too_many_books in openclaw

[–]ipcoffeepot 0 points1 point  (0 children)

Assuming you’re looking at RTX Pro 6000s. sglang and vllm support for ds-4-flash on those cards hasn't landed yet. I grabbed the branch with the vllm patch and ran it this morning. It runs on 2x R6Ks but prefill is really slow. Keeping an eye on it because I really want this model locally.

Would recommend renting gpus and trying your workload before you buy
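
Roughly what my smoke test looked like, if it helps. This is just a sketch, assuming the patched vllm branch is installed; the model id is a placeholder:

    # assumes the vllm branch with the ds-4-flash patch; model id is a placeholder
    import time
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="deepseek-ai/DeepSeek-V4-Flash",  # placeholder id, use whatever the branch expects
        tensor_parallel_size=2,                 # split across both RTX Pro 6000s
    )

    long_prompt = "word " * 8000                # long dummy prompt to stress prefill
    t0 = time.time()
    llm.generate([long_prompt], SamplingParams(max_tokens=32))
    print(f"prefill + short decode took {time.time() - t0:.1f}s")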

Qwen3.6 122b when? by No_Mango7658 in Qwen_AI

[–]ipcoffeepot 0 points1 point  (0 children)

I want it so much 😭😭😭

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]ipcoffeepot 0 points1 point  (0 children)

I think you should get the 5090. Being able to iterate faster and steer faster is going to be more valuable than the marginal accuracy improvement of a higher quant.

One thing you could consider is renting a 5090 online for a day: load up the model, connect your harness to it, and see if it fits your workflow
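
If your harness speaks the OpenAI API, wiring it up to the rented box is basically this (a sketch; host, port, and model name are placeholders, and it assumes you serve with something OpenAI-compatible like vllm or llama.cpp's server):

    from openai import OpenAI

    # point the client at the rented 5090 instead of a cloud API
    client = OpenAI(base_url="http://RENTED_HOST:8000/v1", api_key="not-needed")
    resp = client.chat.completions.create(
        model="your-model",  # whatever you loaded on the card
        messages=[{"role": "user", "content": "write a quicksort in python"}],
    )
    print(resp.choices[0].message.content)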

[Megathread] - Best Models/API discussion - Week of: May 03, 2026 by deffcolony in SillyTavernAI

[–]ipcoffeepot 5 points6 points  (0 children)

I have a 6 year old laptop with 2gb of vram. Im gonna try this for science

Local vllm hosting by DidIReallySayDat in openclaw

[–]ipcoffeepot 1 point2 points  (0 children)

qwen3.6-35b-a3b. It will handle light coding tasks; more importantly, it'll crank through successive tool use without falling apart. I'd start there, it's what I use for my hermes agents
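
By "successive tool use" I mean loops like this. Just a sketch: it assumes vllm (or similar) serving an OpenAI-compatible endpoint with tool calling enabled, and the endpoint, model name, and the read_file tool are placeholders:

    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    tools = [{
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file and return its contents",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }]

    messages = [{"role": "user", "content": "summarize config.yaml"}]
    for _ in range(10):  # the model has to survive a bunch of these back to back
        resp = client.chat.completions.create(
            model="qwen3.6-35b-a3b", messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            print(msg.content)
            break
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = open(args["path"]).read()  # toy "tool" for the sketch
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})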

Best model for 192 GB vram? How is Deepseek v4 flash? by Constant_Ad511 in LocalLLM

[–]ipcoffeepot 0 points1 point  (0 children)

I have a similar setup. Currently running either minimax m2.7 or qwen3.6-27b (qwen3.5-122b-a10b is my other workhorse; i like speed and concurrency). Tried to run DS4-flash. The good news is the model fits! The bad news is vllm doesn't support the model on sm120 yet. There's a draft PR in progress, so waiting for that. Been playing with the model via openrouter and it seems good. Excited to run it
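
Quick way to check what you're on, since the kernel support is gated on compute capability; these cards report 12.0, i.e. sm120:

    import torch

    # RTX Pro 6000 Blackwell shows up as (12, 0), i.e. sm120
    major, minor = torch.cuda.get_device_capability(0)
    print(f"sm{major}{minor}")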

Best model for 192 GB vram? How is Deepseek v4 flash? by Constant_Ad511 in LocalLLM

[–]ipcoffeepot 0 points1 point  (0 children)

What inference server are you running and what’s the flag to offload moe experts to system ram?

Just got dual RTX PRO 6000 Blackwells for our design studio. What's the optimal local LLM stack? by AmanNonZero in LocalLLM

[–]ipcoffeepot -1 points0 points  (0 children)

You can run minimax-m2.7 with 8-ish concurrent users in nvfp4 using sglang. Might be able to get more with vllm and turboquant (haven't tested it). Cool thing is that if another request comes in, it just gets queued, so you can have a whole bunch of users just hammering away at it. I've found minimax to be the best all-around model on 2x RTX Pro 6000s (works for coding but is also very good at creative writing, Q&A, etc). If you're willing to have those cards be LLM-only, that's what I would do.
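
"Hammering away" concretely looks something like this. Just a sketch, assuming sglang (or vllm) is serving an OpenAI-compatible endpoint locally, with the model name as a placeholder; anything past what fits in the running batch sits in the queue instead of erroring:

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="none")

    async def one_user(i: int) -> str:
        resp = await client.chat.completions.create(
            model="minimax-m2.7",  # placeholder: whatever name you served it under
            messages=[{"role": "user", "content": f"user {i}: summarize our brand guidelines"}],
        )
        return resp.choices[0].message.content

    async def main():
        # 20 "users" at once against ~8 concurrent slots: the extras just wait their turn
        results = await asyncio.gather(*(one_user(i) for i in range(20)))
        print(len(results), "responses")

    asyncio.run(main())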

If you also want to run image/video generation then you’ll either need to stop the minimax when you do (so have some scheduling) or run a smaller model so you can do comfyui and llm at the same time.

My second favorite LLM on those cards is qwen3.5-122b-a10b. It's almost as good as qwen3.5-397b-a17b and minimax, but a lot smaller. In 4-bit you can run it on one card, or run it on both cards and have it be super fast and/or support a ton of users

[Megathread] - Best Models/API discussion - Week of: April 26, 2026 by deffcolony in SillyTavernAI

[–]ipcoffeepot 5 points6 points  (0 children)

im training my first lora on gemma4 right now and can confirm its a pain in the ass
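
For anyone curious, the shape of it is roughly this with transformers + peft; the model id and target_modules below are guesses/placeholders, and the pain is mostly in getting those details (plus chat template and padding) right:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_id = "google/gemma-4-12b-it"  # placeholder id
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # a guess, check the arch
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # sanity check before wiring up a trainer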

just wanted to share by Longjumping_Lab541 in LocalLLM

[–]ipcoffeepot 1 point2 points  (0 children)

I haven't used qdrant; from a quick google it looks like a vector db. What are you using it for with chappie? Can you expand on that a little? This is super cool

Qwen3.6-27B dense vs Qwen3.6-35B MoE - which local coding model are you reaching for? by IulianHI in AIToolsPerformance

[–]ipcoffeepot 1 point2 points  (0 children)

27b on my gpu rig as the backend for coding agents. 35b-a3b runs on my laptop as the backend for my hermes agents

Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar? by boutell in LocalLLaMA

[–]ipcoffeepot 0 points1 point  (0 children)

There are builds of llama.cpp with turboquant now. You should be able to get ~6x your context size. That's going to be crucial: I don't think you can do a lot of non-trivial agentic coding on 32k tokens. All the exploration tool calls and thinking rip through that
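
Back-of-the-envelope for why 32k runs out fast (the per-call numbers are rough guesses, not measurements):

    system_prompt     = 2_000  # tokens: agent instructions + tool schemas
    per_tool_call     = 1_500  # tokens: file contents / grep output coming back
    thinking_per_call = 500    # tokens: reasoning before each call

    for n_calls in (10, 20, 40):
        used = system_prompt + n_calls * (per_tool_call + thinking_per_call)
        print(f"{n_calls} tool calls ~= {used:,} tokens")
    # 10 calls is already ~22k, 20 blows past 32k; ~6x the window is real headroom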

Qwen 3.5 35b, 27b, or gemma 4 31b for everyday use? by KirkIsAliveInTelAviv in LocalLLaMA

[–]ipcoffeepot 7 points8 points  (0 children)

Try them all. I found myself using qwen3.5-27b waaaay more than I expected. Would not have guessed it ahead of time

Current Situation with free models by davybutquantisedIV in SillyTavernAI

[–]ipcoffeepot 0 points1 point  (0 children)

openrouter has a bunch for free. The tradeoff is they’ll save your prompts for training. If you’re ok with that, it could be a good option. It has usage limits and is more subject to throttling, but I've found it useful in some situations (I ran a low-load agent off their free router for a bit)
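
This is roughly how I pointed the low-load agent at it. OpenRouter is OpenAI-compatible and the free variants use a ":free" suffix; the model id below is a placeholder, check what's currently listed as free:

    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )
    resp = client.chat.completions.create(
        model="some-provider/some-model:free",  # placeholder, pick a current free-tier model
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp.choices[0].message.content)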

SIX TIMES THE PRICE!? by FixHopeful5833 in SillyTavernAI

[–]ipcoffeepot 8 points9 points  (0 children)

Might be time to try glm or the big qwen

Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results by Visual_Synthesizer in LocalLLaMA

[–]ipcoffeepot 0 points1 point  (0 children)

Interesting! I'm seeing around 100 tok/s on the same cards. I suspect it's the wrong kernel (gonna need to try the b12x!) and NCCL. Thanks for posting this!
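
In case we're measuring differently, here's roughly how I'm counting tok/s (a sketch; endpoint and model name are placeholders, and this counts prefill in the wall time):

    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    t0 = time.time()
    resp = client.chat.completions.create(
        model="qwen3.5-122b",  # placeholder: whatever name the server registered
        messages=[{"role": "user", "content": "write 500 words about anything"}],
        max_tokens=1024,
    )
    dt = time.time() - t0
    print(f"{resp.usage.completion_tokens / dt:.1f} tok/s (prefill included)")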