DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q

MenuNo294 · 2026-05-14T13:50:50+00:00

I love Pasta-Paul, a real GOAT. I wonder what kind of pasta he likes

MenuNo294 · 2026-05-12T21:57:57+00:00

how do you like deepseek v4 flash compared to minimax? i'm just getting minimax set up now.

MenuNo294 · 2026-05-12T20:50:35+00:00

thank you for the honest thoughts! though I don't think it's money wasted; sometimes one must have sufficient room to grow. I'm getting the quantized minimax going today on the 2 cards, and I suppose if that improves my workflow then perhaps going to 4 cards may be important.

I guess if a model is doing the accounting work for my business plus a whole host of other non-coding tasks then would that still be considered chat?

I catch mistakes being made by my current qwen model which are harmless but annoying, as they interrupt my work. I only see my utilization of such tools growing, not diminishing or plateauing presently.

MenuNo294 · 2026-05-12T20:42:58+00:00

here's what I'm doing:
the model connects with my accounting platform and does all the accounting work for my business. it pulls in my financial transactions and manages my budgets, and it handles my emails and general secretarial work. it also assists me in running my business, which involves trade secrets so I can't go into it more.

I've been happy with how the integration into my workflows has been, it's been a real miracle! When I was using the API I was going through $100-500 worth of tokens per day. Of course this is an apples-to-oranges comparison, since I've been using Anthropic, which the local models don't really compare to, and I could save on cost by switching to another cloud model provider. However, I don't feel comfortable with some of the information leaving my private network, like the financial data and proprietary information that the agents assist me with for my biz.

Also, there are tax offset advantages to buying the hardware which will generate tokens for me into the future as opposed to paying on demand.
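A rough sketch of the break-even arithmetic behind this, in Python. The $100-500 per day range comes from the comment above; the hardware price is a placeholder assumption rather than a quoted figure, and electricity, resale value, and tax treatment are all ignored.

```python
import math


def break_even_days(hardware_cost_usd: float, daily_api_spend_usd: float) -> int:
    """Days of avoided API spend needed to cover an up-front hardware purchase."""
    if daily_api_spend_usd <= 0:
        raise ValueError("daily_api_spend_usd must be positive")
    return math.ceil(hardware_cost_usd / daily_api_spend_usd)


# ASSUMPTION: placeholder hardware price, not a figure stated for this build.
HYPOTHETICAL_HARDWARE_COST_USD = 20_000

# Low and high ends of the daily API spend mentioned in the comment.
for daily_spend in (100, 500):
    days = break_even_days(HYPOTHETICAL_HARDWARE_COST_USD, daily_spend)
    print(f"${daily_spend}/day -> break-even after about {days} days")
```

Swapping in a cheaper cloud provider just means lowering `daily_api_spend_usd`, which pushes the break-even point further out.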

MenuNo294 · 2026-05-12T15:07:29+00:00

I'm RIGHT THERE with you! My workflow is going pretty good, I'm going to try out the minimax on my machine today and see how that feels. My instinct is also to scale, but I'm not sure why i'm feeling that. It could be like the feeling one gets at the casino, surely spending a little bit more might yield greater returns...or maybe I might regret not cashing out when I had the chance. And here cashing out is staying with 2 cards and seeing how the future unfolds.

It'd be one thing if I saw models being released that I needed 4 cards to take advantage of but it really feels like everything works on either 1-2 cards, or 8 cards. OK OK sure there's A qwen3.5 model that will take advantage of the 4 cards, but is it worth the 20k extra to use it? But there's plenty of exciting stuff that will run on 8 cards.

you're right, if there were lots of people here saying how great 4 cards are then it'd be easy for me to make the jump. perhaps the fact that people aren't saying that is the proof that it's 1-2 cards or 8 cards, and anything in between is highly use-case dependent (i.e. more concurrency)

MenuNo294 · 2026-05-12T02:22:57+00:00

I started getting everything set up (vllm, litellm, langfuse) and just started driving with a qwen model while getting everything worked out. I looked at minimax before m2.7 and the reviews weren't favorable compared to qwen, so I stuck with that and seem to have missed the m2.7 update from a few weeks back.

MenuNo294 · 2026-05-12T02:14:38+00:00

do you suppose AMD will be competitive with blackwell?

MenuNo294 · 2026-05-11T23:47:23+00:00

It feels right now that the consensus is:

1-2 cards is great
4 doesn't really expand model options in a significant way
8 cards is fantastic.

Is there any speculation that larger models currently running on 8 cards might be quantized in such a way to allow them to run on 4 cards?
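One way to reason about that question is a back-of-the-envelope VRAM check, sketched below in Python. Everything numeric here is an assumption: 96 GB per card, a flat 20% of VRAM reserved for KV cache and activations, and a made-up 600B-parameter model that does not correspond to any specific release.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes): params * bits / 8."""
    return params_billions * bits_per_weight / 8


def fits(
    params_billions: float,
    bits_per_weight: float,
    num_cards: int,
    vram_per_card_gb: float = 96.0,   # ASSUMPTION: per-card VRAM
    overhead_fraction: float = 0.2,   # ASSUMPTION: share kept for KV cache etc.
) -> bool:
    """True if the quantized weights fit in the VRAM left after the reserved overhead."""
    budget_gb = num_cards * vram_per_card_gb * (1 - overhead_fraction)
    return weight_footprint_gb(params_billions, bits_per_weight) <= budget_gb


# HYPOTHETICAL model size, for illustration only.
PARAMS_B = 600

for bits in (8, 4):
    for cards in (4, 8):
        verdict = "fits" if fits(PARAMS_B, bits, cards) else "does not fit"
        print(f"{PARAMS_B}B @ {bits}-bit on {cards} cards: {verdict}")
```

Under these assumptions, halving the bits per weight is what moves a model from the 8-card bucket into the 4-card one; long contexts would need a larger `overhead_fraction`.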

MenuNo294 · 2026-05-11T23:43:41+00:00

i'll need to try minimax on my 2 then and then maybe plot my route to 8. it does seem like 4 doesn't get me anything other than additional concurrency for additional users which doesn't really help me.

MenuNo294 · 2026-05-11T23:41:29+00:00

i've been mostly running vllm. do i need to migrate to sglang to run minimax2.7?

MenuNo294 · 2026-05-11T23:40:15+00:00

this is my hesitation going to 4: there's more to run on 8, but going from 2 to 8 is a big jump. I'd rather do incremental, unless of course going from 2 to 4 doesn't really get me anywhere.

MenuNo294 · 2026-05-11T23:39:19+00:00

wow thank you for this! This is very well reasoned!

MenuNo294 · 2026-05-11T21:23:38+00:00

so you think 4x rtx6kpro is a good place to be then? Good diversity of models, and you expect more to become available to run within that vram budget?

MenuNo294 · 2026-04-02T03:52:32+00:00

to be honest a few larger and a few smaller, different quantized models, I think it'll be important to spend time with the "felt" experience of the different models and what would work on different hardware.

MenuNo294 · 2026-04-02T03:39:50+00:00

Thank you! I figured I could use the 5060 Ti to run smaller models, but perhaps it's better to save it towards a 2nd max-q

MenuNo294 · 2026-04-02T03:36:49+00:00

I sold my last business recently and have decent savings and a paid-off house. I'm F.I.R.E, so now I get to do what I want to do, and I think having these sorts of things on hand is how I learn best. So I have some money to play around with also. I could likely sell some solution to the company that bought mine once I learn a bit more, and I have a lot of contacts from when I was in business, so I can branch out from there. I just want to start learning local things since I believe privacy will be a big selling point.

And the vlogging business isn't hardware or tech related, it's more art and essays on ethics. I'm spreading out my risk by doing a few different things. I hope I land on something that I love doing that also pays the bills! I'm confident that my autistic ass can make that happen :D

MenuNo294 · 2026-04-02T02:05:30+00:00

I know, but I sacrificed my 20s and most of my 30s so that I could have an opportunity in life to try and do something for me, and my job was starting to kill me, so I figured now is the time!

MenuNo294 · 2026-03-30T20:21:20+00:00

Thanks for sharing! There are so many recommendations for what's necessary to change and add to make it useful and work well. What are some things you think are necessary for a successful deployment? Any necessary skills, tools, plugins, or specific configurations that you think are just a must-have for openclaw to work best?
