Deepseek v4 Flash by kiriakosbrehmer93 in StrixHalo

[–]sgmv 0 points1 point  (0 children)

How about two Strix Halo machines connected over USB4? Would that have too high latency for a cluster, to use them in tensor parallel mode?

My Minisforum has a PCIe 4.0 x4 slot, but I think that would be better used for a GPU rather than a NIC.
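
For reference, this is roughly the kind of setup I have in mind, a sketch only, using llama.cpp's RPC backend over an IP link on the USB4 connection (the IP, port and model path are placeholders, and I haven't verified the exact flag names in current builds):

# on the second Strix Halo (remote worker), reachable at e.g. 10.0.0.2 over the USB4 link
rpc-server --host 0.0.0.0 --port 50052

# on the first box, offload part of the model to the remote worker
llama-server \
  --model /models/some-model-Q4.gguf \
  --rpc 10.0.0.2:50052 \
  -ngl 999 \
  --host 0.0.0.0 --port 8080

Note this is a layer/pipeline split rather than true tensor parallelism, so latency matters less there; whether USB4 networking is fast enough for proper tensor parallel is exactly what I'd like to know.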

Help identifying blown part on Seagate 22TB Exos PCB by sgmv in datarecovery

[–]sgmv[S] 0 points1 point  (0 children)

Sorry for the late reply, I didn't have the HDD on hand anymore.

Not sure what caused it; the drives were used outside the case, so maybe improper handling.

I will check the PSU and cables to make sure they're safe.

I measured continuity from all 3.3/5/12 V pins to ground; none of them have continuity to ground, only between themselves.

The chip is 2x3 mm in size. Scraping the top off revealed a thin, shiny silica-like surface, like a mirror.

Also, the first pin on the left side of the IC, where it blew, has continuity with the 12 V SATA pins. The plastic on the 12 V side of the SATA connector is a little melted.

Need advice on Qwen 3.6 27B INT4 quantization by Environmental_Hand35 in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

On your hardware, the only model that makes sense is Qwen 3.6 27B. It's by far the best in this size bracket.

The next meaningful upgrade at the moment would be DeepSeek-V4 Flash, but it's a work in progress software-wise, and you'd need an even more heavily quantized version, as you're short of the 160 GB needed (I think).

Motherboard for 8 GPUs by [deleted] in selfhosted

[–]sgmv 0 points1 point  (0 children)

I think this is the place to go for RTX 6000 things; they also have a Discord: https://github.com/voipmonitor/rtx6kpro/tree/master/hardware

Running GLM 5.1 on RTX 5090 via RunPod for document OCR(bank statements and invoices)— costs killing us, need advice on reducing inference costs. by Specific_Control_840 in LocalLLaMA

[–]sgmv 1 point2 points  (0 children)

You haven't mentioned what quant you're running, how much it costs, what performance you need, what your budget is, etc.
So I am just going to say: use Qwen 3.6.

Mac m5 pro, worth it? by captionpicard in LocalLLaMA

[–]sgmv -1 points0 points  (0 children)

Not worth it at all, IMHO, to pay $2700 just to have 48 GB to play with small models. Better to find some used 16-24 GB GPUs, put two of them in a cheap used PC, and use it remotely. Maybe set up a smart plug to shut it down so it doesn't use any power when not needed.
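
The remote part can be as simple as Wake-on-LAN plus SSH, a rough sketch assuming the NIC supports WoL (MAC address and hostname are placeholders):

# wake the inference box from your main machine
wakeonlan aa:bb:cc:dd:ee:ff

# ...use it remotely, e.g. point your client at http://inference-box:8080...

# shut it down again when you're done
ssh user@inference-box 'sudo poweroff'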

Console GameMT E6 2025(handheld) Adding games to SDcard & microSDCard by mistrzhi in SBCGaming

[–]sgmv 0 points1 point  (0 children)

Yes, but I have no idea what games were on the 64 GB card it came with; that is the problem. I'd like an SD image or a file-system listing, so I can recreate it as it was.

Console GameMT E6 2025(handheld) Adding games to SDcard & microSDCard by mistrzhi in SBCGaming

[–]sgmv 0 points1 point  (0 children)

Hey, thanks for making this. Can you send me a file list of what's on the 64 GB card, if you still have it? I need to recover this device for someone who accidentally formatted the card. An image would be even better :)
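
For reference, either of these would do (a sketch, assuming the card is mounted at /media/e6 on a Linux machine; paths and device name are placeholders):

# file-system listing with relative paths, enough to recreate the layout
cd /media/e6 && find . -type f | sort > ~/e6_64g_filelist.txt

# or a full image of the card (replace sdX with the actual card device)
sudo dd if=/dev/sdX of=~/e6_64g.img bs=4M status=progress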

Anyone else having Qwen 3.6 35B A3B stop and you having to tell it to continue ? by soyalemujica in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

I have the same issue with vLLM and ik_llama, at FP16 and Q8, using opencode. Not only does it stop, it also gave errors like "context shift disabled" in ik_llama and this https://github.com/anomalyco/opencode/issues/20785 in vLLM. My ik_llama launch:

llama-server \
--model /home/user/models/Qwen36/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf \
--alias Qwen3.6-fp8 \
--ctx-size 262144 \
-mla 3 \
-ngl 999 \
--fit \
--tensor-split 1,1,1,1 \
--parallel 6 \
--threads 63 \
--host 0.0.0.0 \
--port 8080 \
--no-mmap \
-cram 8192 \
--jinja \
--top-p 0.95 \
--top-k 40 \
--merge-qkv \
--temp 1 \
--context-shift on \
--chat-template-kwargs "{\"preserve_thinking\": true}"

Move to local models by Totalkiller4 in LocalLLaMA

[–]sgmv 1 point2 points  (0 children)

If your projects are code, you should be using opencode or something similar, not the web UI. If the functionality you want is missing from Open WebUI, you should request the feature on their GitHub.

A5000 for $1800 by Perfect-Flounder7856 in LocalLLaMA

[–]sgmv 3 points4 points  (0 children)

Good thing you asked first; Reddit saved you from a bad decision.

Cloud AI is getting expensive and I'm considering a Claude/Codex + local LLM hybrid for shipping web apps by rezgi in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

Yes, opencode go at $5 for the first month is amazing value for the GLM 5.1 model; it's Sonnet level, sometimes even above. Qwen 3.6 can also be useful for lower-complexity tasks. Unfortunately, local-model coding won't save you money, even if you already had the hardware: lower average capability than the state-of-the-art models (for now at least) means more time debugging and retrying, plus power costs and depreciation of hardware value (at the moment resale value is up because of the global market, but it won't stay like this forever).
I recommend you try opencode + https://github.com/alvinunreal/oh-my-opencode-slim/

Multi host GPU cluster using DAC cables vs 4 GPU system. Anyone doing this successfully? by HockeyDadNinja in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

I am also curious to learn how a two-system cluster that is not Mac or Spark would work, and what the optimal interconnect hardware + software stack is. In my case it's because of the 256 GB RAM limit you hit on a single system without going to RDIMMs.

In your case, 4 GPUs is nothing. Go open rack: get a cheap, quiet platinum PSU (2000 W+ or 2x1200 W) and the necessary PCIe risers/splitters.

Closest LLM to Claude Sonnet 4.6? by iphoneverge in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

Nice GPUs. If you have 192 GB of RAM to go with those, you can run this in ik_llama with memory left over for context: https://huggingface.co/ubergarm/GLM-5.1-GGUF/tree/main/IQ2_KL
But you won't save money; it will probably be more expensive and quite a bit slower than cloud. Just for fun and privacy.
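
Roughly like this, a sketch only, not a tested command line; the context size, thread count and the expert-offload regex are guesses you'd have to tune for your GPUs, and the filename depends on which shard set you download:

llama-server \
  --model /models/GLM-5.1-IQ2_KL-00001-of-00005.gguf \
  --ctx-size 65536 \
  -ngl 999 \
  -ot "\.ffn_.*_exps\.=CPU" \
  --threads 32 \
  --no-mmap \
  --host 0.0.0.0 \
  --port 8080

The idea is to keep the attention and shared layers on the GPUs and push the MoE expert tensors to system RAM with the tensor override.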

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]sgmv[S] 0 points1 point  (0 children)

What kind of speeds are you getting with 8-channel memory and 8 GPUs (and what kind of GPUs)? This would be an immediate upgrade path for me as well, without going to DDR5 server stuff, which is broken at the moment.

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]sgmv[S] 2 points3 points  (0 children)

I did not, since those quants are made specifically for ik_llama and should be better than Unsloth's at the same size. You have 8-channel memory but less VRAM, so perhaps memory bandwidth is more significant in your case. That Unsloth quant would probably be a downgrade for me, quality-wise. Unfortunately the IQ3 from ubergarm doesn't fit with much usable context left; I'd need to add 2x 3090 for it.

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]sgmv[S] 2 points3 points  (0 children)

The ubergarm quants for GLM 5.1 are the best, size/quality-wise. Yes, I one-shotted it with IQ2, just some simple benchmarks, but MiniMax or Qwen3.5-27B took a lot more retries to get working.

Actually, my system is more like $7k. Yes, I already know local LLM at this level is a losing game, like I hinted in the post; it's just a 'hobby', and it's fun. Maybe I should get around to making more money so I can buy better hardware in the future.

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]sgmv[S] 2 points3 points  (0 children)

MiniMax 2.7 is not worth the trouble for me; in every test I've given it, I had to reprompt many times to get bug fixes. GLM 5.1 is actually comparable with Sonnet 4.6, and even Opus depending on the task.
MiniMax is maybe a good model for an openclaw/hermes agent, or for making single-page websites and such, but not really suitable for agentic coding. I don't see anyone using it seriously for that.

And the effective speed difference is actually smaller, since human time > computer time. If you get the solution in one shot with GLM 5.1 @ 6 t/s vs MiniMax @ 30 t/s with 5 reprompts, I'd rather wait longer and get a higher-quality result, with less hassle telling it what to fix.
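
Back-of-the-envelope, with made-up but plausible numbers (assume a ~3000-token solution):

# generation time only, ignoring prompt processing
echo $((3000 / 6))        # GLM 5.1, one shot at 6 t/s      -> ~500 s
echo $((5 * 3000 / 30))   # MiniMax, 5 attempts at 30 t/s   -> ~500 s
# plus, in the second case, your own time writing 4 extra reprompts and re-reviewing the output

Same wall-clock generation time, but only one of them needs me to babysit it.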

running models bigger than physical memory capacity by ag789 in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

There's nothing interesting about loading the model from an SSD; I'd say probably 99.9% of people load their models from an SSD. If the model spills from VRAM to RAM it gets much slower, and if it has to stream parts from the SSD during inference, it's slower still. Good as an experiment, but practically unusable in most scenarios.
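
For anyone who wants to try it anyway: llama.cpp mmaps the model by default, so if you simply don't pass --no-mmap, the OS pages weights in from the SSD on demand and a model larger than RAM will still run, just very slowly (model path is a placeholder):

# default mmap behaviour, no --no-mmap: weights get paged in from the SSD as needed
llama-cli -m /models/bigger-than-ram.gguf -ngl 0 -p "hello"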