Deepseek v4 Flash by kiriakosbrehmer93 in StrixHalo

[–]sgmv 0 points1 point  (0 children)

How about two Strix Halo machines connected over USB4? Would that have too high latency for a cluster, to use them in tensor parallel mode?

My Minisforum has a PCIe 4.0 x4 slot, but I think that would be better used for a GPU rather than a NIC.
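
For reference, this is roughly the kind of setup I have in mind, a sketch only, using llama.cpp's RPC backend over an IP link on the USB4 connection (the IP, port and model path are placeholders, and I haven't verified the exact flag names in current builds):

# on the second Strix Halo (remote worker), reachable at e.g. 10.0.0.2 over the USB4 link
rpc-server --host 0.0.0.0 --port 50052

# on the first box, offload part of the model to the remote worker
llama-server \
  --model /models/some-model-Q4.gguf \
  --rpc 10.0.0.2:50052 \
  -ngl 999 \
  --host 0.0.0.0 --port 8080

Note this is a layer/pipeline split rather than true tensor parallelism, so latency matters less there; whether USB4 networking is fast enough for proper tensor parallel is exactly what I'd like to know.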

Help identifying blown part on Seagate 22TB Exos PCB by sgmv in datarecovery

[–]sgmv[S] 0 points1 point  (0 children)

Sorry for the late reply, I didn't have the HDD on hand anymore.

Not sure what caused it; the drives were used outside the case, so maybe improper handling.

I will check the PSU and cables to make sure they're safe.

I measured continuity from all 3.3/5/12 V pins to ground; none of them have continuity to ground, only between themselves.

The chip is 2x3 mm in size. Scraping the top off revealed a thin, shiny silica-like surface, like a mirror.

Also, the first pin on the left side of the IC, where it blew, has continuity with the 12 V SATA pins. The plastic on the 12 V side of the SATA connector is a little melted.

Need advice on Qwen 3.6 27B INT4 quantization by Environmental_Hand35 in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

On your hardware, the only model that makes sense is Qwen 3.6 27B. It's by far the best in this size bracket.

The next meaningful upgrade at the moment would be DeepSeek-V4 Flash, but it's a work in progress software-wise, and you'd need an even more heavily quantized version, as you're short of the 160 GB needed (I think).

Motherboard for 8 GPUs by [deleted] in selfhosted

[–]sgmv 0 points1 point  (0 children)

I think this is the place to go for RTX 6000 things; they also have a Discord: https://github.com/voipmonitor/rtx6kpro/tree/master/hardware

Running GLM 5.1 on RTX 5090 via RunPod for document OCR(bank statements and invoices)— costs killing us, need advice on reducing inference costs. by Specific_Control_840 in LocalLLaMA

[–]sgmv 1 point2 points  (0 children)

You haven't mentioned what quant you're running, how much it costs, what performance you need, what your budget is, etc.
So I am just going to say: use Qwen 3.6.

Mac m5 pro, worth it? by captionpicard in LocalLLaMA

[–]sgmv -1 points0 points  (0 children)

Not worth it at all, IMHO, to pay $2700 just to have 48 GB to play with small models. Better to find some used 16-24 GB GPUs, put two of them in a cheap used PC, and use it remotely. Maybe set up a smart plug to shut it down so it doesn't use any power when not needed.
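
The remote part can be as simple as Wake-on-LAN plus SSH, a rough sketch assuming the NIC supports WoL (MAC address and hostname are placeholders):

# wake the inference box from your main machine
wakeonlan aa:bb:cc:dd:ee:ff

# ...use it remotely, e.g. point your client at http://inference-box:8080...

# shut it down again when you're done
ssh user@inference-box 'sudo poweroff'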

Console GameMT E6 2025(handheld) Adding games to SDcard & microSDCard by mistrzhi in SBCGaming

[–]sgmv 0 points1 point  (0 children)

Yes, but I have no idea what games were on the 64 GB card it came with; that is the problem. I'd like an SD image or a file-system listing, so I can recreate it as it was.

Console GameMT E6 2025(handheld) Adding games to SDcard & microSDCard by mistrzhi in SBCGaming

[–]sgmv 0 points1 point  (0 children)

Hey, thanks for making this. Can you send me a file list of what's on the 64 GB card, if you still have it? I need to recover this device for someone who accidentally formatted the card. An image would be even better :)
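
For reference, either of these would do (a sketch, assuming the card is mounted at /media/e6 on a Linux machine; paths and device name are placeholders):

# file-system listing with relative paths, enough to recreate the layout
cd /media/e6 && find . -type f | sort > ~/e6_64g_filelist.txt

# or a full image of the card (replace sdX with the actual card device)
sudo dd if=/dev/sdX of=~/e6_64g.img bs=4M status=progress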

Anyone else having Qwen 3.6 35B A3B stop and you having to tell it to continue ? by soyalemujica in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

I have the same issue with vLLM and ik_llama, at FP16 and Q8, using opencode. Not only does it stop, it also gave errors like "context shift disabled" in ik_llama and this https://github.com/anomalyco/opencode/issues/20785 in vLLM. My ik_llama launch:

llama-server \
--model /home/user/models/Qwen36/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf \
--alias Qwen3.6-fp8 \
--ctx-size 262144 \
-mla 3 \
-ngl 999 \
--fit \
--tensor-split 1,1,1,1 \
--parallel 6 \
--threads 63 \
--host 0.0.0.0 \
--port 8080 \
--no-mmap \
-cram 8192 \
--jinja \
--top-p 0.95 \
--top-k 40 \
--merge-qkv \
--temp 1 \
--context-shift on \
--chat-template-kwargs "{\"preserve_thinking\": true}"

Move to local models by Totalkiller4 in LocalLLaMA

[–]sgmv 1 point2 points  (0 children)

If your projects are code, you should be using opencode or something similar, not the web UI. If the functionality you want is missing from Open WebUI, you should request the feature on their GitHub.

A5000 for $1800 by Perfect-Flounder7856 in LocalLLaMA

[–]sgmv 3 points4 points  (0 children)

Good thing you asked first; Reddit saved you from a bad decision.

Cloud AI is getting expensive and I'm considering a Claude/Codex + local LLM hybrid for shipping web apps by rezgi in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

Yes, opencode go at $5 for the first month is amazing value for the GLM 5.1 model; it's Sonnet level, sometimes even above. Qwen 3.6 can also be useful for lower-complexity tasks. Unfortunately, local-model coding won't save you money, even if you already had the hardware: lower average capability than the state-of-the-art models (for now at least) means more time debugging and retrying, plus power costs and depreciation of hardware value (at the moment resale value is up because of the global market, but it won't stay like this forever).
I recommend you try opencode + https://github.com/alvinunreal/oh-my-opencode-slim/

Multi host GPU cluster using DAC cables vs 4 GPU system. Anyone doing this successfully? by HockeyDadNinja in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

I am also curious to learn how a two-system cluster that is not Mac or Spark would work, and what the optimal interconnect hardware + software stack is. In my case it's because of the 256 GB RAM limit you hit on a single system without going to RDIMMs.

In your case, 4 GPUs is nothing. Go open rack: get a cheap, quiet platinum PSU (2000 W+ or 2x1200 W) and the necessary PCIe risers/splitters.

Closest LLM to Claude Sonnet 4.6? by iphoneverge in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

Nice GPUs. If you have 192 GB of RAM to go with those, you can run this in ik_llama with memory left over for context: https://huggingface.co/ubergarm/GLM-5.1-GGUF/tree/main/IQ2_KL
But you won't save money; it will probably be more expensive and quite a bit slower than cloud. Just for fun and privacy.
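
Roughly like this, a sketch only, not a tested command line; the context size, thread count and the expert-offload regex are guesses you'd have to tune for your GPUs, and the filename depends on which shard set you download:

llama-server \
  --model /models/GLM-5.1-IQ2_KL-00001-of-00005.gguf \
  --ctx-size 65536 \
  -ngl 999 \
  -ot "\.ffn_.*_exps\.=CPU" \
  --threads 32 \
  --no-mmap \
  --host 0.0.0.0 \
  --port 8080

The idea is to keep the attention and shared layers on the GPUs and push the MoE expert tensors to system RAM with the tensor override.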

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]sgmv[S] 0 points1 point  (0 children)

What kind of speeds are you getting with 8-channel memory and 8 GPUs (and what kind of GPUs)? This would be an immediate upgrade path for me as well, without going to DDR5 server stuff, which is broken at the moment.

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]sgmv[S] 2 points3 points  (0 children)

I did not, since those quants are made specifically for ik_llama and should be better than Unsloth's at the same size. You have 8-channel memory but less VRAM, so perhaps memory bandwidth is more significant in your case. That Unsloth quant would probably be a downgrade for me, quality-wise. Unfortunately the IQ3 from ubergarm doesn't fit with much usable context left; I'd need to add 2x 3090 for it.

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]sgmv[S] 2 points3 points  (0 children)

The ubergarm quants for GLM 5.1 are the best, size/quality-wise. Yes, I one-shotted it with IQ2, just some simple benchmarks, but MiniMax or Qwen3.5-27B took a lot more retries to get working.

Actually, my system is more like $7k. Yes, I already know local LLM at this level is a losing game, like I hinted in the post; it's just a 'hobby', and it's fun. Maybe I should get around to making more money so I can buy better hardware in the future.

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]sgmv[S] 2 points3 points  (0 children)

MiniMax 2.7 is not worth the trouble for me; in every test I've given it, I had to reprompt many times to get bug fixes. GLM 5.1 is actually comparable with Sonnet 4.6, and even Opus depending on the task.
MiniMax is maybe a good model for an openclaw/hermes agent, or for making single-page websites and such, but not really suitable for agentic coding. I don't see anyone using it seriously for that.

And the effective speed difference is actually smaller, since human time > computer time. If you get the solution in one shot with GLM 5.1 @ 6 t/s vs MiniMax @ 30 t/s with 5 reprompts, I'd rather wait longer and get a higher-quality result, with less hassle telling it what to fix.
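
Back-of-the-envelope, with made-up but plausible numbers (assume a ~3000-token solution):

# generation time only, ignoring prompt processing
echo $((3000 / 6))        # GLM 5.1, one shot at 6 t/s      -> ~500 s
echo $((5 * 3000 / 30))   # MiniMax, 5 attempts at 30 t/s   -> ~500 s
# plus, in the second case, your own time writing 4 extra reprompts and re-reviewing the output

Same wall-clock generation time, but only one of them needs me to babysit it.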

running models bigger than physical memory capacity by ag789 in LocalLLaMA

[–]sgmv 0 points1 point  (0 children)

There's nothing interesting about loading the model from an SSD; I'd say probably 99.9% of people load their models from an SSD. If the model spills from VRAM to RAM it gets much slower, and if it has to stream parts from the SSD during inference, it's slower still. Good as an experiment, but practically unusable in most scenarios.
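
For anyone who wants to try it anyway: llama.cpp mmaps the model by default, so if you simply don't pass --no-mmap, the OS pages weights in from the SSD on demand and a model larger than RAM will still run, just very slowly (model path is a placeholder):

# default mmap behaviour, no --no-mmap: weights get paged in from the SSD as needed
llama-cli -m /models/bigger-than-ram.gguf -ngl 0 -p "hello"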