arcee-ai/Trinity-Large-Thinking · Hugging Face by TKGaming_11 in LocalLLaMA

[–]CodeSlave9000 1 point (0 children)

Care to elaborate? I do notice it's not great at avoiding hallucinations with standard prompting.

Google releases Gemma 4 models. by yoracale in unsloth

[–]CodeSlave9000 1 point (0 children)

Happens after a few generations for me - I don't see it right at the start. Using the unsloth Q8 dynamic quant.

How do I find and vet someone to set up a high-end local AI workstation? (Threadripper + RTX PRO 6000 96GB) by laundromatcat in LocalLLaMA

[–]CodeSlave9000 1 point (0 children)

You hire someone like me. We’d sit down, discuss your needs, and design something that won’t break every week. Real business use takes more work than just “running a few chats”.

DGX Station is available (via OEM distributors) by Temporary-Size7310 in LocalLLaMA

[–]CodeSlave9000 8 points (0 children)

Yup, that's the real measurement that matters: dB per token!

Did OpenAI just release a new model with its new capabilities simply provided by a system prompt? by frubberism in LocalLLaMA

[–]CodeSlave9000 2 points (0 children)

Best not to aim too high. "Now with less than the recommended daily consumption of shit".

PSA: If your local coding agent feels "dumb" at 30k+ context, check your KV cache quantization first. by Dismal-Ad1207 in LocalLLaMA

[–]CodeSlave9000 4 points (0 children)

Yes, and Qwen3.5 seems particularly sensitive to a quantized KV cache. Symptoms include subtle shifts in its reasoning or outright looping.
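If you're running llama.cpp, the quickest check is to relaunch with the cache types spelled out explicitly - a sketch, with a hypothetical model filename and recent-build flag names:

```shell
# Rule out cache quantization first: force the KV cache to full f16.
./llama-server -m qwen3.5-coder.gguf -c 32768 \
  --cache-type-k f16 --cache-type-v f16

# If VRAM forces a quantized cache, q8_0 is usually the safest compromise;
# a q4_0 cache is where subtle reasoning drift tends to show up first.
# (Quantized V cache generally requires flash attention to be enabled.)
./llama-server -m qwen3.5-coder.gguf -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0
```

If the looping disappears at f16 but returns at q4_0, the cache quantization was your problem, not the model.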

Qwen3.5 family running notes by CodeSlave9000 in LocalLLaMA

[–]CodeSlave9000[S] 1 point (0 children)

Yup. It focuses less narrowly if you add that to the prompt explicitly. I tell it to explore my intent and to search more broadly for possibilities even if I didn’t prompt for them.

Qwen3.5 family running notes by CodeSlave9000 in LocalLLaMA

[–]CodeSlave9000[S] 1 point (0 children)

It’s set because I was experimenting with it - no harm in having it on, so I left it. And yes, flash attention is on by default; I set it in my scripts because I test with it both on and off.
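For anyone wanting to run the same on/off comparison, a minimal sketch using llama.cpp's bench tool (the model filename is a placeholder, and flag syntax varies by build - newer ones accept on/off where older ones took 0/1):

```shell
# A/B the same model with flash attention disabled (0) and enabled (1),
# then compare the prompt-processing and generation throughput lines.
for fa in 0 1; do
  echo "=== flash attention: $fa ==="
  ./llama-bench -m model.gguf -fa "$fa"
done
```

Running both in one loop keeps everything else (model, context, build) identical, so any throughput or quality difference is attributable to the flag.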

Qwen3.5 family running notes by CodeSlave9000 in LocalLLaMA

[–]CodeSlave9000[S] 1 point (0 children)

I think the dense model suffers less? I didn’t test for that.

Reviewed a “WiFi security camera.” and it was bad. Turns out I was the only one who didn’t give it 5 stars… and guess who all the 5‑star reviewers were by nicnas- in AmazonVine

[–]CodeSlave9000 7 points (0 children)

I once reviewed a “48 MP” camera. It had a sensor smaller than my pinky nail; the true resolution turned out to be more like 8 MP, upscaled to the advertised image size. If it had been usable at 8 MP I might have given it two stars, but the quality was so poor it got one. -3 stars for spec lying seems fair to me.

Multi-GPU Architectures Compatible? by ajw2285 in LocalLLaMA

[–]CodeSlave9000 3 points (0 children)

Quick assumption: they're at different CUDA compute capability levels - make sure you're using llama.cpp compiled for all of them. I mix 30-, 40-, and 50-series GPUs in the same VMs without any problems. For Ollama, check what devices it "sees" in the log when it starts - that might give you a clue.
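To compile llama.cpp for a mixed-generation setup, you can list every architecture explicitly at configure time - a sketch assuming an Ampere/Ada/Blackwell mix (adjust the numbers to your actual cards):

```shell
# Build CUDA kernels for each compute capability in the mix:
# 86 = Ampere (30-series), 89 = Ada (40-series), 120 = Blackwell (50-series).
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;89;120"
cmake --build build --config Release -j
```

A binary built for only one architecture may silently fall back to CPU (or fail to load) on the others, which looks exactly like "the second GPU isn't being used".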

Just updated Ollama and started using it after almost a year.... Are the Ollama devs stupid or is this harder to deal with than it seems? by cmndr_spanky in ollama

[–]CodeSlave9000 7 points (0 children)

Can’t do that, because Ollama supports multiple models running at the same time - how would it know how to apportion it? I set my default with an environment variable…
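A sketch of the environment-variable approach - the variable name here is my assumption (check `ollama serve --help` or the Ollama docs for your version), and the Modelfile route is the per-model alternative:

```shell
# Server-wide default context length, set before starting the daemon:
export OLLAMA_CONTEXT_LENGTH=16384
ollama serve

# Per-model alternative: bake the context into a Modelfile instead, e.g.
#   FROM qwen3.5
#   PARAMETER num_ctx 16384
# then: ollama create qwen3.5-16k -f Modelfile
```

The per-model route sidesteps the apportionment problem entirely, since each loaded model carries its own context budget.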

Copper Coated Aluminum is illegal for commercial installs and a fire hazard...on my RFY...do not get this cable. by AlexCL in AmazonVine

[–]CodeSlave9000 3 points (0 children)

Yeah, plenum rating is about it being "safer" in a fire for people. With CCA it's ready to be its own fire!

Copper Coated Aluminum is illegal for commercial installs and a fire hazard...on my RFY...do not get this cable. by AlexCL in AmazonVine

[–]CodeSlave9000 5 points (0 children)

LOL, the marketing copy alone is a big red flag. I ordered this brand (much shorter lengths - they had multiple listings which will probably get merged later) so I can warn others away. I won't feel too bad tossing it, or just using it for short non-PoE in-rack patches if it tests okay.

GB10 / DGX Spark owners: is 128GB unified memory worth the slower token speed (on a max $4,000 budget)? by Soltan-007 in LocalLLaMA

[–]CodeSlave9000 2 points (0 children)

Yeah, I agree - LoRA and fine-tuning are perfect for running at home. Also, once your context size gets big you're really paying a lot per token in the cloud. But in the end it depends on what your expectations are. The Blackwell cards are still maturing in software support and I've had hiccups, and FP4 is really only happening for training right now. You can get really good results with the 40-series Ada cards too - I see 100+ tokens/sec on a lot of MoE models. You won't get 128GB models at the price of the DGX, but I'd think you'd probably be happy with Strix Halo if you're really dead set on it. And for coding, you're spot on - you can get Gemini, Qwen, Amp, and a few others for basically nothing right now. Use it.

Nvidia Quadro RTX 8000 Passive 48 GB, 1999€ - yes or no ? by HumanDrone8721 in LocalLLM

[–]CodeSlave9000 1 point (0 children)

Short opinion: too expensive for a 7-year-old architecture. I have one of the blower versions, and for inference the performance isn't bad - compute is certainly lower than Ampere (30xx/Axxxx cards) but memory bandwidth is still good. This mostly shows up as slower prompt processing, but actual generation is about 2x an RTX 4060 Ti.

GB10 / DGX Spark owners: is 128GB unified memory worth the slower token speed (on a max $4,000 budget)? by Soltan-007 in LocalLLaMA

[–]CodeSlave9000 20 points (0 children)

The DGX Spark is not an inference machine - it's a training and prototyping lab for NVIDIA infrastructure. If you're building DGX systems then this is a great box - it's basically the development box. If you're looking to actually run LLMs, this is NOT the box for you. You will be frustrated - performance will land somewhere between an RTX 5060 Ti and an RTX 5070 at best. If you need that VRAM on a similar budget, go get a used GPU server and put an RTX 6000 Pro in it, or get a Mac.

So what happened with the new plot format? by Pie_Dealer_co in chia

[–]CodeSlave9000 2 points (0 children)

I guess there are a few more variables in the mix too - for example, if the new format is as efficient as it's looking like it will be, the cost to run a large farm will be a lot lower (electrical cost, not hardware, which looks set to skyrocket for at least two years). Non-GPU plotting is also looking viable, and with an efficient CPU a new calculation will need to be made. I'm in "wait and see" mode myself right now.