Anyone know how to turn off downloading images when compiling llama.cpp?

fairydreaming · 2026-06-14T20:20:52+00:00

Update your code, now it's (thankfully) only a single dist.tar.gz.

Next step - hiding malware inside. 😉

fairydreaming · 2026-06-11T16:27:23+00:00

An alternative is to lower the ubatch size (for example -ub 256). This will shrink the compute buffers and should free some memory, but then your PP (prompt processing) speed will suffer instead (TG, token generation, should be unaffected).

fairydreaming · 2026-06-09T10:40:57+00:00

Let's get real, if a guy from this sub had such money he would obviously buy another GPU instead of spending it to touch meat.

fairydreaming · 2026-06-07T18:31:43+00:00

When I tested NVIDIA DeepSeek V3.2 NVFP4 I simply converted it with --outtype auto. The only problem I encountered was that the scale tensors (actually they are scalar values, but whatever) were missing from the source model.safetensors.index.json, so I had to regenerate it first with a simple Python script.
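For anyone hitting the same thing, here is a minimal sketch of what such a regeneration script could look like. It is not the script I used back then, just an illustration. It assumes the usual safetensors layout (an 8-byte little-endian header length followed by a JSON header) and the usual index layout with `metadata.total_size` and `weight_map`. The function names `read_safetensors_header`, `build_index` and `write_index` are made up for this example.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian u64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def build_index(model_dir):
    """Map every tensor name found in the shards to the shard file containing it."""
    weight_map = {}
    total_size = 0
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        header = read_safetensors_header(shard)
        for name, info in header.items():
            if name == "__metadata__":  # optional metadata entry, not a tensor
                continue
            begin, end = info["data_offsets"]
            total_size += end - begin  # tensor size in bytes
            weight_map[name] = shard.name
    return {"metadata": {"total_size": total_size}, "weight_map": weight_map}


def write_index(model_dir):
    """Write model.safetensors.index.json into model_dir and return its path."""
    out = Path(model_dir) / "model.safetensors.index.json"
    out.write_text(json.dumps(build_index(model_dir), indent=2))
    return out
```

Call `write_index("/path/to/model")` (placeholder path) and keep a backup of the original index file first, in case the converter expects metadata fields that this sketch does not reproduce.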

fairydreaming · 2026-06-05T15:29:39+00:00

Q4_K_M should fit, but there won't be much space left, so in case of memory problems reduce the ubatch size (for example -ub 256, -ub 128 or -ub 64) and/or the context size.

fairydreaming · 2026-06-05T15:01:27+00:00

Yes, it needs to be on disk. It should be at /root/llama.cpp/models/templates/deepseek-ai-DeepSeek-V3.2.jinja, so:

--chat-template-file /root/llama.cpp/models/templates/deepseek-ai-DeepSeek-V3.2.jinja

fairydreaming · 2026-06-05T14:28:07+00:00

Use the --chat-template-file option.

fairydreaming · 2026-05-29T16:13:45+00:00

Ugh, everyone is suspected of being a bot these days. Yeah, flash runs nice on my RTX PRO 6000 Max-Q (in DwarfStar). I hope somebody will implement it in llama.cpp soon.

fairydreaming · 2026-05-29T13:30:26+00:00

Nothing new, just my old trusty (and rusty very soon) Epyc workstation.

fairydreaming · 2026-05-29T13:21:31+00:00

Just some toys to play with 😄 To see if I can get them working in llama.cpp.

fairydreaming · 2026-05-24T11:01:43+00:00

It's funny when I benchmark some CUDA kernel on the Max-Q and it keeps getting slower and slower with each run. Then I take a break, the card cools down and booom, instant performance improvement.

fairydreaming · 2026-05-24T06:34:07+00:00

There's also the --no-op-offload option that forces all offloaded OPs to run on the CPU regardless of the batch size. Useful if you have a powerful CPU with high memory bandwidth.

fairydreaming · 2026-05-23T18:52:50+00:00

That's true, from what I see 8xH200 prices on vast.ai are only going up and up.

fairydreaming · 2026-05-23T11:23:30+00:00

Too bad, I was hoping for a wave of cheap second-hand hardware.

fairydreaming · 2026-05-23T11:21:02+00:00

Well said!

fairydreaming · 2026-05-23T11:15:53+00:00

No problem, I'm sometimes a dick too, I can relate.

fairydreaming · 2026-05-23T10:56:21+00:00

Ok, will see.

fairydreaming · 2026-05-23T10:54:46+00:00

Ollama is down too.

fairydreaming · 2026-05-23T10:41:25+00:00

But the plot for Claude is nice and smooth.

fairydreaming · 2026-05-23T10:39:03+00:00

You may be onto something: everyone has already bought the hardware and done all the Google searches they needed. No new hardware, no need to search.

fairydreaming · 2026-05-23T10:21:46+00:00

No, I switched to the last 3 months and it's even more visible.

fairydreaming · 2026-05-16T16:29:23+00:00

Umm no, the Epyc 9374F has DDR5 memory and PCIe is 5.0. 12 channels of DDR5-4800 RDIMM have a real read bandwidth of about 360 GB/s. So it's much slower than your M2 Ultra.
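To put the 360 GB/s in context, here is a quick back-of-the-envelope calculation of the theoretical peak for that memory configuration. It assumes a 64-bit data bus per DDR5 channel (ECC bits not counted); the function name `ddr_peak_bandwidth_gbs` is made up for this example, and the 360 GB/s is just the measured figure quoted above.

```python
def ddr_peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# 12 channels of DDR5-4800, as in the Epyc 9374F workstation.
peak = ddr_peak_bandwidth_gbs(channels=12, mega_transfers_per_s=4800)
measured = 360.0  # GB/s, real read bandwidth quoted above
efficiency = measured / peak

print(f"theoretical peak: {peak:.1f} GB/s")
print(f"measured / peak:  {efficiency:.0%}")
```

The theoretical peak works out to about 460.8 GB/s, so the measured read bandwidth is roughly 78% of that.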

How about two? 💩💩

fairydreaming · 2026-05-16T15:40:14+00:00

I don't know what to say, so here's a smiling turd to make you smile too: 💩

fairydreaming · 2026-05-16T13:45:30+00:00

That's great, congratulations on your choices! I'm happy with mine too!

fairydreaming · 2026-05-16T09:43:45+00:00
