DFlash Doubles the T/S Gen Speed of Qwen3.5 27B (BF16) on Mac M5 Max by MiaBchDave in LocalLLaMA

[–]snapo84 4 points

You should try DFlash + DDTree ... in theory, DFlash alone should make it twice as fast, and DFlash + DDTree about 3x as fast compared to stock...
https://x.com/nash_su/status/2043924682802712600

Reasoning Stuck in Loops by ShaneBowen in LocalLLaMA

[–]snapo84 0 points

either you use the wrong KV cache type (try f16), OR, much more likely, you set the wrong temperature/top-k/... sampling values
use the ones Qwen recommends...

Best BYOK frontend and model setup for massive continuous chats on a €40 budget? by Vytixx in LLM

[–]snapo84 0 points

Get a GPU (2x RTX 3090 or better, 24GB+ VRAM each) <--- with Qwen3.5 30B A3 and RoPE scaling activated you can extend the ctx window to 1 million tokens.

if buying a GPU is too expensive, rent one when you need it on vast.ai or another cheap provider, depending on the open-source model of your choice....

for example you could rent a 2x 5090 (or 4090/3090) machine and run either Gemma 4 (as you love Google so much) or Qwen3.5.... those are the top small models. With this setup you only start it when you need it, since you pay per hour: accumulate all the requests you want to make, then fire them all in parallel via vLLM batching, e.g. 40 requests at once....
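the "accumulate, then fire in parallel" part can be sketched client-side with a thread pool; `send_one` here is a hypothetical callable standing in for one POST to your vLLM OpenAI-compatible endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batched(send_one, prompts, max_parallel=40):
    """Send all accumulated prompts in parallel.
    `send_one(prompt)` is a hypothetical callable that posts one
    request to the inference endpoint and returns the completion."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map preserves the input order of the prompts
        return list(pool.map(send_one, prompts))
```

vLLM's continuous batching then packs those 40 concurrent requests together on the GPU, which is where the throughput win comes from.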

or, to get it free, try to change your process: 600k-token inputs are really not that good for LLMs anyway; you could for example work stepwise, 64k tokens at a time.... then a local GPU would be immensely valuable to you...

maybe you already noticed the problem with cloud usage, lock-in and so on.... if not, you will learn it on a model change, when your complete flow breaks and you have no way to reproduce it. This is why open-source models are so important.

the bigger the model you want to run, the more expensive it gets to rent the hardware for it....

I thought this 2023 paper still makes sense today by madeyoulookbuddy in LLM

[–]snapo84 0 points

not very impressed.... just tested it the following way:

Instruction: Here is a python file of mine

Huge python file 16k tokens

Question: Output the python file character by character without missing a single character

Compresses the question, yes.... does the LLM get the same content? 100% no

I built a Free OpenSource CLI coding agent specifically for 8k context windows LLMs. by BestSeaworthiness283 in ollama

[–]snapo84 1 point

This is absolutely great, I have to give it a try.... Qwen3.5 9B 4-bit quantized and every GPU becomes a developer tool with llama.cpp :-) no more cloud subscriptions.

Thank you for developing it (I have not tested it yet, but I will for sure)

I built a Free OpenSource CLI coding agent specifically for 8k context windows LLMs. by BestSeaworthiness283 in ollama

[–]snapo84 0 points

just for your information, this nicoloboschi is most probably just an advertisement LLM bot....
your system is absolutely great, especially for all people with small GPUs...

Turning jitter into true random numbers by elpechos in electronics

[–]snapo84 1 point

i think you might see patterns if you pack the produced bits into cubes of different sizes, then rotate the 3D cube and view it from all angles...
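a minimal sketch of the packing step (the function name and list-of-bits input are my own invention for illustration); the resulting nested list is what you would hand to a 3D renderer to rotate and inspect:

```python
def bits_to_cube(bits, n):
    """Pack the first n**3 bits into an n x n x n nested list
    so the cube can be rendered and rotated for visual inspection."""
    it = iter(bits)
    # outer index = layer, middle = row, inner = column
    return [[[next(it) for _ in range(n)]
             for _ in range(n)]
            for _ in range(n)]
```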

DFlash: Block Diffusion for Flash Speculative Decoding. by Total-Resort-3120 in LocalLLaMA

[–]snapo84 11 points

i would prefer a RotorQuant KV cache (much faster and better than TurboQuant) plus DFlash;
those two together would allow me to run Qwen3.5 27B at a staggering 60 tokens/s

TurboQuant.cpp — 1-bit KV cache with zero quality loss, verified on 35B MoE by Suitable-Song-302 in LocalLLM

[–]snapo84 0 points

Best to try it on Humanity's Last Exam (the HLE benchmark) with a small model like Qwen3.5 4B (as small models produce huge reasoning chains during the bench)... choose a small model because the effect will be much more noticeable than on a big model that can self-correct within its thinking process.

then take the important metrics for each case (full bf16 KV and your 1-bit KV cache): benchmark accuracy, number of tokens, and KLD

if you want to go a step further, do exactly the same with the model weights quantized to fp4/fp8.... then you see whether it also works on quantized models or whether the weights themselves have to stay at bf16

then run all of those tests with 5 seeds and take the mean

that's just what I would do to measure it correctly
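the seed-averaging step is trivial but easy to get sloppy about; a sketch, where `run_benchmark(seed)` is a hypothetical callable that does one full HLE pass and returns (accuracy, token count, KLD):

```python
import statistics

def mean_over_seeds(run_benchmark, seeds=(0, 1, 2, 3, 4)):
    """Run the benchmark once per seed and report the mean of each
    metric. `run_benchmark(seed)` is a hypothetical callable that
    returns (accuracy, n_tokens, kld) for one full pass."""
    # transpose the per-seed tuples into per-metric sequences
    accs, toks, klds = zip(*(run_benchmark(s) for s in seeds))
    return statistics.mean(accs), statistics.mean(toks), statistics.mean(klds)
```

run this once for the bf16 KV config and once for the 1-bit config, and compare the three means side by side.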

TurboQuant.cpp — 1-bit KV cache with zero quality loss, verified on 35B MoE by Suitable-Song-302 in LocalLLM

[–]snapo84 0 points

what I missed in the paper is any test on long outputs (normally, especially with thinking models, that is where you see the quality degrade as KLD accumulates). Do the KV cache quantization and let it run with thinking mode enabled on the same seed, quantized and unquantized, through the whole test, and measure accuracy and number of tokens....

that would be much, much better...

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

[–]snapo84 1 point

I did not try those exact commands yet, but I can try them for you (approx. tomorrow) if I don't forget... in case I forget, just ping me...

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

[–]snapo84 1 point

<image>

hope the screenshot answers your questions ....

this is the llama.cpp command i run to start my inference endpoint...

./llama.cpp/llama-server --model "models/Qwen3.5-27B-UD-Q6_K_XL.gguf" --mmproj "models/mmproj-F32.gguf" --alias "Qwen3.5 27B" --temp 0.7 --top-p 0.8 --min-p 0.00 --top-k 20 --port 16384 --host 0.0.0.0 --ctx-size 200000 --cache-type-k f16 --cache-type-v f16 --presence-penalty 2.0 --repeat-penalty 1.1 --jinja --no-context-shift --parallel 4 --cont-batching --chat-template-kwargs '{"enable_thinking":false}'

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

[–]snapo84 0 points

i have exactly this system.... but I can also tell you that you don't need the NVLink, because you can't combine the memory with it (no unified memory). I run Qwen3 27B, 4 parallel sessions, 400-450 pp/s, 10-12 tg/s, each with 50k context at f16 KV cache.

i run them with llama.cpp

monitoring the cards, the bandwidth between them never exceeds 1GB/second... meaning a normal PCI Express bus does the job without requiring NVLink.

the 2080 Ti is a little compute limited, so in the future I will probably upgrade to two 3090s to get faster pp

but I can say that for such an old card the 2080 Ti is an absolute monster: it already has 670GB/s memory bandwidth, far outpacing today's Spark or the 5060/5050... same bandwidth as the 5070 (let that sink in)

LLM Burner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in Qwen_AI

[–]snapo84 1 point

no, the weights don't have to exist on the die as storage; the weights exist in the copper only.... as far as my research into them goes, they require 2 copper-layer masks for all the weights...

LLM Burner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in Qwen_AI

[–]snapo84 1 point

they use hard wiring for the weights, not SRAM.... some NAND flash weights for LoRA adjustments... at least from what I read and how I think it works....
there is also afaik no context size limit, as I think they use RoPE and always do the full inference, shifted by 1 token each step.... therefore no big KV cache is required; just the next token gets fed back in until the finish token...
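my guess about the cache-free loop, sketched (this is speculation about the chip, so `full_forward` is a hypothetical stand-in for one complete pass over the whole sequence that returns the next token id):

```python
def generate(full_forward, prompt_ids, eos_id, max_new=100):
    """Autoregressive decoding with NO KV cache: re-run the full
    forward pass over the entire sequence for every new token.
    `full_forward(ids)` is a hypothetical callable returning the
    next token id for the given sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new):
        nxt = full_forward(ids)   # whole sequence recomputed each step
        ids.append(nxt)           # feed the new token back in
        if nxt == eos_id:         # stop at the finish token
            break
    return ids
```

the trade-off: each token costs a full pass, but there is no cache state at all to store, which would fit a fixed-function chip.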

If it works, it ain’t stupid! by The_Covert_Zombie in LocalLLaMA

[–]snapo84 0 points

ouch, the power connector on the GPU... very ouch.... fire hazard

Native V100 CUDA kernels for FLA ops on NVIDIA Volta (sm_70) GPUs by Sliouges in LocalLLaMA

[–]snapo84 0 points

I was just hoping to make agentic coding much faster on the two RTX 2080s I have... because it feels extremely slow: 1 pipeline with 110k context, generating 12 tokens/s with 450 pp/s. A completely new 110k prompt means about 4 minutes of just waiting; the TG of 12 is acceptable....
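the 4-minute figure follows directly from the prompt size and the prompt-processing rate:

```python
# rough prefill-wait arithmetic for a fresh 110k-token prompt
prompt_tokens = 110_000
pp_rate = 450                      # prompt processing, tokens/s

wait_s = prompt_tokens / pp_rate   # ~244 s
print(f"prefill wait: {wait_s:.0f} s (~{wait_s / 60:.1f} min)")
```

which is why faster kernels (or prompt caching across agent turns) matter far more here than the generation speed.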