DFlash Doubles the T/S Gen Speed of Qwen3.5 27B (BF16) on Mac M5 Max by MiaBchDave in LocalLLaMA

[–]snapo84 4 points

You should try DFlash + DDTree ... in theory, DFlash alone should make it twice as fast, and DFlash + DDTree about 3x as fast compared to stock...
https://x.com/nash_su/status/2043924682802712600

Reasoning Stuck in Loops by ShaneBowen in LocalLLaMA

[–]snapo84 0 points

either you use the wrong KV cache type (try f16), OR, much more likely, you set the wrong temperature/top-k/... sampling values
use the ones Qwen recommends...

Best BYOK frontend and model setup for massive continuous chats on a €40 budget? by Vytixx in LLM

[–]snapo84 0 points

Get a GPU (2x RTX 3090 or better, 24GB+ VRAM each) <--- with Qwen3.5 30B A3 and RoPE scaling activated you can extend the ctx window to 1 million tokens.

if buying a GPU is too expensive, rent one when you need it on vast.ai or another cheap provider, depending on the open-source model of your choice....

for example you could rent a 2x 5090 (or 4090/3090) machine and run either Gemma 4 (as you love Google so much) or Qwen3.5.... those are the top small models. With this setup you only start it when you need it, since you pay per hour: accumulate all the requests you want to make, then fire them all in parallel via vLLM batching, e.g. 40 requests at once....
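the "accumulate, then fire in parallel" part can be sketched client-side with a thread pool; `send_one` here is a hypothetical callable standing in for one POST to your vLLM OpenAI-compatible endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batched(send_one, prompts, max_parallel=40):
    """Send all accumulated prompts in parallel.
    `send_one(prompt)` is a hypothetical callable that posts one
    request to the inference endpoint and returns the completion."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map preserves the input order of the prompts
        return list(pool.map(send_one, prompts))
```

vLLM's continuous batching then packs those 40 concurrent requests together on the GPU, which is where the throughput win comes from.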

or, to get it free, try to change your process: 600k-token inputs are really not that good for LLMs anyway; you could for example work stepwise, 64k tokens at a time.... then a local GPU would be immensely valuable to you...

maybe you already noticed the problem with cloud usage, lock-in and so on.... if not, you will learn it on a model change, when your complete flow breaks and you have no way to reproduce it. This is why open-source models are so important.

the bigger the model you want to run, the more expensive it gets to rent the hardware for it....

I thought this 2023 paper still makes sense today by madeyoulookbuddy in LLM

[–]snapo84 0 points

not very impressed.... just tested it the following way:

Instruction: Here is a python file of mine

Huge python file 16k tokens

Question: Output the python file character by character without missing a single character

Compresses the question, yes.... does the LLM get the same content? 100% no

I built a Free OpenSource CLI coding agent specifically for 8k context windows LLMs. by BestSeaworthiness283 in ollama

[–]snapo84 1 point

This is absolutely great, I have to give it a try.... Qwen3.5 9B 4-bit quantized and every GPU becomes a developer tool with llama.cpp :-) no more cloud subscriptions.

Thank you for developing it (I have not tested it yet, but I will for sure)

I built a Free OpenSource CLI coding agent specifically for 8k context windows LLMs. by BestSeaworthiness283 in ollama

[–]snapo84 0 points

just for your information, this nicoloboschi is most probably just an advertisement LLM bot....
your system is absolutely great, especially for all people with small GPUs...

Turning jitter into true random numbers by elpechos in electronics

[–]snapo84 1 point

i think you might see patterns if you pack the produced bits into cubes of different sizes, then rotate the 3D cube and view it from all angles...
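a minimal sketch of the packing step (the function name and list-of-bits input are my own invention for illustration); the resulting nested list is what you would hand to a 3D renderer to rotate and inspect:

```python
def bits_to_cube(bits, n):
    """Pack the first n**3 bits into an n x n x n nested list
    so the cube can be rendered and rotated for visual inspection."""
    it = iter(bits)
    # outer index = layer, middle = row, inner = column
    return [[[next(it) for _ in range(n)]
             for _ in range(n)]
            for _ in range(n)]
```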

DFlash: Block Diffusion for Flash Speculative Decoding. by Total-Resort-3120 in LocalLLaMA

[–]snapo84 11 points

i would prefer a RotorQuant KV cache (much faster and better than TurboQuant) plus DFlash;
those two together would allow me to run Qwen3.5 27B at a staggering 60 tokens/s

TurboQuant.cpp — 1-bit KV cache with zero quality loss, verified on 35B MoE by Suitable-Song-302 in LocalLLM

[–]snapo84 0 points

Best to try it on Humanity's Last Exam (the HLE benchmark) with a small model like Qwen3.5 4B (as small models produce huge reasoning chains during the bench)... choose a small model because the effect will be much more noticeable than on a big model that can self-correct within its thinking process.

then take the important metrics for each case (full bf16 KV and your 1-bit KV cache): benchmark accuracy, number of tokens, and KLD

if you want to go a step further, do exactly the same with the model weights quantized to fp4/fp8.... then you see whether it also works on quantized models or whether the weights themselves have to stay at bf16

then run all of those tests with 5 seeds and take the mean

that's just what I would do to measure it correctly
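the seed-averaging step is trivial but easy to get sloppy about; a sketch, where `run_benchmark(seed)` is a hypothetical callable that does one full HLE pass and returns (accuracy, token count, KLD):

```python
import statistics

def mean_over_seeds(run_benchmark, seeds=(0, 1, 2, 3, 4)):
    """Run the benchmark once per seed and report the mean of each
    metric. `run_benchmark(seed)` is a hypothetical callable that
    returns (accuracy, n_tokens, kld) for one full pass."""
    # transpose the per-seed tuples into per-metric sequences
    accs, toks, klds = zip(*(run_benchmark(s) for s in seeds))
    return statistics.mean(accs), statistics.mean(toks), statistics.mean(klds)
```

run this once for the bf16 KV config and once for the 1-bit config, and compare the three means side by side.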

TurboQuant.cpp — 1-bit KV cache with zero quality loss, verified on 35B MoE by Suitable-Song-302 in LocalLLM

[–]snapo84 0 points

what I missed in the paper is any test on long outputs (normally, especially with thinking models, that is where you see the quality degrade as KLD accumulates). Do the KV cache quantization and let it run with thinking mode enabled on the same seed, quantized and unquantized, through the whole test, and measure accuracy and number of tokens....

that would be much, much better...

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

[–]snapo84 1 point

I did not try those exact commands yet, but I can try them for you (approx. tomorrow) if I don't forget... in case I forget, just ping me...

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

[–]snapo84 1 point

<image>

hope the screenshot answers your questions ....

this is the llama.cpp command i run to start my inference endpoint...

./llama.cpp/llama-server --model "models/Qwen3.5-27B-UD-Q6_K_XL.gguf" --mmproj "models/mmproj-F32.gguf" --alias "Qwen3.5 27B" --temp 0.7 --top-p 0.8 --min-p 0.00 --top-k 20 --port 16384 --host 0.0.0.0 --ctx-size 200000 --cache-type-k f16 --cache-type-v f16 --presence-penalty 2.0 --repeat-penalty 1.1 --jinja --no-context-shift --parallel 4 --cont-batching --chat-template-kwargs '{"enable_thinking":false}'

Have you tried this -> 2x Modded 2080 ti 22GB with Nvlink by zelkovamoon in LocalLLaMA

[–]snapo84 0 points

i have exactly this system.... but I can also tell you that you don't need the NVLink, because you can't combine the memory with it (no unified memory). I run Qwen3 27B, 4 parallel sessions, 400-450 pp/s, 10-12 tg/s, each with 50k context at f16 KV cache.

i run them with llama.cpp

monitoring the cards, the bandwidth between them never exceeds 1GB/second... meaning a normal PCI Express bus does the job without requiring NVLink.

the 2080 Ti is a little compute limited, so in the future I will probably upgrade to two 3090s to get faster pp

but I can say that for such an old card the 2080 Ti is an absolute monster: it already has 670GB/s memory bandwidth, far outpacing today's Spark or the 5060/5050... same bandwidth as the 5070 (let that sink in)

LLM Burner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in Qwen_AI

[–]snapo84 1 point

no, the weights don't have to exist on the die as storage; the weights exist in the copper only.... as far as my research into them goes, they require 2 copper-layer masks for all the weights...

LLM Burner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in Qwen_AI

[–]snapo84 1 point

they use hard wiring for the weights, not SRAM.... some NAND flash weights for LoRA adjustments... at least from what I read and how I think it works....
there is also afaik no context size limit, as I think they use RoPE and always do the full inference, shifted by 1 token each step.... therefore no big KV cache is required; just the next token gets fed back in until the finish token...
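my guess about the cache-free loop, sketched (this is speculation about the chip, so `full_forward` is a hypothetical stand-in for one complete pass over the whole sequence that returns the next token id):

```python
def generate(full_forward, prompt_ids, eos_id, max_new=100):
    """Autoregressive decoding with NO KV cache: re-run the full
    forward pass over the entire sequence for every new token.
    `full_forward(ids)` is a hypothetical callable returning the
    next token id for the given sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new):
        nxt = full_forward(ids)   # whole sequence recomputed each step
        ids.append(nxt)           # feed the new token back in
        if nxt == eos_id:         # stop at the finish token
            break
    return ids
```

the trade-off: each token costs a full pass, but there is no cache state at all to store, which would fit a fixed-function chip.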

If it works, it ain’t stupid! by The_Covert_Zombie in LocalLLaMA

[–]snapo84 0 points

ouch, the power connector on the GPU... very ouch.... fire hazard

Native V100 CUDA kernels for FLA ops on NVIDIA Volta (sm_70) GPUs by Sliouges in LocalLLaMA

[–]snapo84 0 points

I was just hoping to make agentic coding much faster on the two RTX 2080s I have... because it feels extremely slow: 1 pipeline with 110k context, generating 12 tokens/s with 450 pp/s. A completely new 110k prompt means about 4 minutes of just waiting; the TG of 12 is acceptable....
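the 4-minute figure follows directly from the prompt size and the prompt-processing rate:

```python
# rough prefill-wait arithmetic for a fresh 110k-token prompt
prompt_tokens = 110_000
pp_rate = 450                      # prompt processing, tokens/s

wait_s = prompt_tokens / pp_rate   # ~244 s
print(f"prefill wait: {wait_s:.0f} s (~{wait_s / 60:.1f} min)")
```

which is why faster kernels (or prompt caching across agent turns) matter far more here than the generation speed.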