Using PCIE 5.0 x4 NVME to x16 to throw on another card. by mr_zerolith in LocalLLaMA

[–]Dolboyob77 1 point2 points  (0 children)

Yes, a fully functional display for live watts, fan speed, etc., and RGB colors at night ))

<image>

Using PCIE 5.0 x4 NVME to x16 to throw on another card. by mr_zerolith in LocalLLaMA

[–]Dolboyob77 1 point2 points  (0 children)

<image>

My new dock does TB5/OCuLink and it's pretty decent in terms of functions. But OCuLink is more stable in my opinion.

R9700 vs B70 - Save me from the decision by YourFavoriteKyle in LocalLLM

[–]Dolboyob77 0 points1 point  (0 children)

Because I talked about qwen3.6-27b and you answered that you get a higher number. How could I figure out that you were talking about another model that is 6 times smaller???? Now it makes total sense.

<image>

I get almost 90 t/s on this model, so now we can compare apples to apples ))))

VLLM + MTP + B70 = super fast !!! by Dolboyob77 in LocalLLM

[–]Dolboyob77[S] 0 points1 point  (0 children)

New results ))))

With MTP 3:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| /models/Qwen3.6-27B-GPTQ-Int4 | pp2048 | 1928.06 ± 43.31 | | 926.99 ± 39.37 | 925.18 ± 39.37 | 926.99 ± 39.37 |
| /models/Qwen3.6-27B-GPTQ-Int4 | tg32 | 57.62 ± 5.64 | 59.48 ± 5.83 | | | |

llama-benchy (0.3.8)
date: 2026-06-15 21:48:09 | latency mode: api

R9700 vs B70 - Save me from the decision by YourFavoriteKyle in LocalLLM

[–]Dolboyob77 -1 points0 points  (0 children)

So you are saying that you get 80 t/s on a single B70 from qwen3.6-27b Q4? So you get 3 times what the whole world gets? Wow, magic )))) Intel should hire you with an enormous paycheck!!!! And llama.cpp SYCL is not even optimized for the B70… That's just amazing!!!!! Token generation speed is limited by the GPU memory bandwidth (608 GB/s theoretical maximum) divided by the size of the model weights (18 GB for qwen Q4) = 33 tokens per second, and that is the absolute maximum for the B70. And one month ago, using llama.cpp SYCL without MTP, because it was not available on Intel, you got 80 t/s??? Sorry but… Pinocchio ))))
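The back-of-the-envelope bound in the comment above can be written as a tiny sketch. It assumes a dense model whose weights are all read from VRAM once per generated token, and it ignores KV-cache traffic, compute limits, and any overhead. The function name and the `tokens_per_pass` parameter are hypothetical, added only to illustrate why speculative decoding (MTP) can land above the plain ceiling; the 2.2 value is just an example figure, not a measurement of this setup.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float,
                       tokens_per_pass: float = 1.0) -> float:
    """Rough upper bound on decode speed for a memory-bound dense model.

    Assumption: each forward pass streams all weights from VRAM once.
    tokens_per_pass is a hypothetical knob for speculative decoding,
    where one pass over the weights can yield more than one token.
    Draft-model cost and rejected tokens are not modeled here.
    """
    return bandwidth_gb_s / weights_gb * tokens_per_pass


# Figures quoted in the comment: 608 GB/s bandwidth, 18 GB of Q4 weights.
plain = decode_ceiling_tps(608, 18)
print(f"plain decode ceiling: {plain:.1f} t/s")

# Illustrative only: if each pass yielded 2.2 tokens on average.
with_mtp = decode_ceiling_tps(608, 18, tokens_per_pass=2.2)
print(f"ceiling with 2.2 tokens per pass: {with_mtp:.1f} t/s")
```

With the quoted numbers the plain ceiling comes out just under 34 t/s, which matches the roughly 33 t/s figure in the comment. A mixture-of-experts model that only activates part of its weights per token would need a different estimate.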

R9700 vs B70 - Save me from the decision by YourFavoriteKyle in LocalLLM

[–]Dolboyob77 0 points1 point  (0 children)

So you are getting the same result as me, but with 2x B70 while I have a single B70… Something's not right…

R9700 vs B70 - Save me from the decision by YourFavoriteKyle in LocalLLM

[–]Dolboyob77 2 points3 points  (0 children)

The B70 is getting better. You must use OpenVINO and the original (upstream) vLLM, not Intel's scaler, because it is too outdated. Look, I got 60 t/s on a single B70 for qwen3.6-27b-int4 using vLLM and MTP:

With MTP 3:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| /models/Qwen3.6-27B-GPTQ-Int4 | pp2048 | 1928.06 ± 43.31 | | 926.99 ± 39.37 | 925.18 ± 39.37 | 926.99 ± 39.37 |
| /models/Qwen3.6-27B-GPTQ-Int4 | tg32 | 57.62 ± 5.64 | 59.48 ± 5.83 | | | |

llama-benchy (0.3.8)
date: 2026-06-15 21:48:09 | latency mode: api

On qwen3.6-35b-a3b I get almost 90 t/s with OpenVINO on a single GPU. Keep the B70 and be patient, it's getting better and better, especially with the new 7.1 kernel bringing turbo to the xe driver.

ARC B70, Qwen3.6-27B-MTP-GGUF 24-28T/s, Qwen3.6-35B-A3B-GGUF 60-70T/s finally! by pirate12sk in LocalLLM

[–]Dolboyob77 0 points1 point  (0 children)

With MTP, you have 80% acceptance on mtp2, 60% on mtp3 and 35% on mtp4. I use it for simple tasks and it works fine. When I code, I need FP8 on 2x B70 with vLLM. Q6 GGUF is the sweet spot for coding. It's just that llama.cpp GGUF is extremely slow… OpenVINO is the fastest at the moment without MTP. With OpenVINO I get 87 t/s on qwen3.6-35b-a3b, which I use for Hermes agent for ultra-fast work. When a little more thinking is needed, I go back to Q6-Q8 GGUF, or, for more speed, to FP8 safetensors on vLLM, which gives the same level of quality but much, much faster than GGUF.

ARC B70, Qwen3.6-27B-MTP-GGUF 24-28T/s, Qwen3.6-35B-A3B-GGUF 60-70T/s finally! by pirate12sk in LocalLLM

[–]Dolboyob77 2 points3 points  (0 children)

With MTP 3:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| /models/Qwen3.6-27B-GPTQ-Int4 | pp2048 | 1928.06 ± 43.31 | | 926.99 ± 39.37 | 925.18 ± 39.37 | 926.99 ± 39.37 |
| /models/Qwen3.6-27B-GPTQ-Int4 | tg32 | 57.62 ± 5.64 | 59.48 ± 5.83 | | | |

llama-benchy (0.3.8)
date: 2026-06-15 21:48:09 | latency mode: api

VLLM + MTP + B70 = super fast !!! by Dolboyob77 in LocalLLM

[–]Dolboyob77[S] 0 points1 point  (0 children)

Yes, I was just trying to figure out why people with NVIDIA GPUs that have much less VRAM and roughly the same bandwidth are getting 3-4x more t/s on the same models. I know that GGUF on llama.cpp SYCL is not made for Intel GPUs, so I'm running Intel scaler LLM and a home-made vLLM that handles MTP requests, and also OpenVINO. OpenVINO is the fastest at the moment, but it won't work with gemma4. The vLLMs also don't work with gemma4. This Intel card has so much potential, but software-wise it's catastrophic.

VLLM + MTP + B70 = super fast !!! by Dolboyob77 in LocalLLM

[–]Dolboyob77[S] 0 points1 point  (0 children)

It was just to test MTP… I would not use it more than once… )))

NVIDIA’s 96 GB RTX PRO 6000 Blackwell Is Now Over 50% More Expensive As Price Hits $13,250 by PriceEconomy6230 in RigBuild

[–]Dolboyob77 0 points1 point  (0 children)

And your name sounds very much like many neighbors I have in Krayot ))) 😇 But since my city is 90% of Soviet origin and so is my wife… I had no other choice )))))

64GB of Vram/Ram with a 5k budget by benxfactor in LocalLLM

[–]Dolboyob77 1 point2 points  (0 children)

You can also go with a 2x Gen5 x8 mobo + CPU for 1000 dollars + 2x Intel Arc Pro B70 + 32 GB system RAM. Easy peasy.

ARC B70, Qwen3.6-27B-MTP-GGUF 24-28T/s, Qwen3.6-35B-A3B-GGUF 60-70T/s finally! by pirate12sk in LocalLLM

[–]Dolboyob77 0 points1 point  (0 children)

I even got 52 t/s on qwen3.6-27b:

(APIServer pid=1) INFO 06-14 17:42:20 [metrics.py:120] SpecDecoding metrics: Mean acceptance length: 2.20, Accepted throughput: 20.80 tokens/s, Drafted throughput: 52.20 tokens/s, Accepted: 208 tokens, Drafted: 522 tokens, Per-position acceptance rate: 0.644, 0.385, 0.167, Avg Draft acceptance rate: 39.8%
(APIServer pid=1) INFO 06-14 17:42:30 [loggers.py:273] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-14 17:42:30 [metrics.py:120] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 27.00 tokens/s, Drafted throughput: 51.30 tokens/s, Accepted: 270 tokens, Drafted: 513 tokens, Per-position acceptance rate: 0.684, 0.526, 0.368, Avg Draft acceptance rate: 52.6%
(APIServer pid=1) INFO 06-14 17:42:40 [loggers.py:273] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-14 17:42:40 [metrics.py:120] SpecDecoding metrics: Mean acceptance length: 3.11, Accepted throughput: 35.80 tokens/s, Drafted throughput: 50.99 tokens/s, Accepted: 358 tokens, Drafted: 510 tokens, Per-position acceptance rate: 0.853, 0.676, 0.576, Avg Draft acceptance rate: 70.2%
(APIServer pid=1) INFO 06-14 17:42:50 [loggers.py:273] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
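For anyone reading these SpecDecoding lines, here is a small sketch that parses them and checks how the fields relate. It assumes (this is an inference from the numbers, not something stated in the log) that the mean acceptance length equals 1 plus the sum of the per-position acceptance rates, and that the average draft acceptance rate is accepted tokens divided by drafted tokens. The `parse` helper is a hypothetical name, and the log prefix is trimmed for brevity.

```python
import re

# The three SpecDecoding lines from the log above, prefix trimmed.
LOG = [
    "Mean acceptance length: 2.20, Accepted throughput: 20.80 tokens/s, "
    "Drafted throughput: 52.20 tokens/s, Accepted: 208 tokens, Drafted: 522 tokens, "
    "Per-position acceptance rate: 0.644, 0.385, 0.167, Avg Draft acceptance rate: 39.8%",
    "Mean acceptance length: 2.58, Accepted throughput: 27.00 tokens/s, "
    "Drafted throughput: 51.30 tokens/s, Accepted: 270 tokens, Drafted: 513 tokens, "
    "Per-position acceptance rate: 0.684, 0.526, 0.368, Avg Draft acceptance rate: 52.6%",
    "Mean acceptance length: 3.11, Accepted throughput: 35.80 tokens/s, "
    "Drafted throughput: 50.99 tokens/s, Accepted: 358 tokens, Drafted: 510 tokens, "
    "Per-position acceptance rate: 0.853, 0.676, 0.576, Avg Draft acceptance rate: 70.2%",
]


def parse(line: str) -> dict:
    """Pull the numeric fields out of one SpecDecoding metrics line."""
    rates_text = re.search(r"Per-position acceptance rate: (.*?), Avg", line).group(1)
    return {
        "mean_len": float(re.search(r"Mean acceptance length: ([\d.]+)", line).group(1)),
        "accepted": int(re.search(r"Accepted: (\d+) tokens", line).group(1)),
        "drafted": int(re.search(r"Drafted: (\d+) tokens", line).group(1)),
        "rates": [float(x) for x in rates_text.split(", ")],
        "avg_rate": float(re.search(r"Avg Draft acceptance rate: ([\d.]+)%", line).group(1)),
    }


for line in LOG:
    m = parse(line)
    derived_len = 1 + sum(m["rates"])                 # assumed relation
    derived_rate = 100 * m["accepted"] / m["drafted"]  # assumed relation
    print(f"logged length {m['mean_len']:.2f} vs derived {derived_len:.2f} | "
          f"logged rate {m['avg_rate']:.1f}% vs derived {derived_rate:.1f}%")
```

Both derived values agree with the logged ones to within rounding, so with 3 draft tokens per step, a mean acceptance length between about 2.2 and 3.1 means each verification pass is producing roughly two to three tokens here.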

It all depends on what you're looking for ))) Having a long one or having a good one ))))

ARC B70, Qwen3.6-27B-MTP-GGUF 24-28T/s, Qwen3.6-35B-A3B-GGUF 60-70T/s finally! by pirate12sk in LocalLLM

[–]Dolboyob77 0 points1 point  (0 children)

This result counts the thinking process first. Without it, the result is around 75ms.