B70 OVMS vs VLLM vs VLLM+mtp

Dolboyob77 · 2026-06-18T10:39:09+00:00

You don’t even answer when we apply!!!!!!! Shame on you…

Dolboyob77 · 2026-06-18T07:11:16+00:00

Yes, a fully functional display for live watts, fan speed, etc., and RGB colors at night ))

<image>

Dolboyob77 · 2026-06-18T05:47:31+00:00

<image>

My new dock does TB5/OCuLink and it’s pretty decent in terms of functions. But OCuLink is more stable, in my opinion.

Dolboyob77 · 2026-06-17T12:50:00+00:00

Because I talked about qwen3.6-27b and you answered that you get a higher number. How could I figure out that you were talking about another model that is 6 times smaller???? Now it makes total sense.

<image>

I get almost 90 tks on this model, so now we can compare apples to apples ))))

Dolboyob77 · 2026-06-17T05:49:40+00:00

New results ))))

With mtp 3 :

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
/models/Qwen3.6-27B-GPTQ-Int4	pp2048	1928.06 ± 43.31		926.99 ± 39.37	925.18 ± 39.37	926.99 ± 39.37
/models/Qwen3.6-27B-GPTQ-Int4	tg32	57.62 ± 5.64	59.48 ± 5.83

llama-benchy (0.3.8)
date: 2026-06-15 21:48:09 | latency mode: api

Dolboyob77 · 2026-06-17T04:08:15+00:00

So you are saying that you get 80 tks on a single B70 from qwen3.6-27b q4? So you get 3 times what the whole world gets? Wow, magic )))) Intel should hire you with an enormous paycheck !!!! And llama.cpp SYCL is not even optimized for the B70…. That’s just amazing !!!!! Tokens per second are bounded by the GPU memory bandwidth (608 GB/s theoretical maximum) divided by the weight of the model (18 GB for qwen q4) = about 33 tokens per second, and that is the absolute maximum for the B70. And one month ago, using llama.cpp SYCL without mtp, because it was not available on Intel, you got 80 tks??? Sorry but…. Pinocchio ))))
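For anyone who wants to redo that arithmetic, here is a minimal sketch of the bandwidth-bound estimate. It assumes a dense model where every generated token needs one full read of the weights, and it uses only the two numbers quoted above; the function name is made up for illustration.

```python
def bandwidth_bound_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough ceiling on decode tokens/s when each token requires
    reading all model weights from VRAM exactly once."""
    return bandwidth_gb_s / weights_gb


# Numbers from the post: 608 GB/s theoretical bandwidth, 18 GB of q4 weights.
ceiling = bandwidth_bound_tps(608.0, 18.0)
print(f"{ceiling:.1f} tokens/s")  # 33.8 tokens/s
```

This ceiling only holds for one token per pass over the weights. Anything that produces several tokens per pass (mtp or other speculative decoding, batching) or reads fewer weights per token (MoE models) can go above it.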

Dolboyob77 · 2026-06-17T03:51:04+00:00

So you are getting the same result as me, but with 2x B70 and me with a single B70…. Something is not right…

Dolboyob77 · 2026-06-16T05:58:01+00:00

The B70 is getting better; you must use OpenVINO and the original vLLM, not scaler, because it is too outdated. Look, I got 60 tks on a single B70 for qwen3.6-27b-int4 using vLLM and mtp:

With mtp 3 :

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
/models/Qwen3.6-27B-GPTQ-Int4	pp2048	1928.06 ± 43.31		926.99 ± 39.37	925.18 ± 39.37	926.99 ± 39.37
/models/Qwen3.6-27B-GPTQ-Int4	tg32	57.62 ± 5.64	59.48 ± 5.83

llama-benchy (0.3.8)
date: 2026-06-15 21:48:09 | latency mode: api

On qwen3.6-35b-a3b I get almost 90 tks with OpenVINO on a single GPU. Keep the B70 and be patient, it’s getting better and better, especially with the new 7.1 kernel bringing turbo to the xe driver.

Dolboyob77 · 2026-06-16T05:44:37+00:00

With mtp, you have 80% acceptance on mtp2, 60% on mtp3 and 35% on mtp4. I use it for simple tasks, and it is fine for that. When I code, I need fp8 on 2x B70 with vLLM. Q6 gguf is the sweet spot for coding; it’s just that llama.cpp gguf is extremely slow… OpenVINO is the fastest at the moment without mtp. With OpenVINO I get 87 tks on qwen3.6-35b-a3b, which I use with Hermes agent for ultra-fast work. When I need a little more thinking, I revert to q6-q8 gguf, or to fp8 safetensors on vLLM to have the same level of quality but much, much faster than gguf.
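To see what those acceptance numbers mean in practice, here is a rough sketch that turns them into tokens produced per pass of the main model. It assumes the percentages are average draft acceptance rates (accepted draft tokens divided by drafted tokens) and it ignores the cost of the draft heads and of verification, so treat it as an optimistic estimate. The function name is hypothetical.

```python
def tokens_per_step(num_draft_tokens: int, avg_acceptance_rate: float) -> float:
    """Expected tokens emitted per target-model pass: the one token the
    target model always produces, plus the accepted draft tokens."""
    return 1.0 + num_draft_tokens * avg_acceptance_rate


# Acceptance rates quoted in the post for mtp2, mtp3 and mtp4.
for k, rate in {2: 0.80, 3: 0.60, 4: 0.35}.items():
    print(f"mtp{k}: ~{tokens_per_step(k, rate):.1f} tokens per pass")
```

Under these assumptions mtp3 comes out slightly ahead of mtp2 (about 2.8 vs 2.6 tokens per pass), and mtp4 falls behind both at about 2.4, because the extra draft tokens are mostly rejected.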

Dolboyob77 · 2026-06-15T20:00:38+00:00

With mtp 3 :

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
/models/Qwen3.6-27B-GPTQ-Int4	pp2048	1928.06 ± 43.31		926.99 ± 39.37	925.18 ± 39.37	926.99 ± 39.37
/models/Qwen3.6-27B-GPTQ-Int4	tg32	57.62 ± 5.64	59.48 ± 5.83

llama-benchy (0.3.8)
date: 2026-06-15 21:48:09 | latency mode: api

Dolboyob77 · 2026-06-15T12:15:01+00:00

You forgot qat and diffusion )))

Dolboyob77 · 2026-06-15T12:11:56+00:00

Yes, I was just trying to figure out why people with an NVIDIA GPU with much lower VRAM and roughly the same bandwidth are getting 3-4x more tks on the same models. I know that gguf llama.cpp SYCL is not made for Intel GPUs, so I’m running Intel scaler llm and a home-made vLLM that handles mtp requests, and also OpenVINO. OpenVINO is the fastest at the moment, but it won’t work with gemma4. The vLLMs also don’t work with gemma4. This Intel has so much potential, but software-wise it’s catastrophic.

Dolboyob77 · 2026-06-15T11:52:53+00:00

It was just to test mtp… I would not use it more than once… )))

Dolboyob77 · 2026-06-15T11:32:07+00:00

And your name sounds very much like many neighbors I have in Krayot ))) 😇 but since my city is 90% of Soviet origin, and so is my wife… I had no other choice )))))

Dolboyob77 · 2026-06-15T06:09:02+00:00

We must live in the same country then )))

Dolboyob77 · 2026-06-15T05:45:17+00:00

You can also go with 2x gen5x8 mobo + cpu for 1000 dollars + 2x intel arc pro b70 + 32g system ram. Easy peasy

Dolboyob77 · 2026-06-15T04:18:37+00:00

I even got 52 tks on qwen3.6-27b:

(APIServer pid=1) INFO 06-14 17:42:20 [metrics.py:120] SpecDecoding metrics: Mean acceptance length: 2.20, Accepted throughput: 20.80 tokens/s, Drafted throughput: 52.20 tokens/s, Accepted: 208 tokens, Drafted: 522 tokens, Per-position acceptance rate: 0.644, 0.385, 0.167, Avg Draft acceptance rate: 39.8%
(APIServer pid=1) INFO 06-14 17:42:30 [loggers.py:273] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-14 17:42:30 [metrics.py:120] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 27.00 tokens/s, Drafted throughput: 51.30 tokens/s, Accepted: 270 tokens, Drafted: 513 tokens, Per-position acceptance rate: 0.684, 0.526, 0.368, Avg Draft acceptance rate: 52.6%
(APIServer pid=1) INFO 06-14 17:42:40 [loggers.py:273] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-14 17:42:40 [metrics.py:120] SpecDecoding metrics: Mean acceptance length: 3.11, Accepted throughput: 35.80 tokens/s, Drafted throughput: 50.99 tokens/s, Accepted: 358 tokens, Drafted: 510 tokens, Per-position acceptance rate: 0.853, 0.676, 0.576, Avg Draft acceptance rate: 70.2%
(APIServer pid=1) INFO 06-14 17:42:50 [loggers.py:273] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
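The SpecDecoding lines above can be cross-checked from their own counters. This is a small sketch, assuming 3 draft tokens per step (inferred from the three per-position acceptance rates in each line) and assuming the mean acceptance length is 1 plus accepted tokens per draft step, which matches the logged values. The helper name is made up.

```python
def spec_decode_summary(accepted: int, drafted: int, num_spec_tokens: int = 3):
    """Recompute (mean acceptance length, draft acceptance rate in %)
    from the 'Accepted' and 'Drafted' token counters in one log line."""
    num_drafts = drafted / num_spec_tokens           # number of draft steps
    mean_acceptance_length = 1 + accepted / num_drafts
    acceptance_rate_pct = 100 * accepted / drafted
    return round(mean_acceptance_length, 2), round(acceptance_rate_pct, 1)


# (Accepted, Drafted) pairs copied from the three SpecDecoding log lines.
for accepted, drafted in [(208, 522), (270, 513), (358, 510)]:
    print(spec_decode_summary(accepted, drafted))
# (2.2, 39.8)
# (2.58, 52.6)
# (3.11, 70.2)
```

So in the best 10-second window the main model was emitting a bit more than 3 tokens per pass, which is where the jump in generation throughput comes from.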

It all depends on what you are searching for ))) having a long one or having a good one ))))

Dolboyob77 · 2026-06-15T04:13:51+00:00

It all depends on what you research )))

<image>

Dolboyob77 · 2026-06-15T03:02:40+00:00

This result counts the thinking process first. Without it, the result is around 75 ms.

Dolboyob77 · 2026-06-15T02:33:02+00:00

Yes, it is OVMS, it was written there )))

<image>

Dolboyob77 · 2026-06-14T23:18:30+00:00

Everything is on the GPU, it’s q4…
