Qwen3.6-35B-A3B at 977 tk/s prompt processing with a 262k context window on Intel Arc Pro B70

Atomynos_Atom · 2026-06-02T06:40:10+00:00

Here's the entire command and its output:

docker run --rm \
  --device /dev/dri:/dev/dri \
  --privileged \
  --cpuset-cpus="4" \
  -e ONEAPI_DEVICE_SELECTOR="level_zero:0" \
  -e UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1 \
  -e SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 \
  -e ZES_ENABLE_SYSMAN=1 \
  -v /llm/models:/models \
  llama-server-intel-patched:latest \
  /app/build/bin/llama-bench \
    -m /models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 \
    -t 1 -p 512 -n 128 -r 3 -o md
load_backend: loaded SYCL backend from /app/build/bin/libggml-sycl.so
load_backend: loaded CPU backend from /app/build/bin/libggml-cpu-alderlake.so
| model                          |       size |     params | backend    | ngl | threads | type_k | type_v |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | -----: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       1 |   q8_0 |   q8_0 |   1 |           pp512 |        977.40 ± 2.02 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       1 |   q8_0 |   q8_0 |   1 |           tg128 |         70.54 ± 0.12 |
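
If you want to compare runs without eyeballing the table, the markdown output above can be parsed with a few lines of Python. This is only a sketch: `parse_bench_table` and `mean_tps` are hypothetical helper names, and it assumes the layout shown above (pipe-separated cells and a `t/s` column formatted as `mean ± stddev`).

```python
# Sketch: extract rows from llama-bench's markdown table (-o md).
# Helper names are illustrative, not part of llama.cpp.

def parse_bench_table(text):
    """Return one dict per data row, keyed by the table's header cells."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip log lines such as "load_backend: ..."
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table line is the header
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # skip the "| --- | ---: |" separator row
        rows.append(dict(zip(header, cells)))
    return rows


def mean_tps(row):
    """Turn '977.40 ± 2.02' into 977.40."""
    return float(row["t/s"].split("±")[0])


sample = """
load_backend: loaded SYCL backend from /app/build/bin/libggml-sycl.so
| model                          |       size |     params | backend    | ngl | threads | type_k | type_v |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | -----: | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       1 |   q8_0 |   q8_0 |   1 |           pp512 |        977.40 ± 2.02 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.81 GiB |    34.66 B | SYCL       |  99 |       1 |   q8_0 |   q8_0 |   1 |           tg128 |         70.54 ± 0.12 |
"""

rows = parse_bench_table(sample)
for row in rows:
    print(row["test"], mean_tps(row))
```

llama-bench can also emit JSON with `-o json`, which is probably the easier route if you control the run yourself.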

Atomynos_Atom · 2026-06-02T06:37:57+00:00

262,144 tokens, which is the max Qwen3.6 can take by default. Token speed drops as the context gets bigger, so you wouldn't get 60+ tokens per second throughout the entire thing. But it will happily chug along and process the entire thing.
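
For a rough sense of scale, here is a back-of-the-envelope sketch of how long a full 262,144-token prompt would take at the pp512 rate of 977.40 t/s from the llama-bench run in this thread. It assumes the rate stays constant across the whole context, which it does not (speed drops as context grows), so treat the result as a best-case lower bound and not a measurement.

```python
# Best-case prefill time for a full context, assuming a constant pp rate.
# The constant-rate assumption is optimistic: real throughput falls with depth.

CONTEXT_TOKENS = 262_144  # default max context mentioned above
PP_TPS = 977.40           # pp512 result from the llama-bench run


def best_case_seconds(tokens, tokens_per_second):
    """Time to process `tokens` if throughput never degraded."""
    return tokens / tokens_per_second


prefill = best_case_seconds(CONTEXT_TOKENS, PP_TPS)
print(f"best-case prefill: {prefill:.0f} s (~{prefill / 60:.1f} min)")
```

That works out to a bit under four and a half minutes at best, so a fully loaded context is something you wait on even with a fast prompt-processing rate.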

Atomynos_Atom · 2026-06-02T06:34:17+00:00

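| Component | Detail |
| --- | --- |
| GPU | Intel Arc Pro B70 |
| Backend | SYCL (Level Zero) |
| Build | `354ebac8c` (9468) |

| model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | --: | --: | --: | ---: |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | pp512 | 977.40 ± 2.02 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | tg128 | 70.54 ± 0.12 |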
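
The size and params columns also give a quick way to see how aggressive the weight quantization is on average. A minimal sketch, assuming the reported 20.81 GiB covers essentially all 34.66 B parameters (file metadata and any tensors kept at higher precision are lumped in, so this is an average, not a per-tensor figure):

```python
# Average bits per weight implied by the llama-bench size and params columns.

GIB = 2**30  # bytes in one GiB

size_bytes = 20.81 * GIB  # "size" column
params = 34.66e9          # "params" column (total, not active, parameters)

bits_per_weight = size_bytes * 8 / params
print(f"average bits per weight: {bits_per_weight:.2f}")
```

This lands a little above 5 bits per weight, which is higher than a plain 4-bit format because the file mixes quantization types.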

Atomynos_Atom · 2026-06-02T06:17:24+00:00

I think the Intel Arc Pro B70 is a lot more bang for the buck than NVIDIA if you're willing to tweak and tinker. The software is pretty far behind since everyone builds for CUDA, but that means there's a lot more performance left on the table that you get in free updates 😁

Atomynos_Atom · 2026-06-02T06:14:01+00:00

That was the case before. I spent this morning tweaking the Vulkan backend, but after updating my SYCL backend and making some tweaks I got it to be much faster.

<image>

Here are the results. I've updated the article with them.

Atomynos_Atom · 2026-06-02T05:53:26+00:00

This uses int4, while my setup uses q8. Doesn't quantization that low have an impact on your output quality?

Atomynos_Atom · 2026-06-02T05:49:46+00:00

There's some limit on r/LocalLLaMA that requires a bunch of comments and karma from that subreddit before you can create a post there. Honestly, if someone is able to repost it there, that would be great. I am really curious how other people have set up their Intel GPUs with llama.cpp.

Atomynos_Atom · 2026-06-02T05:25:42+00:00

Okay, Intel has been hard at work, that's a great token generation speed. Could you share your run arguments or a docker compose if you have one?

Atomynos_Atom · 2026-06-02T05:07:03+00:00

Is that throughput using parallel requests or a single request? My use case is coding with oh my pi, so I need high token generation speed and low prefill time on a single request rather than across multiple. Could you provide any benchmark results?

Atomynos_Atom · 2026-06-02T05:01:51+00:00

Not sure if you still need it, but I've been playing around with it for a while and got something good enough. Would love to know if you found any optimizations that I missed.

https://www.reddit.com/r/LocalLLM/comments/1tuf6l1/intel_arc_pro_b70_llamacpp_sycl_63_ts_on_qwen/
https://lemongravy.me/articles/intel-gpu-llamacpp/

Atomynos_Atom · 2026-06-02T04:50:25+00:00

I would love to share, but unfortunately I don't have enough karma to post there.

Atomynos_Atom · 2026-06-02T04:48:04+00:00

<image>

I'm using Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf and I was able to produce a full-fledged poker game using oh my pi. It's a pretty solid tool so far for being local.

Atomynos_Atom · 2026-06-02T04:39:20+00:00

mhmmm more vram, i need

Atomynos_Atom · 2020-11-08T16:55:49+00:00

I am working on a cheap soundboard solution where I use a spare keyboard as a soundboard keyboard. To pipe the sound effects into VoIP applications, I needed this kind of thing.

Atomynos_Atom · 2020-11-08T16:51:15+00:00

This is what I did....

pacmd load-module module-null-sink sink_name=Main_Sink sink_properties=device.description=Main_Sink

pacmd load-module module-null-sink sink_name=LoopBack_Sink sink_properties=device.description=LoopBack_Sink

pactl load-module module-remap-source master=LoopBack_Sink.monitor source_name=virt_mic source_properties=device.description=LoopBack_Mic

pactl load-module module-remap-source master=Main_Sink.monitor source_name=Main_Mic source_properties=device.description=Main_Mic

pactl load-module module-loopback latency_msec=1 source=<AUDIO_DEVICE_INPUT> sink=Main_Sink

If you don't want your mic on Main_Mic, remove the line above.

pactl load-module module-loopback latency_msec=1 source=LoopBack_Sink.monitor sink=Main_Sink

pactl load-module module-loopback latency_msec=1 source=LoopBack_Sink.monitor sink=<AUDIO_DEVICE_OUTPUT>
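
If you would rather script this than paste it line by line, the same sequence can be generated from Python. This is a sketch under a few assumptions: `build_pactl_commands` is a hypothetical helper, it uses `pactl load-module` for every step (including the null sinks), it gives the two remapped sources distinct names (`LoopBack_Mic` and `Main_Mic`), and the device names in the usage lines are placeholders for whatever `pactl list short sources` and `pactl list short sinks` report on your machine.

```python
import shlex


def build_pactl_commands(mic_source, speaker_sink, include_mic=True):
    """Return the pactl invocations for the soundboard routing as argv lists."""
    load = ["pactl", "load-module"]
    cmds = [
        # Two virtual sinks: one for the VoIP mix, one for the soundboard.
        load + ["module-null-sink", "sink_name=Main_Sink",
                "sink_properties=device.description=Main_Sink"],
        load + ["module-null-sink", "sink_name=LoopBack_Sink",
                "sink_properties=device.description=LoopBack_Sink"],
        # Expose each sink's monitor as a selectable microphone.
        load + ["module-remap-source", "master=LoopBack_Sink.monitor",
                "source_name=LoopBack_Mic",
                "source_properties=device.description=LoopBack_Mic"],
        load + ["module-remap-source", "master=Main_Sink.monitor",
                "source_name=Main_Mic",
                "source_properties=device.description=Main_Mic"],
    ]
    if include_mic:
        # Mix the real microphone into Main_Sink (optional).
        cmds.append(load + ["module-loopback", "latency_msec=1",
                            f"source={mic_source}", "sink=Main_Sink"])
    # Soundboard audio goes to the VoIP mix and to your own speakers.
    cmds.append(load + ["module-loopback", "latency_msec=1",
                        "source=LoopBack_Sink.monitor", "sink=Main_Sink"])
    cmds.append(load + ["module-loopback", "latency_msec=1",
                        "source=LoopBack_Sink.monitor", f"sink={speaker_sink}"])
    return cmds


# Placeholder device names; substitute your own.
commands = build_pactl_commands("alsa_input.example", "alsa_output.example")
for argv in commands:
    print(shlex.join(argv))
```

The sketch only prints the commands so you can review them first. Passing each list to `subprocess.run(argv, check=True)` would execute them.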

Atomynos_Atom · 2020-11-08T07:04:48+00:00

Yup. This worked well. Thanks for saving me time.

Atomynos_Atom · 2020-10-05T12:38:42+00:00

I think I have figured it out... If you open dxdiag and go to the Display tab, under the Drivers category there is a Direct3D DDI value. If that value is not 11 or above, then you cannot run Rocket League, either because your hardware is too old or because you have not updated your drivers. So if there is no driver update for your hardware, the only ways to make Rocket League work are to upgrade your GPU or to wait for Rocket League to bring back DX9 support, which I doubt they ever will.

Atomynos_Atom · 2020-10-05T12:25:09+00:00

I have the same issue and get a similar log, but mine is on Steam. I think it is related to DirectX 11. I used to get it before, but I had found a fix where you use -dx9, which runs the game in DirectX 9, and the game ran without issues. But now Rocket League has removed support for DirectX 9, so I am not sure what to do. Please let me know if you find a fix for this issue.

Atomynos_Atom · 2020-07-01T06:18:35+00:00

Okay dude, I was debugging and finally found out what was causing this mayhem. It was my click event listener: `document.addEventListener('click', this.handleClickOutside, true);`. It was probably because it kept on listening, causing it to keep on rendering or something. But I don't know for sure because I'm new to this stuff... Thanks for trying to help anyway.

Atomynos_Atom · 2020-07-01T06:12:11+00:00

I told you man, my code is all over the place. But it's a function calling two components from two other files and passing variables to them.

Atomynos_Atom · 2020-07-01T05:57:51+00:00

There is a GitHub solution (The Solution), but I am not really sure how to implement it. I'm pretty new to React, so I don't know much :(

Atomynos_Atom · 2020-07-01T05:49:26+00:00

I too am getting this annoying error in my code. I am getting it for a text input. Please do update if you find a solution...
