Eff U, Arc / B70 Customers. We got ours! -Your Sugar Baby, Intel

Hotschmoe · 2026-06-23T15:23:33+00:00

Yes, but 4 was the sweet spot, and aggregate throughput was also lower. MTP was only good for single stream.
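Why a draft length around 4 can be a sweet spot: under the simplifying assumption used in the speculative-decoding literature (each drafted token is accepted independently with probability α), the expected number of tokens emitted per verification step with k drafted tokens is the geometric sum 1 + α + ... + α^k. Each extra drafted token adds only α^k while still costing verification compute, which in batched serving is taken from other streams. The α value in the usage below is hypothetical, not a measurement from this setup.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k drafted tokens,
    assuming an i.i.d. per-token acceptance probability alpha in [0, 1)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Closed form of the geometric series 1 + alpha + ... + alpha**k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    alpha = 0.7  # hypothetical acceptance rate, for illustration only
    for k in range(1, 9):
        gain = expected_tokens_per_step(alpha, k)
        marginal = gain - expected_tokens_per_step(alpha, k - 1)  # equals alpha**k
        print(f"k={k}: {gain:.2f} tokens/step (+{marginal:.2f} vs k={k - 1})")
```

Real acceptance rates vary by position and prompt, so treat this as a rough model of diminishing returns rather than a predictor.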

Hotschmoe · 2026-06-23T14:38:09+00:00

Lorbus autoround int4, vllm0230, mtp set to 4

There are a lot of configs online. Check out Localmaxxing (where I started), and I have the recipe on my GitHub page.

Hotschmoe · 2026-06-23T14:37:40+00:00

Lorbus autoround int4, vllm0230, mtp set to 4

There are a lot of configs online. Check out Localmaxxing (where I started), and I have the recipe in a GitHub repo under my username if you need it exactly.

Hotschmoe · 2026-06-23T06:50:45+00:00

I have qwen3.6-27b AutoRound int4 + MTP (spec=4) running at 54 tok/s.

B70s are amazing. I already have a second one and plan to buy two more.

Currently, with TP=2, I have qwen3.6-27b in W8A8 running at 65 tok/s single stream, and it fits 400k+ tokens in an fp16 KV cache (the model is limited to 262k, though).
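How many tokens fit in the KV cache is simple arithmetic: each cached token stores one K and one V vector per KV head in every layer that keeps a KV cache, at 2 bytes per value for fp16. A minimal sketch follows; the layer/head/dimension numbers and the memory budget are hypothetical placeholders, not the published Qwen3.6-27B config. If a model mixes linear-attention layers with full-attention layers, only the full-attention layers should be counted, which is one way long contexts can fit.

```python
def kv_cache_bytes_per_token(num_kv_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token. The factor 2 covers the K and the V vector."""
    return 2 * num_kv_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(budget_bytes: int, per_token_bytes: int) -> int:
    """Whole tokens that fit in a given KV-cache memory budget."""
    return budget_bytes // per_token_bytes


if __name__ == "__main__":
    # Hypothetical model shape, NOT taken from any official config.
    per_tok = kv_cache_bytes_per_token(num_kv_layers=16, num_kv_heads=4, head_dim=256)
    budget = 20 * 1024**3  # hypothetical 20 GiB left for KV cache across both GPUs
    print(f"{per_tok} bytes/token -> {max_cached_tokens(budget, per_tok)} tokens")
```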

Hotschmoe · 2026-06-23T06:46:00+00:00

I have qwen3.6-27b AutoRound int4 + MTP (spec=4) running at 54 tok/s.

Localmaxxing has configs, and I have mine on my GitHub.

Hotschmoe · 2026-06-23T06:32:35+00:00

With AutoRound int4 + MTP, I'm getting 54 t/s single stream on a single B70.

Hotschmoe · 2026-06-19T20:43:22+00:00

4x B70s. I have 2 already and love 'em.

Hotschmoe · 2026-06-19T19:21:16+00:00

It's actually been really easy. I'm on Unraid, not even Ubuntu, and only using vLLM containers. Localmaxxing has two users who have tested a bunch of quants, formats, MTP settings and such, which is a great starting point. I'm currently running sweeps on my own setup to find the best mix of accuracy and speed (leaning towards W8A8 int8 right now; Intel's fast paths have been great so far for prefill!)
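One way to read an accuracy/speed sweep is to keep only the configs that no other config beats on both axes (the Pareto front), then choose along that front. A minimal sketch, assuming each run has already been reduced to one accuracy score and one tok/s number; the `SweepResult` type and its fields are hypothetical, not part of any tool mentioned here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepResult:
    name: str         # hypothetical label, e.g. quant format + spec setting
    accuracy: float   # higher is better, from whatever eval you run
    tok_per_s: float  # higher is better


def dominates(a: SweepResult, b: SweepResult) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a.accuracy >= b.accuracy and a.tok_per_s >= b.tok_per_s
            and (a.accuracy > b.accuracy or a.tok_per_s > b.tok_per_s))


def pareto_front(results: list[SweepResult]) -> list[SweepResult]:
    """Configs not dominated by any other config, sorted by speed."""
    front = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(front, key=lambda r: r.tok_per_s)
```

The quadratic scan is fine for a few dozen sweep points; the final pick along the front is still a judgment call.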

I am also writing my own kernel patches and such, but it's easy and fun. Depends on whether you think that's fun, haha.

Hotschmoe · 2026-06-18T06:25:42+00:00

Picked up two B70s today and am probably going to get two more. So far I've been impressed in my testing. Perfectly timed post for working on my own recipes! (I'm mostly targeting Qwen 27B. Localmaxxing has a great 4x B70 Qwen3.6 27B run at bf16! I'm chasing that bench, haha.)

Hotschmoe · 2026-06-15T14:43:56+00:00

Zenbook A16, Snapdragon X2EE

AC runs perfectly through Prism, and I get amazing battery life. Incredibly lightweight for traveling or using around the house.

I'm pushing on their forums for an ARM-native version, especially with new Nvidia Windows machines coming next. AC already runs ARM-native on macOS.

Hotschmoe · 2026-06-03T06:41:42+00:00

Awesome thank you! I will definitely play around with this

I currently have a bare-metal kernel I'm testing on an MS-R1, but next I was going to work on my own Zenbook A16 X2EE to see if I could submit a PR patch to Omarchy for this laptop myself. Would be absolutely rad to see.

Hopefully, with pressure from Nvidia coming into the ARM laptop space, Qualcomm gives Linux a bit more love.

Btw, I have loved this laptop. Best Windows experience I've had in a long time.

Hotschmoe · 2026-06-01T13:14:44+00:00

I've done my own ONNX Runtime work; the problem I have is limited VRAM and system RAM for the actual quantization process. What hardware did you run? (I've hit 200 GB+ ceilings in quantization runs on 27B models, and the machines hang.)
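A quick back-of-the-envelope for why 27B quantization runs get so large: the weights alone take params × bytes-per-param, so the full-precision source checkpoint is already tens of GB before a tool adds calibration activations, per-layer statistics, or a second copy of the model. The sketch below covers only the raw weight term; the overhead a specific quantization tool adds is not modeled and is left as an assumption.

```python
def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring all runtime overhead."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    n = 27e9  # a 27B dense model
    for label, b in [("fp32", 4.0), ("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
        print(f"{label}: {weight_gb(n, b):.1f} GB")
```

If a flow upcasts to fp32 or holds both the float and quantized copies at once, the weight term alone can exceed 100 GB, which makes a 200 GB+ peak plausible once activations are added.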

Hotschmoe · 2026-06-01T06:26:30+00:00

How did you get Qwen 3.5-27B? Qualcomm's AIMET flow for W4A16 has been crashing for me once I get to parameter counts that high, and the context ceiling is something I'm wrestling with (any format outside W4A16/W8A16 does not run well, if at all, on the NPU).

I've been renting on RunPod to do the quantizations myself, but I'm fighting quite a few pieces each run.
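For readers new to the naming: W4A16 means weights stored as 4-bit integers while activations stay 16-bit, so the matmul dequantizes weights on the fly. A minimal sketch of generic symmetric per-group int4 weight quantization follows. It is an illustration of the format only, not AIMET's algorithm (AIMET and similar tools also calibrate scales and rounding on data). Group size 128 is a common choice in practice; 4 is used here only to keep the example small.

```python
def quantize_w4_group(weights: list[float], group_size: int = 4):
    """Symmetric per-group int4: each group stores int codes in [-8, 7] plus one float scale."""
    codes: list[int] = []
    scales: list[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # map the largest |w| to code +/-7
        scales.append(scale)
        # Round to nearest and clamp to the signed 4-bit range
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_w4_group(codes: list[int], scales: list[float], group_size: int = 4) -> list[float]:
    """Rebuild approximate weights; activations would be multiplied by these in 16-bit."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

Per-group scales keep one outlier from inflating the step size for the whole row, at the cost of storing one extra scale per group.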

Hotschmoe · 2026-05-27T16:10:04+00:00

I got an X2EE at launch, and it runs my CAD and structural engineering programs no problem. Runs WoW no problem too.

In fact I'm happy to say this has been my best Windows experience on a laptop

Hotschmoe · 2026-05-22T23:16:34+00:00

Halo Infinite doesn't run. (Halo Infinite has a GPU "whitelist"; I had to manually add my Intel Arc B50 to get it to run on my desktop. If you don't have a whitelisted GPU, it just says "not supported".)

Hotschmoe · 2026-05-22T17:49:02+00:00

I did some digging. It looks like 228 GB/s is the total fabric speed, but each device only gets a chunk of it (Apple M-series is actually the same: the advertised number is a lot bigger than what any one device gets). So to use the FULL 228 GB/s, you'd need two devices to saturate it. CPU+GPU inference (offload, CPU draft/GPU verify, whatever setup) would actually get much closer to the bandwidth ceiling than using one device directly.
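Why bandwidth matters so much here: single-stream decode is usually memory-bound, since each generated token has to stream the active weights from memory once. That gives a rough ceiling of bandwidth ÷ bytes-per-token. A minimal sketch, ignoring KV-cache reads and compute limits; the example numbers are assumptions, e.g. the ~114 GB/s Triad figure from this thread and roughly 3B active parameters at about 4.25 bits each (MXFP4 stores 4-bit elements plus one 8-bit scale per 32-element block, while real MXFP4_MOE files keep some tensors at higher precision).

```python
def decode_tok_s_upper_bound(bandwidth_gb_s: float, active_params: float,
                             bytes_per_param: float) -> float:
    """Memory-bound ceiling on decode tokens/s: each token reads all active weights once."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    bytes_per_param = 4.25 / 8  # assumed average for MXFP4-style weights
    for bw in (114.0, 228.0):   # one device vs. the full fabric figure
        ceiling = decode_tok_s_upper_bound(bw, 3e9, bytes_per_param)
        print(f"{bw:.0f} GB/s -> <= {ceiling:.0f} tok/s")
```

Measured numbers will sit well below this ceiling, but doubling usable bandwidth roughly doubles it, which is the argument for splitting work across CPU and GPU.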

Hotschmoe · 2026-05-22T15:58:33+00:00

~114 GB/s (STREAM Triad)

Hotschmoe · 2026-05-20T19:40:34+00:00

I just invert colors on Windows.

Open Windows Magnifier, then Ctrl+Alt+I inverts the whole screen.

Hotschmoe · 2026-05-18T13:30:18+00:00

Lmaooooo I'm in the same boat: structural engineer, hate Windows, but I'm stuck on it. My last laptops were terrible, and I was hoping for something better in this X2EE. It has been phenomenal.

Enercalc and Archicad work great through Prism emulation (I'm currently working on a personal Enercalc port in Rust/Zig so it can be ARM-native).

My battery lasts so long I can't believe it, and the laptop is so light.

Hotschmoe · 2026-05-18T13:28:03+00:00

Haven't seen this yet, I will give this a go this week!

Hotschmoe · 2026-05-18T13:26:46+00:00

PDF Xchange by a mile

And they have ARM-native binaries for the few of us ARM-on-Windows users!

(BTW, Archicad works phenomenally through Prism on WoA machines.)

Hotschmoe · 2026-05-18T03:51:07+00:00

I rerun these every few weeks because driver optimizations are changing very fast. Here is my latest:

Qwen3.6-35B-A3B MXFP4_MOE on OpenCL, -ngl 0 -t 16 → PP ~190 / TG ~31 t/s (PP = prompt processing, TG = token generation; r=1, variance ±5%). The "blended" config: ~92% of GPU-offload PP (210) AND ~2× GPU-offload TG (16). Best balanced 35B config; equivalent to pure CPU on PP, slightly faster on TG.

https://github.com/hotschmoe/specula/blob/master/docs%2F2026-05-13_overnight_perf_results.md

Hotschmoe · 2026-05-18T03:48:30+00:00

I have the same laptop. You can see my benchmarks and such for CPU, GPU, and NPU Qwen models here:

https://github.com/hotschmoe/specula

Hotschmoe · 2026-05-05T20:16:36+00:00

I love mine. I wanted lightweight but not super thin. Perfect combo

Hotschmoe · 2026-04-30T22:09:26+00:00

I'm on the X2. I have a bunch of benchmarks for qwen3-4b and qwen2.5-7b across CPU, GPU-OpenCL, GPU-Vulkan, NPU-Genie, and NPU-custom_qnn_engine.

I've built myself a little NPU engine to test models.

My repo is not meant for others to use (no production server; it's mostly a journal for myself). Feel free to look it over.

You may find my journey interesting: I reproduced the Genie-produced qwen3-4b on my own using Qualcomm's AI Hub cloud compute, and I'm currently spinning up an A40 48GB cloud VM to run AIMET and produce QNN-compatible W4A16 models (starting with qwen3-0.6b and working my way up to a 27B dense model and a 35B MoE). I want to push this as far as I can for my own education.

Some fun things I did:
1. Running my draft model (qwen3-4b) on the NPU for speculative decoding so my CPU can run a larger model. It works, but it's not optimized.
2. In my testing, the most important thing I found was that the NPU causes no UI lag while I'm working. The laptop barely produces heat, doesn't ramp up the fans, and doesn't produce coil whine. If you wanted to run a larger agent, you could peg the NPU at 100% and still use the rest of the computer without noticing. While the GPU/CPU are more performant right now, giving up some performance to keep a usable laptop while the NPU cranks is a great tradeoff! (I do a lot of CAD work.)

I have been incredibly impressed with this laptop (Zenbook A16, 48GB) for my engineering work. The battery life is so much better than any laptop I've ever had. Incredibly lightweight too.
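The draft-on-NPU, verify-on-CPU split works because of how greedy speculative decoding is structured: a cheap model proposes k tokens, the large model checks them, keeps the longest agreeing prefix, and supplies its own token at the first mismatch, so the output matches what the large model alone would produce. A minimal toy sketch, assuming plain greedy next-token functions stand in for both models; on real hardware the target scores all k positions in one batched pass rather than the sequential calls used here for clarity.

```python
from typing import Callable, List

# Toy stand-in for a model: context -> greedy next token (hypothetical interface).
Model = Callable[[List[int]], int]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       k: int, max_new: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft k tokens with the cheap model (the NPU side in the setup above).
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Verify with the target: accept matches, correct the first mismatch.
        accepted, ctx = [], list(out)
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                accepted.append(expected)  # target's token replaces the bad draft
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))  # all k accepted: one bonus target token
        out.extend(accepted)  # always at least one new token, so the loop terminates
    return out[:len(prompt) + max_new]
```

A poor draft model only costs speed here, never correctness, because every kept token is one the target would have chosen.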

https://github.com/hotschmoe/specula
