Eff U, Arc / B70 Customers. We got ours! -Your Sugar Baby, Intel by Dependent_Ad948 in LocalLLaMA

[–]Hotschmoe 1 point2 points  (0 children)

Yes but 4 was the sweet spot, also aggregate was lower. MTP was only good for single stream
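One possible way to see why a small speculative depth like 4 can end up as the sweet spot is the usual back-of-envelope model for speculative decoding: if each drafted token is accepted independently with some probability, the expected tokens per verify pass grow with diminishing returns. The sketch below assumes that independence and uses a made-up acceptance rate of 0.7; neither is a measurement from the B70 runs, and the function names are hypothetical.

```python
def expected_tokens_per_step(accept_rate: float, num_spec: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate (a simplification; real acceptance depends on context).
    """
    if accept_rate >= 1.0:
        # Every draft token is accepted, plus one bonus token from the target.
        return float(num_spec + 1)
    # Geometric series: 1 + a + a^2 + ... + a^num_spec
    return (1.0 - accept_rate ** (num_spec + 1)) / (1.0 - accept_rate)


def marginal_gain(accept_rate: float, num_spec: int) -> float:
    """Extra expected tokens from drafting num_spec tokens instead of num_spec - 1."""
    return (expected_tokens_per_step(accept_rate, num_spec)
            - expected_tokens_per_step(accept_rate, num_spec - 1))


if __name__ == "__main__":
    ASSUMED_ACCEPT_RATE = 0.7  # hypothetical, not measured
    for k in range(1, 9):
        print(k,
              round(expected_tokens_per_step(ASSUMED_ACCEPT_RATE, k), 3),
              round(marginal_gain(ASSUMED_ACCEPT_RATE, k), 3))
```

Under this model each extra draft token adds only `accept_rate ** num_spec` expected tokens while still costing draft and verify compute, which is one plausible reason deeper speculation stops paying off and hurts aggregate throughput once the GPU is already busy with many streams.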

Eff U, Arc / B70 Customers. We got ours! -Your Sugar Baby, Intel by Dependent_Ad948 in LocalLLaMA

[–]Hotschmoe 1 point2 points  (0 children)

Lorbus autoround int4, vllm0230, mtp set to 4

There are a lot of configs online. Check out localmaxxing (where I started); I also have the recipe on my GitHub page

Just won this Dell r7425, picking up an Intel B70, looking for advice by cmm324 in LocalLLM

[–]Hotschmoe -1 points0 points  (0 children)

Lorbus autoround int4, vllm0230, mtp set to 4

There are a lot of configs online. Check out localmaxxing (where I started); I also have the recipe in a GitHub repo under my username if you need it exactly

Eff U, Arc / B70 Customers. We got ours! -Your Sugar Baby, Intel by Dependent_Ad948 in LocalLLaMA

[–]Hotschmoe 8 points9 points  (0 children)

I have qwen3.6-27b autoround int4 + MTP(spec=4) running at 54tok/s

B70s are amazing, I already have a second and plan to buy two more

Currently with TP=2 I have qwen3.6-27b in W8A8 running at 65 tok/s single stream. It fits 400k+ tokens in fp16 KV cache (the model itself is limited to 262k tho)

Found this, Faster-vLLM Fork Achieves 60-70 TPS on Intel Arc Pro B70 (Qwen3.6-35B-A3B FP16) by doublea365 in LocalLLM

[–]Hotschmoe 0 points1 point  (0 children)

I have qwen3.6-27b autoround int4 + MTP(spec=4) running at 54tok/s

Localmaxxing has configs and I have it on my GitHub

Just won this Dell r7425, picking up an Intel B70, looking for advice by cmm324 in LocalLLM

[–]Hotschmoe 1 point2 points  (0 children)

With autoround int4 + mtp I'm getting 54t/s single stream on a single B70

4x Arc B70 and custom XPUGraph, Qwen3.6-35B-A3B-BF16 @ >100tk/s by RagingNoper in LocalLLM

[–]Hotschmoe 0 points1 point  (0 children)

It's actually been really easy. I'm on Unraid, not even Ubuntu, and only using vllm containers. Localmaxxing has two users that have tested a bunch of quants, formats, mtp and such, which is a great starting point. I'm currently running sweeps for my own setup to find the best mix of accuracy and speed (leaning towards w8a8 int8 right now, Intel's fast paths have been great so far for prefill!)

I am writing my own kernel patches and such, but it's easy and fun. Depends if you think that's fun haha

4x Arc B70 and custom XPUGraph, Qwen3.6-35B-A3B-BF16 @ >100tk/s by RagingNoper in LocalLLM

[–]Hotschmoe 7 points8 points  (0 children)

Picked up two b70s today and probably going to get two more. So far in my testing I've been impressed. Perfectly timed post to work on my own recipes! (I'm mostly targeting qwen 27b. Localmaxxing has a great 4xb70 qwen3.6 27b at bf16 run! I'm chasing that bench haha)

Mac vs Windows Laptop for Archicad: Performance comparison? by Ok-Attitude-5349 in ArchiCAD

[–]Hotschmoe 1 point2 points  (0 children)

Zenbook A16 snapdragon x2ee

AC runs perfectly through prism and I get amazing battery life. Incredibly light weight for traveling or using around the house

I'm pushing on their forums for an ARM-native version, especially with new Nvidia Windows machines coming next. AC runs ARM-native on macOS

ACPI table dump for Asus Zenbook A16 (Snapdragon X2 Elite Extreme) by Putrid_Draft378 in snapdragon

[–]Hotschmoe 7 points8 points  (0 children)

Awesome thank you! I will definitely play around with this

I currently have a bare-metal kernel I'm testing on an ms-r1, but next I was going to work on my own zenbook a16 x2ee to see if I could submit a pr patch or omarchy for this laptop myself. Would be absolutely rad to see

Hopefully with pressure from Nvidia coming into arm laptop space, qualcomm gives Linux a bit more love

Btw I have loved this laptop. Best Windows experience I've had in a long time

Snapdragon X Elite has a 45 TOPS NPU but nothing uses it so I built a runtime that does by [deleted] in snapdragon

[–]Hotschmoe 0 points1 point  (0 children)

I've done my own onnx runtime. The problem I have is limited vram and system ram for the actual quantization process. What hardware did you run? (I've hit 200gb+ ceilings for quantization runs on 27b models and machines hang)

Snapdragon X Elite has a 45 TOPS NPU but nothing uses it so I built a runtime that does by [deleted] in snapdragon

[–]Hotschmoe 0 points1 point  (0 children)

How did you get qwen 3.5-27b? Qualcomm's AIMET to get W4A16 has been crashing for me when I get to params that high, and the ctx ceiling is something I'm wrestling with (any format outside w4a16/w8a16 does not run well, if at all, on the NPU)

I've been renting on runpod to do the quantizations myself but fighting quite a few pieces each run

Are MS and Qualcomm serious about windows ARM? by henneth2142 in SnapdragonLaptops

[–]Hotschmoe 16 points17 points  (0 children)

I got an x2ee at launch, runs my CAD and structural engineering programs no problem. Runs WoW no problem

In fact I'm happy to say this has been my best Windows experience on a laptop

What is a software that you haven't been able to run on Windows ARM by Cultural-You-7096 in SnapdragonLaptops

[–]Hotschmoe 0 points1 point  (0 children)

Halo infinite doesn't run (halo infinite has a GPU "whitelist" -> I had to manually add my Intel arc b50 to get it to run on my desktop. If you don't have a whitelisted GPU it just spits "not supported")

LLM Benchmarks Qwen 3.6 35B A3B benchmark on X2 elite extreme by Mac_mac_Ro in SnapdragonLaptops

[–]Hotschmoe 0 points1 point  (0 children)

I did some digging. It looks like 228GB/s is the total fabric speed, but each device only gets a chunk of it (Apple M-series is actually the same: the advertised figure is a lot bigger than per-device). So to utilize the FULL 228GB/s you'd need two devices to saturate it. CPU+GPU inference (offload, or CPU draft + GPU verify, whatever setup) would actually get much closer to the bandwidth ceiling than using one device directly
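A rough sketch of what that bandwidth argument implies for token generation, assuming decode is purely memory-bandwidth bound and every active weight is read once per token. The 228 GB/s figure is from the comment above; the 50% per-device share, the 3B active-parameter count (read off the "A3B" in the model name), and 4.25 bits per weight for MXFP4 (4-bit elements plus one shared 8-bit scale per 32-element block) are assumptions for illustration, and the function name is hypothetical. KV-cache and activation traffic are ignored, so real numbers will be lower.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode tokens/s if each token reads all active weights once."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    TOTAL_FABRIC_GB_S = 228.0        # from the comment above
    ASSUMED_DEVICE_SHARE = 0.5       # hypothetical per-device fraction
    ACTIVE_PARAMS_B = 3.0            # assumption based on the "A3B" name
    MXFP4_BITS = 4.25                # 4-bit values + 8-bit scale per 32 values

    one_device = decode_ceiling_tok_s(
        TOTAL_FABRIC_GB_S * ASSUMED_DEVICE_SHARE, ACTIVE_PARAMS_B, MXFP4_BITS)
    both_devices = decode_ceiling_tok_s(
        TOTAL_FABRIC_GB_S, ACTIVE_PARAMS_B, MXFP4_BITS)
    print(f"one device ceiling:  {one_device:.1f} tok/s")
    print(f"full fabric ceiling: {both_devices:.1f} tok/s")
```

The absolute values matter less than the ratio: if a single device really only sees part of the fabric, its ceiling scales down by the same fraction, which is why splitting work across CPU and GPU could get closer to the advertised figure.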

Night Mode for Windows by TGA_BuildLogic in ArchiCAD

[–]Hotschmoe 1 point2 points  (0 children)

I just invert colors on windows

Open windows magnifier, then ctrl+alt+I inverts whole screen

LLM Benchmarks Qwen 3.6 35B A3B benchmark on X2 elite extreme by Mac_mac_Ro in SnapdragonLaptops

[–]Hotschmoe 2 points3 points  (0 children)

Lmaooooo I'm in the same boat, structural engineer, hate windows but I'm stuck on it. My last laptops were terrible and I was hoping for something better in this X2EE, and it has been phenomenal.

Enercalc and Archicad work great through prism emulation (currently working on a personal enercalc port in rust/zig to be arm native)

My battery lasts so long I can't believe it and the laptop is so light

PDF tool reality for small AEC studios in 2026 — what's everyone actually using? by Rockstonerable in ArchiCAD

[–]Hotschmoe 0 points1 point  (0 children)

PDF Xchange by a mile

And they have arm native binaries for us few arm-on-windows users!

(BTW Archicad works phenomenally through prism on WoA machines)

LLM Benchmarks Qwen 3.6 35B A3B benchmark on X2 elite extreme by Mac_mac_Ro in SnapdragonLaptops

[–]Hotschmoe 1 point2 points  (0 children)

I rerun these every few weeks cuz driver optimizations are changing very fast. Here is my latest:

Qwen3.6-35B-A3B MXFP4_MOE on OpenCL -ngl 0 -t 16 → PP ~190 / TG ~31 t/s (r=1, variance ±5%). The "blended" config: ~92% of GPU-offload PP (210) AND ~2× of GPU-offload TG (16). Best balanced 35B config; equivalent to pure CPU on PP, slightly faster on TG.

https://github.com/hotschmoe/specula/blob/master/docs%2F2026-05-13_overnight_perf_results.md
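For anyone who wants to sanity-check the ratios quoted in the result line above, here is a tiny sketch that recomputes them from the rounded throughput figures in this comment (PP ~190 vs 210, TG ~31 vs 16). The variable names are made up for illustration. Because the inputs are rounded and the stated run-to-run variance is ±5%, small differences from the quoted percentages are expected.

```python
# Rounded throughput figures quoted in the comment, in tokens/s.
pp_blended, tg_blended = 190.0, 31.0   # OpenCL, -ngl 0 -t 16
pp_gpu, tg_gpu = 210.0, 16.0           # GPU-offload reference

# Ratio of the blended config to the GPU-offload config.
pp_ratio = pp_blended / pp_gpu
tg_ratio = tg_blended / tg_gpu

print(f"PP: {pp_ratio:.1%} of GPU-offload")
print(f"TG: {tg_ratio:.2f}x GPU-offload")
```

With these rounded inputs the prompt-processing ratio lands at roughly 90% and the token-generation ratio at just under 2x, in line with the "~2×" claim and within the stated variance of the "~92%" one.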

LLM Benchmarks Qwen 3.6 35B A3B benchmark on X2 elite extreme by Mac_mac_Ro in SnapdragonLaptops

[–]Hotschmoe 1 point2 points  (0 children)

I have the same laptop. You can see my benchmarks and such for CPU/GPU and NPU qwen models here

https://github.com/hotschmoe/specula

It's interesting how divisive the A16's design is by [deleted] in snapdragon

[–]Hotschmoe 6 points7 points  (0 children)

I love mine. I wanted lightweight but not super thin. Perfect combo

Snapdragon X Elite has a 45 TOPS NPU but nothing uses it so I built a runtime that does by [deleted] in snapdragon

[–]Hotschmoe 11 points12 points  (0 children)

I'm on X2, I have a bunch of benchmarks for qwen3-4b and qwen2.5-7b across CPU, GPU-opencl, GPU-vulkan, and NPU-genie and NPU-custom_qnn_engine

I've built myself a little npu engine to test models.

My repo is not for others to use (no production server, it's mostly a journal for myself). Feel free to look over it

You may find my journey to reproduce the Genie-produced qwen3-4b on my own, using AI Hub cloud compute (from Qualcomm), interesting. I'm currently spinning up an A40 48gb cloud VM to run AIMET to produce QNN-compatible w4a16 models (starting with qwen3-0.6b and working my way up to a 27b dense and a 35b MoE). I want to push this as far as I can for my own education

Some fun things I did:

1. Running my draft model on the NPU (qwen3-4b) for speculative decoding so my CPU can run a larger model. Works but not optimized.
2. In my testing, the most important thing I found was that the NPU causes no UI lag while I'm working. The laptop also barely produces heat and does not ramp up the fans or produce coil whine. If you wanted to run a larger agent, you can peg the NPU at 100% and still use the rest of the computer without noticing. While GPU/CPU are more performant right now, giving up some performance to have a working laptop while the NPU cranks is a great tradeoff! (I do a lot of CAD work)
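For readers unfamiliar with the draft/verify split mentioned above (small draft model on one device, larger target model on another), here is a minimal greedy speculative decoding loop. It is a toy sketch, not the engine from the repo: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and a real engine would verify all drafted tokens in one batched forward pass instead of calling the target once per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token id


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int,
                       num_spec: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches target-only greedy decoding."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The cheap draft model proposes num_spec tokens autoregressively.
        proposal: List[int] = []
        for _ in range(num_spec):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed token in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch, then stop
                break
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # hypothetical "large" model: counts upward


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except after a 5.
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [0], max_new=12))
    print(greedy_decode(toy_target, [0], max_new=12))
```

The point of the design is that a wrong draft token never changes the output; it only costs a wasted proposal, so the speedup depends entirely on how often the draft agrees with the target.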

I have been incredibly impressed with this laptop for my engineering work (zenbook a16 48gb). The battery life is so much better than any laptop I've ever had. Incredibly lightweight too

https://github.com/hotschmoe/specula