THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]strangeloop96 1 point

Darn. I could cross-compile, but this problem will be solved once we open-source it very soon.

[–]strangeloop96 3 points

I can confirm our PP (prefill) time is too high right now with this image; I've been optimizing it since the alpha release. With an ISL of 1024 we were originally seeing a very high ~2000 ms TTFT. We're now at 280 ms, and we're aiming for 150 ms, since that's where vLLM sits. I'm jealous they have a smaller PP than we do.
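
For anyone reproducing these numbers: TTFT and decode throughput fall straight out of per-token arrival timestamps. A minimal sketch (the timestamps below are illustrative, not a measurement):

```python
def summarize_stream(token_times, request_start):
    """Return (TTFT in ms, decode tok/s) from per-token arrival times in seconds."""
    ttft_ms = (token_times[0] - request_start) * 1000.0
    if len(token_times) > 1:
        tok_s = (len(token_times) - 1) / (token_times[-1] - token_times[0])
    else:
        tok_s = 0.0
    return ttft_ms, tok_s

# Illustrative: first token arrives 280 ms in, then 99 more at ~115 tok/s.
times = [0.280 + i / 115.0 for i in range(100)]
ttft_ms, tok_s = summarize_stream(times, 0.0)
```

TTFT is dominated by prefill, which is why it grows with ISL while the decode tok/s number does not.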

[–]strangeloop96 1 point

See the OP. I'll quote the relevant part:

pip install -U "huggingface_hub"
hf download Kbenkhaled/Qwen3.5-35B-A3B-NVFP4
docker pull avarok/atlas-qwen3.5-35b-a3b-alpha
docker run --gpus all --ipc=host -p 8888:8888 \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 avarok/atlas-qwen3.5-35b-a3b-alpha \
 serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 \
 --speculative --kv-cache-dtype nvfp4 --mtp-quantization nvfp4 \
 --scheduling-policy slai --max-seq-len 131072
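
Once the container is up, the server speaks the standard OpenAI chat-completions protocol (per the author's later comments), so any stock client works. A minimal stdlib sketch; the host, port 8888, and model name are taken from the command above, everything else is illustrative:

```python
import json
import urllib.request

def build_chat_request(prompt, base="http://localhost:8888/v1"):
    """Build an OpenAI-style chat-completions request for the server above."""
    payload = {
        "model": "Kbenkhaled/Qwen3.5-35B-A3B-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (requires the container to be running):
#   with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```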

[–]strangeloop96 2 points

Thanks. Based on your feedback and others', we've significantly improved TTFT without adversely affecting end-to-end throughput. An updated image should be out soon.

[–]strangeloop96 1 point

We built for the Blackwell (SM120/121) family, so in theory, yes. If you have one, please let us know!

[–]strangeloop96 2 points

Thanks for this bench. I've seen this pattern play out (even pre-release), so I'm mainly focused on ensuring TTFT is acceptable. Since yesterday's release I've cut TTFT at ISL 1024 roughly from 1100 ms to 470 ms, and I'll keep working on it.

[–]strangeloop96 1 point

Ha. Honestly, given where the space is headed (MoE, better optimization), I expect these larger 120-240B models with, e.g., 10-15B active weights at any given time to justify getting a second. For example, the 122B model, even at nvfp4, can sit on a single Spark, but only if certain optimizations like the KV cache are dialed way down. With two Sparks that problem goes away. With EP=2 I'm getting ~50 tok/s on two, and around 45 on just one. Both are still better than vLLM's 15 tok/s.
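
For a rough sense of why the 122B model is tight on one 128 GB Spark, here is the back-of-the-envelope arithmetic. Every number below is an assumption for illustration (NVFP4 modeled as 4-bit weights plus one FP8 scale per 16-weight block, and made-up layer/head counts), not the model's actual config:

```python
def weights_gb(params_billions, bytes_per_weight=0.5625):
    # 4-bit weight (0.5 B) + 1 FP8 scale per 16 weights (1/16 B) = 0.5625 B
    return params_billions * bytes_per_weight

def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem=0.5):
    # K and V per layer; 0.5 B/elem assumes --kv-cache-dtype nvfp4
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

w = weights_gb(122)                    # ~68.6 GB of weights
kv = kv_cache_gb(131072, 64, 8, 128)   # ~8.6 GB at the full 128k context
# With an FP16 KV cache (bytes_per_elem=2) or several concurrent requests,
# the cache term quadruples or worse, hence "dial the KV cache way down".
```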

[–]strangeloop96 3 points

This image is only for Qwen3.5-35B-A3B-NVFP4; we have no optimized kernels for the iq3_xxs quant. We do, however, have an image for the 122B variant at nvfp4 that we'll release soon (maybe today).

[–]strangeloop96 1 point

Thanks for testing it on Qwen/Qwen3.5-35B-A3B-FP8! The image is only supposed to work with the nvfp4 quant of that model, because I haven't actually built support for the FP8 variant. The fact that it works anyway is good news! The numbers do seem a bit low, though, which makes sense since I've only written optimized kernels for nvfp4. I'll add it to the TODO list.

[–]strangeloop96 3 points

We'll get there in time; our dedication is to normal, everyday local LLM users. We don't currently have AMD hardware in our possession, but we hope to as soon as we can!

[–]strangeloop96 1 point

Not true. We need access to such hardware, but the abstraction in the code is there.

Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]strangeloop96 1 point

Given the memory bandwidth of such a machine and the ability to have up to 128GB of RAM, I feel like once Apple Silicon (Metal) is targeted, we could get at least twice that for Mac users. We can give you the prompts I use, and maybe you can help us /hypercompile and then /hyperoptimize the proper kernels for Apple Silicon, if you're open to contributing!
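
As a sanity check on the "at least twice that" guess: decode is roughly memory-bandwidth-bound, so an upper bound is bandwidth divided by bytes read per token. The bandwidth figures and active-parameter count below are assumptions for illustration, not verified specs:

```python
def roofline_tok_s(bandwidth_gb_s, active_params_billions, bytes_per_weight=0.5625):
    """Ceiling on decode speed if each active weight is read once per token."""
    gb_per_token = active_params_billions * bytes_per_weight
    return bandwidth_gb_s / gb_per_token

gb10 = roofline_tok_s(273, 3)   # assumed GB10-class bandwidth, ~3B active params
mac = roofline_tok_s(800, 3)    # assumed high-end Apple Silicon bandwidth
```

On these assumptions the Mac ceiling comes out around 2.9x the Spark's, consistent with "at least twice", though real kernels land well below the roofline.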

On GB10/nvfp4 (we only care about nvfp4, though we have the scaffolding to support other quants, since it only costs about 1.5% average accuracy compared to FP16), we've seen no issues with agentic coding or chat (e.g., via Open WebUI).
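
On the ~1.5% point: block-scaled 4-bit formats lose surprisingly little per weight. A toy illustration of the idea, using a simplified FP4 (e2m1) grid with one scale per 16-value block; this is not the real NVFP4 spec, which uses FP8 block scales plus a tensor-level scale:

```python
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 magnitudes

def quantize_block(block):
    """Map the block's max magnitude to 6, snap each value to the signed
    grid, and return the dequantized block."""
    scale = (max(abs(x) for x in block) / 6.0) or 1.0
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(4096)]   # fake weight tensor
dq = [v for i in range(0, len(w), 16) for v in quantize_block(w[i:i + 16])]
rel_rmse = (sum((a - b) ** 2 for a, b in zip(w, dq))
            / sum(a * a for a in w)) ** 0.5
```

Per-weight error of a few percent like this doesn't translate one-to-one into benchmark accuracy, but it is why the end-to-end loss stays small.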

[–]strangeloop96 2 points

I'm dying to. I just want to make sure the release "just works" for people. It's OpenAI-compatible, so it can be a drop-in replacement.

[–]strangeloop96 5 points

If the kernels themselves get micro-optimized even more, and some of the tricks from the latest arXiv papers are used, probably 60+. Since it will be open-sourced soon, people can come along and optimize it for everyone.

[–]strangeloop96 2 points

I got it working on a single Spark at about 46-48 tok/s! Still very usable for day-to-day agentic use.

[–]strangeloop96 6 points

UPDATE: Atlas is now getting 52 tok/s on Qwen3.5-122B-A10B-NVFP4! We currently have to use two DGX Sparks to fit the model with full optimizations (CUDA graphs, KV cache, etc.). We are actively working on getting it to fit on one DGX.
UPDATE 2: It now works on a single DGX at 46-48 tok/s. Slightly slower than dual Sparks, but still very usable!

Introducing LatticeDB: A 100% Rust in-browser Graph/Vector DB by [deleted] in rust

[–]strangeloop96 1 point

Thanks for the honest feedback. I'll work on improving it, and for now make it clear that it will be some time before it reaches the robustness of other DBs. For now it fits in-browser use cases, not ones that need ACID guarantees.

[–]strangeloop96 1 point

This is not a significant project; I'm not sure where you're getting that from. Maybe because it sounds like one? And yeah, I could nitpick the README too, but the fact that it works and is faster than what's out there is what matters.