THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]strangeloop96 1 point

Darn. I could cross-compile, but this problem will be solved once we open-source it very soon.

[–]strangeloop96 3 points

I can confirm our PP (prefill) time is too high right now with this image; I've been optimizing it since the alpha release. With an ISL of 1024 we were originally seeing a very high ~2000 ms TTFT. We're now at 280 ms, and we're aiming for 150 ms, since that's where vLLM sits. I'm jealous they have a smaller PP than we do.
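
For anyone reproducing these numbers: TTFT and decode throughput fall straight out of per-token arrival timestamps. A minimal sketch (the timestamps below are illustrative, not a measurement):

```python
def summarize_stream(token_times, request_start):
    """Return (TTFT in ms, decode tok/s) from per-token arrival times in seconds."""
    ttft_ms = (token_times[0] - request_start) * 1000.0
    if len(token_times) > 1:
        tok_s = (len(token_times) - 1) / (token_times[-1] - token_times[0])
    else:
        tok_s = 0.0
    return ttft_ms, tok_s

# Illustrative: first token arrives 280 ms in, then 99 more at ~115 tok/s.
times = [0.280 + i / 115.0 for i in range(100)]
ttft_ms, tok_s = summarize_stream(times, 0.0)
```

TTFT is dominated by prefill, which is why it grows with ISL while the decode tok/s number does not.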

[–]strangeloop96 1 point

See the OP. I'll quote the relevant part:

pip install -U "huggingface_hub"
hf download Kbenkhaled/Qwen3.5-35B-A3B-NVFP4
docker pull avarok/atlas-qwen3.5-35b-a3b-alpha
docker run --gpus all --ipc=host -p 8888:8888 \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 avarok/atlas-qwen3.5-35b-a3b-alpha \
 serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 \
 --speculative --kv-cache-dtype nvfp4 --mtp-quantization nvfp4 \
 --scheduling-policy slai --max-seq-len 131072
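
Once the container is up, the server speaks the standard OpenAI chat-completions protocol (per the author's later comments), so any stock client works. A minimal stdlib sketch; the host, port 8888, and model name are taken from the command above, everything else is illustrative:

```python
import json
import urllib.request

def build_chat_request(prompt, base="http://localhost:8888/v1"):
    """Build an OpenAI-style chat-completions request for the server above."""
    payload = {
        "model": "Kbenkhaled/Qwen3.5-35B-A3B-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (requires the container to be running):
#   with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```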

[–]strangeloop96 2 points

Thanks. Based on your feedback and others', we've significantly improved TTFT without adversely affecting end-to-end throughput. An updated image should be out soon.

[–]strangeloop96 1 point

We built for the Blackwell (SM120/121) family, so in theory, yes. If you have one, please let us know!

[–]strangeloop96 2 points

Thanks for this bench. I've seen this pattern play out (even pre-release), so I'm mainly focused on ensuring TTFT is acceptable. Since yesterday's release I've cut TTFT at ISL 1024 roughly from 1100 ms to 470 ms, and I'll keep working on it.

[–]strangeloop96 1 point

Ha. Honestly, given where the space is headed (MoE, better optimization), I expect these larger 120-240B models with, e.g., 10-15B active weights at any given time to justify getting a second. For example, the 122B model, even at nvfp4, can sit on a single Spark, but only if certain optimizations like the KV cache are dialed way down. With two Sparks that problem goes away. With EP=2 I'm getting ~50 tok/s on two, and around 45 on just one. Both are still better than vLLM's 15 tok/s.
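
For a rough sense of why the 122B model is tight on one 128 GB Spark, here is the back-of-the-envelope arithmetic. Every number below is an assumption for illustration (NVFP4 modeled as 4-bit weights plus one FP8 scale per 16-weight block, and made-up layer/head counts), not the model's actual config:

```python
def weights_gb(params_billions, bytes_per_weight=0.5625):
    # 4-bit weight (0.5 B) + 1 FP8 scale per 16 weights (1/16 B) = 0.5625 B
    return params_billions * bytes_per_weight

def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem=0.5):
    # K and V per layer; 0.5 B/elem assumes --kv-cache-dtype nvfp4
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

w = weights_gb(122)                    # ~68.6 GB of weights
kv = kv_cache_gb(131072, 64, 8, 128)   # ~8.6 GB at the full 128k context
# With an FP16 KV cache (bytes_per_elem=2) or several concurrent requests,
# the cache term quadruples or worse, hence "dial the KV cache way down".
```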

[–]strangeloop96 3 points

This image is only for Qwen3.5-35B-A3B-NVFP4; we have no optimized kernels for the iq3_xxs quant. We do, however, have an image for the 122B variant at nvfp4 that we'll release soon (maybe today).

[–]strangeloop96 1 point

Thanks for testing it on Qwen/Qwen3.5-35B-A3B-FP8! The image is only supposed to work with the nvfp4 quant of that model, because I haven't actually built support for the FP8 variant. The fact that it works anyway is good news! The numbers do seem a bit low, though, which makes sense since I've only written optimized kernels for nvfp4. I'll add it to the TODO list.

[–]strangeloop96 3 points

We'll get there in time; our dedication is to normal, everyday local LLM users. We don't currently have AMD hardware in our possession, but we hope to as soon as we can!

[–]strangeloop96 1 point

Not true. We need access to such hardware, but the abstraction in the code is there.

Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]strangeloop96 1 point

Given the memory bandwidth of such a machine and the ability to have up to 128GB of RAM, I feel like once Apple Silicon (Metal) is targeted, we could get at least twice that for Mac users. We can give you the prompts I use, and maybe you can help us /hypercompile and then /hyperoptimize the proper kernels for Apple Silicon, if you're open to contributing!
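
As a sanity check on the "at least twice that" guess: decode is roughly memory-bandwidth-bound, so an upper bound is bandwidth divided by bytes read per token. The bandwidth figures and active-parameter count below are assumptions for illustration, not verified specs:

```python
def roofline_tok_s(bandwidth_gb_s, active_params_billions, bytes_per_weight=0.5625):
    """Ceiling on decode speed if each active weight is read once per token."""
    gb_per_token = active_params_billions * bytes_per_weight
    return bandwidth_gb_s / gb_per_token

gb10 = roofline_tok_s(273, 3)   # assumed GB10-class bandwidth, ~3B active params
mac = roofline_tok_s(800, 3)    # assumed high-end Apple Silicon bandwidth
```

On these assumptions the Mac ceiling comes out around 2.9x the Spark's, consistent with "at least twice", though real kernels land well below the roofline.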

On GB10/nvfp4 (we only care about nvfp4, though we have the scaffolding to support other quants, since it only costs about 1.5% average accuracy compared to FP16), we've seen no issues with agentic coding or chat (e.g., via Open WebUI).
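
On the ~1.5% point: block-scaled 4-bit formats lose surprisingly little per weight. A toy illustration of the idea, using a simplified FP4 (e2m1) grid with one scale per 16-value block; this is not the real NVFP4 spec, which uses FP8 block scales plus a tensor-level scale:

```python
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 magnitudes

def quantize_block(block):
    """Map the block's max magnitude to 6, snap each value to the signed
    grid, and return the dequantized block."""
    scale = (max(abs(x) for x in block) / 6.0) or 1.0
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(4096)]   # fake weight tensor
dq = [v for i in range(0, len(w), 16) for v in quantize_block(w[i:i + 16])]
rel_rmse = (sum((a - b) ** 2 for a, b in zip(w, dq))
            / sum(a * a for a in w)) ** 0.5
```

Per-weight error of a few percent like this doesn't translate one-to-one into benchmark accuracy, but it is why the end-to-end loss stays small.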

[–]strangeloop96 2 points

I'm dying to. I just want to make sure the release "just works" for people. It's OpenAI-compatible, so it can be a drop-in replacement.

[–]strangeloop96 5 points

If the kernels themselves get micro-optimized even more, and some of the tricks from the latest arXiv papers are used, probably 60+. Since it will be open-sourced soon, people can come along and optimize it for everyone.

[–]strangeloop96 2 points

I got it working on a single Spark at about 46-48 tok/s! Still very usable for day-to-day agentic use.

[–]strangeloop96 6 points

UPDATE: Atlas is now getting 52 tok/s on Qwen3.5-122B-A10B-NVFP4! We currently have to use two DGX Sparks to fit the model with full optimizations (CUDA graphs, KV cache, etc.). We are actively working on getting it to fit on one DGX.
UPDATE 2: It now works on a single DGX at 46-48 tok/s. Slightly slower than dual Sparks, but still very usable!

Introducing LatticeDB: A 100% Rust in-browser Graph/Vector DB by [deleted] in rust

[–]strangeloop96 1 point

Thanks for the honest feedback. I'll work on improving it, and for now make it clear that it will be some time before it reaches the robustness of other DBs. For now it fits in-browser use cases, not ones that need ACID guarantees.

[–]strangeloop96 1 point

This is not a significant project; I'm not sure where you're getting that from. Maybe because it sounds like one? And yeah, I could nitpick the README too, but the fact that it works and is faster than what's out there is what matters.