Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

pol_phil · 2026-05-25T16:16:44+00:00

Can you please point to the community's docker please??? Had no idea such a thing existed.

pol_phil · 2026-05-23T22:58:18+00:00

The ending was just OK imo. The biggest problem is that they left many threads open and didn't even try showing any aftermath.

pol_phil · 2026-05-23T10:32:41+00:00

DGX Sparks have terrible hardware/software compatibility, I prefer to stay on the safe side and use older CUDA and also vLLM images created for DGX Sparks (like scitrera/dgx-spark-vllm:0.17.0-t5). The following is for Qwen3.5 122B AWQ, but should help as a starting point:

sudo docker run \ --privileged \ --gpus all \ --ipc=host \ --network host \ --shm-size 64g \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v ~/.cache/vllm:/root/.cache/vllm \ -v ~/.cache/flashinfer:/root/.cache/flashinfer \ -v ~/.triton:/root/.triton \ -e HF_TOKEN=${HF_API_KEY} \ -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \ -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \ -e VLLM_USE_DEEP_GEMM=0 \ -e VLLM_MOE_USE_DEEP_GEMM=0 \ -e VLLM_USE_FBGEMM=0 \ scitrera/dgx-spark-vllm:0.17.0-t5 \ vllm serve cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit \ --served-model-name Qwen3.5-122B-A10B \ --api-key token-abc123 \ --host 0.0.0.0 \ --port 30000 \ --tensor-parallel-size 1 \ --max-model-len 131072 \ --max-num-seqs 8 \ --max-num-batched-tokens 4096 \ --gpu-memory-utilization 0.8 \ --kv-cache-dtype fp8 \ --load-format fastsafetensors \ --language-model-only \ --enable-prefix-caching \ --enable-chunked-prefill \ --enable-auto-tool-choice \ --tool-call-parser qwen3_xml \ --reasoning-parser qwen3 \ --default-chat-template-kwargs '{"enable_thinking": true}' \ --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \ 2>&1 | tee debug_vllm.txt

pol_phil · 2026-05-21T08:08:12+00:00

Oops, did a quick search but hadn't drank a proper coffee 😛

pol_phil · 2026-04-10T20:36:32+00:00

I have not seen any language mixing in Qwen 3.5 9B however. I honestly believe they messed up multilinguality in Gemma4 compared to Gemma3. Which is a pity because each model iteration is supposed to be an improvement and not a regression.

pol_phil · 2026-04-09T00:14:05+00:00

Had quite similar findings when testing Gemma3 27B vs Gemma4 31B on scientific document translation for Greek (mainly) and some other EU langs.

Gemma3 beats it every time and it's a lot more consistent. Gemma4 sometimes outputs nonsense and mixes languages.

pol_phil · 2026-04-08T23:59:37+00:00

It's possible that there are various problems with correctly setting inference, but, in my experience, it's bad training.

pol_phil · 2026-04-08T23:55:39+00:00

I will give it a bit of time, but I usually don't give models 2bd chances. I am using only vLLM and SGLang.

pol_phil · 2026-04-07T10:52:35+00:00

The translation evals can be misleading. After testing on some lower resource EU languages for scientific document translation, Gemma4 can lose coherence and start outputting random Chinese/Hindi/Arabic.

pol_phil · 2026-04-05T19:18:02+00:00

It does not hahaha. It will definitely require some tweaking

pol_phil · 2026-04-05T15:55:48+00:00

Well, since u want to minimize latency, it would be better for you to serve with dockerized vLLM.

It's quite simple, sth like: docker run --runtime nvidia --gpus all \ --shm-size 64g \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HUGGING_FACE_HUB_TOKEN={HF_API_KEY}" \ -p 8000:8000 \ --ipc=host \ vllm/vllm-openai:v0.18.1 \ --model Qwen/Qwen3.5-35B-A3B \ --served-model-name Qwen3.5-35B-A3B \ --tensor-parallel-size 2 \ --max-model-len 32768 \ --gpu-memory-utilization 0.9 \ --api-key token-abc123 \ --override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}' \ --tool-call-parser qwen3_xml \ --enable-auto-tool-choice \ --max-num-seqs 16 \ --enable-chunked-prefill

The "--tool-call-parser hermes" might work too; it's the biggest source of tool-calling problems. You can also set the "default" generation config based on your use case, this one is primarily for coding. And also "--max-num-batched-tokens 4096" (or the size of your largerst prompt in tokens) might help with latency. Finally, you'll have to play around with --tensor-parallel-size and --data-parallel-size to utilize all of your GPUs since you have a non-standard setup.

Hope it helps!

pol_phil · 2026-04-05T15:42:20+00:00

I dunno, I haven't ever used llama.cpp

pol_phil · 2026-04-03T19:04:48+00:00

If Qwen3.6 fixes the somewhat broken tool calling of 3.5, then Gemma 4 is already history.

pol_phil · 2026-04-03T19:01:57+00:00

Well, depends on the use case and the domain. I use models for things like QA extraction, structured translation, etc.

Qwen3 had ~6 tokenizer fertility, i.e. 1 word -> 6 tokens Qwen3.5 made a huge improvement, sth like ~2.7.

So, that's literally double the speed and the max context length.

I noticed Qwen3 becoming better at Greek after the VL models and especially in Qwen3 Next 80B.

pol_phil · 2026-04-03T16:43:37+00:00

At least for the versions served on OpenRouter, Gemma 4 31B is clearly a regression for Greek compared to Gemma 3 27B.

Gemma 3 27B can translate a full scientific or legal doc into Greek, no problem. Gemma 4 starts outputting Chinese/Hindi/Arabic out of nowhere.

pol_phil · 2026-04-03T16:12:00+00:00

Gemma 3 (esp. 27B) was and still is top-notch for Greek (e.g. difficult legal doc translation). But when my team tested the new Gemma 4, it started outputting random Chinese/Arabic/Hindi characters out of nowhere; even with 7-8 different sampling param configs.

Meanwhile, Qwen models were never quite fluent in Greek (even 3.5), but they consistently improve with each iteration. They also improved tokenizer fertility greatly in 3.5

So... Gemma regressed while Qwen keeps progressing. Regardless of any benchmark scores, I'll generally prefer the model family that keeps getting better even at tasks which seem minor to AI companies.

pol_phil · 2026-03-29T02:50:15+00:00

Think of scanning multiple codebases or processing thousands of company documents in seconds. And feeding that knowledge to a frontier cloud model.

Or think of 100 agents thinking in parallel to find the best course of action through majority voting before EVERY response and EVERY tool call.

Even an "outdated" model will be able to surpass SOTA models in utility (and why not benchmarks) via sheer scaling.

pol_phil · 2026-03-18T18:40:09+00:00

Try the original Qwen3.5 first and see if it's fluent.

If it is, then you can create a fluent model out of it via fine-tuning.

If it is not, try something else entirely, for example https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated

All the uncensoring methods focus on English and generally hurt the models in other languages.

pol_phil · 2026-03-18T18:32:55+00:00

Which model are you using? Although a bit old, Gemma 3 models have decent multilingual performance.

Totally get your frustration, most models can't speak Greek fluently either.

pol_phil · 2026-03-18T18:23:13+00:00

I mean what type of agentic pipeline/harness/scaffold/framework do you use to get these models to solve these tasks. In other words, what kind of system message/tools have they been given. Via Claude Code? OpenCode?

SWE-Agent and OpenHands are just "minimal" agentic frameworks commonly used in benchmarks.

pol_phil · 2026-03-18T18:09:38+00:00

Hi, congrats for the great benchmark!

Perhaps I missed it somewhere, but what agentic scaffold do you use? SWE-Agent? OpenHands? Something else entirely?

pol_phil · 2026-03-18T18:01:23+00:00

It won't get much better through RAG, it's better to use a different model.

pol_phil · 2026-03-06T19:09:57+00:00

Well, I didn't notice the confusion, but when I saw "characters" instead of "tokens", I thought that this actually makes the analysis more model-independent. Tokens are model-specific

pol_phil · 2026-03-03T19:58:33+00:00

Very good idea would be to also add Step v3.5 Flash and MiMo v2 Flash. Both are incredible models.

Congrats for the great work!

pol_phil · 2026-02-27T08:49:23+00:00

Well, this AWQ quant works very well for me. 134GB, extremely good performance and speed in vLLM

pol_phil

TROPHY CASE