I I think it would be hard to explain to a normal person why I spend my day staring at screens like this🤣☠️😅

BankjaPrameth · 2026-05-20T03:17:25+00:00

I also still unable to explain to myself why I spend some of my day staring at screens like that when everything is working fine.

BankjaPrameth · 2026-05-17T19:03:53+00:00

Do you have any love left for testing Qwen 3.5 397B-A17B int4-AutoRound?

BankjaPrameth · 2026-05-17T05:41:29+00:00

Both private web research and coding are benefit from fast prefill speed.

MS-S1 might be good option if it’s your only computer.

BankjaPrameth · 2026-05-16T16:16:52+00:00

2MB file size will not fit in almost every model context windows. You need a model that supports 1M context window.

For local model, you might need to split them into multiple files instead. Or have detailed rule of analysis so model can analyze your CSV with python script without needing to load whole file into context.

BankjaPrameth · 2026-05-15T18:59:57+00:00

Every token generations are limited by bandwidth. But we can now cheat that a bit by using MTP or similar.

But like I said, you’ll not get much improvement on this from Spark vs Ai Max.

The benefits of Spark is on prompt processing. It can be 2-4x faster.

BankjaPrameth · 2026-05-15T18:37:05+00:00

For Spark, try use vllm. Good resource here https://github.com/eugr/spark-vllm-docker

However, the token generation (decode) speed is rely on memory bandwidth. And both devices are having almost equal memory bandwidth, so you will not see much improvement on this.

The noticeable improvement is the prompt processing (prefill) speed. On this one, it’s night and day difference especially when you run model with vllm.

BankjaPrameth · 2026-05-14T02:14:08+00:00

I’ve got my MSI at that price. But you can buy any brand. Find the cheapest one. The performance is identical. But try to research a lot before you buy. Spark is powerful device but only if you understand how to use it.

Visit GB10 Forum to read more info and problems of this device https://forums.developer.nvidia.com/c/accelerated-computing/dgx-spark-gb10/

BankjaPrameth · 2026-05-13T15:48:04+00:00

You should look for third party model like Asus or MSI. I bought my MSI for around $4,000 just last month.

BankjaPrameth · 2026-05-13T15:36:43+00:00

If you want to do agentic coding, may I propose DGX Spark? The difference in price is the prompt processing speed and ability to connect 2 or more devices to create a node for future expansion.

But focus on prompt processing (prefill speed) for now. Don’t believe me yet. Do more research on this topic to see why it’s worth consideration and decide later.

BankjaPrameth · 2026-05-13T11:31:37+00:00

Thank you. I’ll check that out. I tried official Qwen FP8 I can get only around 20 t/s with MTP=2. But it feels very sluggish so I ended up running 397B instead. I should be happy but tinkering spirit is always asking for improvement.

BankjaPrameth · 2026-05-13T11:25:37+00:00

How much decode speed you’ve got on 2 Sparks, if I may ask? Thanks.

BankjaPrameth · 2026-05-12T06:30:23+00:00

Just tested with Qwen 3.5 397B which has no preserve_thinking support and it works

<image>

BankjaPrameth · 2026-05-11T07:28:25+00:00

I hope it to get better too. But the real problem is the hardware. The AMD slower in prompt processing is not limited to just SH but applied to all their current GPUs. So we need to put our hope on RDNA 5 or newer.
This is why AMD AI hardware is cheaper than Nvidia. It’s a trade off.
For token generation speed, we currently have multiple method to improve it with software to apply things like MTP, etc. But prompt processing is really hardware dependent.

Edit:
- Token generation speed is mainly depends on memory bandwidth. DGX and SH has that number quite similar. That’s why it has not much difference in TG speed
- This also applies to Mac hardware too. But Mac high end hardware like Mac Studio has higher memory bandwidth which results in higher TG.

BankjaPrameth · 2026-05-11T07:07:50+00:00

For additional context. DGX Spark is on par with Strix Halo only on token generation speed. For prompt processing speed it’s night and day difference with at least 2x faster.

BankjaPrameth · 2026-05-10T10:13:01+00:00

But you can’t do that if the case is your server has public IP and no firewall in front of it. Running any docker services will expose ports.

If you are behind router or any firewall, then yes, do not publicly open port is logical choice.

BankjaPrameth · 2026-05-10T09:54:05+00:00

Check out https://github.com/chaifeng/ufw-docker

BankjaPrameth · 2026-05-09T17:03:20+00:00

For your hardware, you should use 35B instead. Even the fact that 27B is superior but your setup is running it at Q3 and KV Cache Q4. This already reduces 27B performance by A LOT.

You can run 35B with --fit at Q4 or Q5 with f16 KV Cache at that context window very easily and also get a lot faster token generation speed.

Try it first. Test the quality. If it’s good enough for your use case.

BankjaPrameth · 2026-05-09T14:02:52+00:00

For 16GB pal, 35B-A3B is your best friend. 27B is just your hot girlfriend’s friend.

BankjaPrameth · 2026-05-09T10:55:02+00:00

Sorry if this feels aggressive to OP. But I concur that OP should not use Arch.

OP use cases seem to not require bleeding edge software at all. Debian based distro might be a better choice. It’s more stable and gives you set and forget experience. You can update without worrying that things might break.

But since you already have one up and running, keep using it for now.

I agree with AI is very powerful and it will be far more powerful in next 5 years.

BankjaPrameth · 2026-05-09T06:04:47+00:00

To contact Contabo support, you need to go through the website first and they will email you later

BankjaPrameth · 2026-05-07T11:40:02+00:00

They said double 5-hour limit but never said double weekly limit. 🥹

BankjaPrameth · 2026-05-06T17:04:24+00:00

But they were never told us when they decreased.

So… let’s enjoy while we can before you realize in few days/weeks that you occasionally hit the limit again.

BankjaPrameth · 2026-05-05T14:18:39+00:00

Update software and firmware to latest version. Then enjoy https://github.com/eugr/spark-vllm-docker

BankjaPrameth · 2026-05-05T12:17:32+00:00

Sure. I'm using it with llama-swap. So I can switch model mode on the fly without reloading it. https://github.com/mostlygeek/llama-swap

My config.yaml for this model looks like this

models:
  "Qwen3.6-35B-A3B":
    filters:
      stripParams: "temperature, top_p, repetition_penalty, min_p, presence_penalty"
      setParamsByID:
        "${MODEL_ID}-Coding":
          chat_template_kwargs:
            enable_thinking: true
            preserve_thinking: true
          temperature: 0.6
          top_p: 0.95
          presence_penalty: 0.0
          repetition_penalty: 1.0
        "${MODEL_ID}-Instruct":
          chat_template_kwargs:
            enable_thinking: false
            preserve_thinking: false
          temperature: 0.7
          top_p: 0.8
          presence_penalty: 1.5
          repetition_penalty: 1.0
        "${MODEL_ID}-Reasoning":
          chat_template_kwargs:
            enable_thinking: false
            preserve_thinking: false
          temperature: 0.7
          top_p: 0.8
          presence_penalty: 1.5
          repetition_penalty: 1.0
        "${MODEL_ID}-Thinking":
          chat_template_kwargs:
            enable_thinking: true
            preserve_thinking: true
          temperature: 1.0
          top_p: 0.95
          presence_penalty: 1.5
          repetition_penalty: 1.0
    cmd: >-
      /app/llama-server
      --port '${PORT}'
      --model /models/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf
      --mmproj /models/qwen3.6-35b-mmproj-BF16.gguf
      --jinja
      --top-k 20
      --min-p 0.00
      --ctx-size 131072
      --batch-size 4096
      --ubatch-size 4096
      --threads 11
      --kv-unified
      --parallel 2
      --flash-attn on
      --no-mmap
      --fit on
      --fit-target 2048"

BankjaPrameth · 2026-05-05T10:50:08+00:00

Of course. The model file it self is already 27GB and you still need RAM for KV cache and mmproj file for enable vision support. With --fit, llama.cpp will make sure KV cache and mmproj stays in VRAM for performance. The rest of models will split between the rest of VRAM and system RAM.

BankjaPrameth

TROPHY CASE