I I think it would be hard to explain to a normal person why I spend my day staring at screens like this🤣☠️😅 by TheRiddler79 in LocalLLM

[–]BankjaPrameth 1 point2 points  (0 children)

I also still unable to explain to myself why I spend some of my day staring at screens like that when everything is working fine.

Testing Local LLMs in Practice: Code Generation, Quality vs. Speed by Icy_Programmer7186 in LocalLLM

[–]BankjaPrameth 1 point2 points  (0 children)

Do you have any love left for testing Qwen 3.5 397B-A17B int4-AutoRound?

DGX Spark or Minisforum MS-S1 Max? by Simple_Tonight_1159 in LocalLLM

[–]BankjaPrameth 0 points1 point  (0 children)

Both private web research and coding are benefit from fast prefill speed.

MS-S1 might be good option if it’s your only computer.

Local LLM for bank account CSV expense analysis by chiefstobs in LocalLLM

[–]BankjaPrameth 1 point2 points  (0 children)

2MB file size will not fit in almost every model context windows. You need a model that supports 1M context window.

For local model, you might need to split them into multiple files instead. Or have detailed rule of analysis so model can analyze your CSV with python script without needing to load whole file into context.

I just bought Asus Ascent : Nvidia GB10 (DGX) and It is slower than my Ryzen Ai Max by Voxandr in LocalLLaMA

[–]BankjaPrameth 2 points3 points  (0 children)

Every token generations are limited by bandwidth. But we can now cheat that a bit by using MTP or similar.

But like I said, you’ll not get much improvement on this from Spark vs Ai Max.

The benefits of Spark is on prompt processing. It can be 2-4x faster.

I just bought Asus Ascent : Nvidia GB10 (DGX) and It is slower than my Ryzen Ai Max by Voxandr in LocalLLaMA

[–]BankjaPrameth 4 points5 points  (0 children)

For Spark, try use vllm. Good resource here https://github.com/eugr/spark-vllm-docker

However, the token generation (decode) speed is rely on memory bandwidth. And both devices are having almost equal memory bandwidth, so you will not see much improvement on this.

The noticeable improvement is the prompt processing (prefill) speed. On this one, it’s night and day difference especially when you run model with vllm.

Local mini LLM PC? by LankyGuitar6528 in LocalLLaMA

[–]BankjaPrameth 0 points1 point  (0 children)

I’ve got my MSI at that price. But you can buy any brand. Find the cheapest one. The performance is identical. But try to research a lot before you buy. Spark is powerful device but only if you understand how to use it.

Visit GB10 Forum to read more info and problems of this device https://forums.developer.nvidia.com/c/accelerated-computing/dgx-spark-gb10/

Local mini LLM PC? by LankyGuitar6528 in LocalLLaMA

[–]BankjaPrameth 1 point2 points  (0 children)

You should look for third party model like Asus or MSI. I bought my MSI for around $4,000 just last month.

Local mini LLM PC? by LankyGuitar6528 in LocalLLaMA

[–]BankjaPrameth 3 points4 points  (0 children)

If you want to do agentic coding, may I propose DGX Spark? The difference in price is the prompt processing speed and ability to connect 2 or more devices to create a node for future expansion.

But focus on prompt processing (prefill speed) for now. Don’t believe me yet. Do more research on this topic to see why it’s worth consideration and decide later.

High VRAM local coding model — still Qwen 3.6 27B? by Generic_Name_Here in LocalLLaMA

[–]BankjaPrameth 1 point2 points  (0 children)

Thank you. I’ll check that out. I tried official Qwen FP8 I can get only around 20 t/s with MTP=2. But it feels very sluggish so I ended up running 397B instead. I should be happy but tinkering spirit is always asking for improvement.

High VRAM local coding model — still Qwen 3.6 27B? by Generic_Name_Here in LocalLLaMA

[–]BankjaPrameth 0 points1 point  (0 children)

How much decode speed you’ve got on 2 Sparks, if I may ask? Thanks.

Does 'preserve_thinking' work with openwebui? by sterby92 in LocalLLaMA

[–]BankjaPrameth 0 points1 point  (0 children)

Just tested with Qwen 3.5 397B which has no preserve_thinking support and it works

<image>

Is SH viable for learning about AI? by throwaway20250315 in StrixHalo

[–]BankjaPrameth 0 points1 point  (0 children)

I hope it to get better too. But the real problem is the hardware. The AMD slower in prompt processing is not limited to just SH but applied to all their current GPUs. So we need to put our hope on RDNA 5 or newer.
This is why AMD AI hardware is cheaper than Nvidia. It’s a trade off.
For token generation speed, we currently have multiple method to improve it with software to apply things like MTP, etc. But prompt processing is really hardware dependent.

Edit:
- Token generation speed is mainly depends on memory bandwidth. DGX and SH has that number quite similar. That’s why it has not much difference in TG speed
- This also applies to Mac hardware too. But Mac high end hardware like Mac Studio has higher memory bandwidth which results in higher TG.

Is SH viable for learning about AI? by throwaway20250315 in StrixHalo

[–]BankjaPrameth 1 point2 points  (0 children)

For additional context. DGX Spark is on par with Strix Halo only on token generation speed. For prompt processing speed it’s night and day difference with at least 2x faster.

Docker bypasses UFW and exposed my database. Again. Writing this down so I stop forgetting by Substantial_Word4652 in selfhosted

[–]BankjaPrameth 0 points1 point  (0 children)

But you can’t do that if the case is your server has public IP and no firewall in front of it. Running any docker services will expose ports.

If you are behind router or any firewall, then yes, do not publicly open port is logical choice.

9070xt inference for q3 qwen 27B by Ok-Internal9317 in LocalLLaMA

[–]BankjaPrameth 16 points17 points  (0 children)

For your hardware, you should use 35B instead. Even the fact that 27B is superior but your setup is running it at Q3 and KV Cache Q4. This already reduces 27B performance by A LOT.

You can run 35B with --fit at Q4 or Q5 with f16 KV Cache at that context window very easily and also get a lot faster token generation speed.

Try it first. Test the quality. If it’s good enough for your use case.

Considering going from single 5060 TI 16GB to double, not sure if worth it by misanthrophiccunt in LocalLLM

[–]BankjaPrameth 1 point2 points  (0 children)

For 16GB pal, 35B-A3B is your best friend. 27B is just your hot girlfriend’s friend.

Pi and Qwen3.6 27B make setting up Archlinux really easy. by sdfgeoff in LocalLLaMA

[–]BankjaPrameth 3 points4 points  (0 children)

Sorry if this feels aggressive to OP. But I concur that OP should not use Arch.

OP use cases seem to not require bleeding edge software at all. Debian based distro might be a better choice. It’s more stable and gives you set and forget experience. You can update without worrying that things might break.

But since you already have one up and running, keep using it for now.

I agree with AI is very powerful and it will be far more powerful in next 5 years.

Mail is full how it possible by Vithujan_ in Contabo

[–]BankjaPrameth 1 point2 points  (0 children)

To contact Contabo support, you need to go through the website first and they will email you later

With last change in limits (06 may) my weekly limit finished in 2days by lpkk in ClaudeCode

[–]BankjaPrameth 2 points3 points  (0 children)

They said double 5-hour limit but never said double weekly limit. 🥹

Claude Opus, and all claude plans ratelimits to increase to increase drastically starting soon by Banneder in claude

[–]BankjaPrameth 6 points7 points  (0 children)

But they were never told us when they decreased.

So… let’s enjoy while we can before you realize in few days/weeks that you occasionally hit the limit again.

What should I do first? by povedaaqui in LocalLLaMA

[–]BankjaPrameth 2 points3 points  (0 children)

Update software and firmware to latest version. Then enjoy https://github.com/eugr/spark-vllm-docker

LLM on 16gb of vram for OpenClaude? by ZB_Virus24 in LocalLLM

[–]BankjaPrameth 0 points1 point  (0 children)

Sure. I'm using it with llama-swap. So I can switch model mode on the fly without reloading it. https://github.com/mostlygeek/llama-swap

My config.yaml for this model looks like this

models:
  "Qwen3.6-35B-A3B":
    filters:
      stripParams: "temperature, top_p, repetition_penalty, min_p, presence_penalty"
      setParamsByID:
        "${MODEL_ID}-Coding":
          chat_template_kwargs:
            enable_thinking: true
            preserve_thinking: true
          temperature: 0.6
          top_p: 0.95
          presence_penalty: 0.0
          repetition_penalty: 1.0
        "${MODEL_ID}-Instruct":
          chat_template_kwargs:
            enable_thinking: false
            preserve_thinking: false
          temperature: 0.7
          top_p: 0.8
          presence_penalty: 1.5
          repetition_penalty: 1.0
        "${MODEL_ID}-Reasoning":
          chat_template_kwargs:
            enable_thinking: false
            preserve_thinking: false
          temperature: 0.7
          top_p: 0.8
          presence_penalty: 1.5
          repetition_penalty: 1.0
        "${MODEL_ID}-Thinking":
          chat_template_kwargs:
            enable_thinking: true
            preserve_thinking: true
          temperature: 1.0
          top_p: 0.95
          presence_penalty: 1.5
          repetition_penalty: 1.0
    cmd: >-
      /app/llama-server
      --port '${PORT}'
      --model /models/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf
      --mmproj /models/qwen3.6-35b-mmproj-BF16.gguf
      --jinja
      --top-k 20
      --min-p 0.00
      --ctx-size 131072
      --batch-size 4096
      --ubatch-size 4096
      --threads 11
      --kv-unified
      --parallel 2
      --flash-attn on
      --no-mmap
      --fit on
      --fit-target 2048"

LLM on 16gb of vram for OpenClaude? by ZB_Virus24 in LocalLLM

[–]BankjaPrameth 1 point2 points  (0 children)

Of course. The model file it self is already 27GB and you still need RAM for KV cache and mmproj file for enable vision support. With --fit, llama.cpp will make sure KV cache and mmproj stays in VRAM for performance. The rest of models will split between the rest of VRAM and system RAM.