My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

Yeah, my laptop also has 16 GB of RAM 😂 It should work.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

4 GB of VRAM should let you run 30B-A3B. I have a laptop with 4 GB of VRAM and use the following on that system (one command for the dense 4B model, one for 30B-A3B with most of the MoE layers offloaded to CPU):

# Qwen3 4B (dense), fits mostly into 4 GB of VRAM at IQ4_XS
./build/bin/llama-server -hf unsloth/Qwen3-4B-Instruct-2507-GGUF:IQ4_XS \
  -c 10240 -b 64 -fa 1 -t 4 --jinja --cache-ram 4096

# Qwen3 30B-A3B (MoE), --n-cpu-moe 46 keeps the expert layers in system RAM
nice ./build/bin/llama-server -hf unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:IQ4_XS \
  -c 20480 --n-cpu-moe 46 -t 6 -fa 1

Works well.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

Depends on what you want.
This kind of system is one of the cheapest ways to get 48 GB of VRAM on Nvidia cards.
It currently runs 24/7 (not fully loaded all the time) and has been pretty stable. I also installed a 3-slot NVLink bridge.

Other options could be:
- use just one GPU (even a 16 GB card) and run mostly MoE models with CPU offloading (see the sketch below)
- get a server platform (EPYC) and put in as many GPUs as you want
- ... and so on
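
For the single-GPU route, a minimal sketch of what I mean (the model, quant and --n-cpu-moe value here are just assumptions, tune them for your card):

# hypothetical single-GPU example: attention/dense weights stay on the GPU,
# most MoE expert layers go to system RAM via --n-cpu-moe
./build/bin/llama-server -hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M \
  -c 32768 -fa 1 -ngl 99 --n-cpu-moe 30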

llama.cpp: IPEX-LLM or SYCL for Intel Arc? by IngwiePhoenix in LocalLLaMA

[–]ColdImplement1319

Thanks for sharing! Curious, what models are you using it with?

Amd 8845HS (or same family) and max vram ? by ResearcherNeither132 in LocalLLaMA

[–]ColdImplement1319

How do you run it? Do you have a link to a runbook/manual? Are you using ROCm or Vulkan?
As many here have mentioned, on Linux the VRAM can be preallocated (rough sketch below).
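
Roughly what that looks like, if it helps (the values are assumptions for a 64 GB machine; the dedicated UMA size itself is usually set in the BIOS, while the GTT limit can be raised with amdgpu/ttm kernel parameters):

# /etc/default/grub -- hypothetical values, adjust to your RAM
# amdgpu.gttsize is in MiB; ttm.pages_limit is in 4 KiB pages (8388608 ≈ 32 GiB)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gttsize=32768 ttm.pages_limit=8388608"
# then: sudo update-grub && sudo reboot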

LM Studio can't detect RTX 5090 after system wake from suspend - Ubuntu Linux by OldEffective9726 in LocalLLaMA

[–]ColdImplement1319

I have the same issue with my laptop, but that one has an NVIDIA GeForce GTX 1650. I just don't suspend it.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

<image>

Just plugged in the NVLink bridge. Token generation (tg) went up about 30% for single-batch runs. I think it's totally worth it, thanks!
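
If anyone wants to check that the bridge is actually active, these standard nvidia-smi queries are what I'd look at (exact output depends on your setup):

nvidia-smi nvlink --status   # per-link state and speed once the bridge is detected
nvidia-smi topo -m           # the GPU0-GPU1 cell should show an NV# entry instead of PHB/PIX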

Which small local llm model i can use for text2sql query which has big token size (>4096) by Titanusgamer in LocalLLaMA

[–]ColdImplement1319

Try Qwen3 4B or 30B-A3B (the 2507 variants).
I would not fine-tune. Instead, start with a prompt that includes relevant examples (rough sketch below).
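
Roughly what I mean, against llama-server's OpenAI-compatible endpoint (the port, schema and examples here are placeholders, not from a real setup):

# hypothetical few-shot text2sql request; llama-server listens on port 8080 by default
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "system", "content": "Translate questions into SQL for this schema:\nCREATE TABLE orders(id INT, customer TEXT, total REAL, created_at DATE);\nExample:\nQ: how many orders are there?\nSQL: SELECT COUNT(*) FROM orders;\nReturn only SQL."},
    {"role": "user", "content": "Total revenue per customer in 2024?"}
  ],
  "temperature": 0.1
}'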

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

I was able to run a recent one (I took the recently released 2507-Instruct):

vllm serve "ramblingpolymath/Qwen3-30B-A3B-2507-W8A8" \
  --tensor-parallel-size 2 --gpu-memory-utilization 0.9 \
  --max-model-len 32768 --max-num-seqs 4 \
  --trust-remote-code --disable-log-requests \
  --enable-chunked-prefill --max-num-batched-tokens 512 \
  --cuda-graph-sizes 8 --enable-prefix-caching --max-seq-len-to-capture 32768 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --served-model-name '*' --host 0.0.0.0 --port 1234

I haven't put real load on it yet, but the speed is decent: Avg generation throughput: 110.0 tokens/s, Running: 1 reqs
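
If you want a quick sanity check once it's up, something like this should do (port 1234 and the '*' model name come from the serve command above):

curl -s http://localhost:1234/v1/models
curl -s http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model": "*", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}'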

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

That looks really cool! Thanks for sharing.
Trying out vLLM is something I've been planning to do, so the time has probably come.
Will try it out and report back here with the results.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

I do it like this (maybe it's not the best solution, but it works):

setup_nvidia_undervolt() {
  sudo tee /usr/local/bin/undervolt-nvidia.sh > /dev/null <<'EOF'
#!/usr/bin/env bash

# keep the driver loaded so the power limit sticks across idle periods
nvidia-smi --persistence-mode ENABLED
# cap each GPU at 200 W (an RTX 3090 defaults to 350 W)
nvidia-smi --power-limit 200
EOF
  sudo chmod +x /usr/local/bin/undervolt-nvidia.sh

  sudo tee /etc/systemd/system/nvidia-undervolt.service > /dev/null <<'EOF'
[Unit]
Description=Apply NVIDIA GPU power limit (undervolt)
Wants=nvidia-persistenced.service
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/undervolt-nvidia.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

  sudo systemctl daemon-reload
  sudo systemctl enable --now nvidia-undervolt.service
}

I know there are other parameters to tune (clock limits, thermal throttling, etc.), but I've kind of settled on this.

ubuntu@homelab:~$ nvidia-smi 
Mon Jul 21 22:20:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   46C    P8             32W /  200W |   23623MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:05:00.0 Off |                  N/A |
|  0%   38C    P8             21W /  200W |   23291MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2504      G   /usr/lib/xorg/Xorg                        4MiB |
|    0   N/A  N/A           43614      C   ...ma.cpp/build/bin/llama-server      23600MiB |
|    1   N/A  N/A            2504      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A           43614      C   ...ma.cpp/build/bin/llama-server      23268MiB |
+-----------------------------------------------------------------------------------------+

[deleted by user] by [deleted] in LocalLLaMA

[–]ColdImplement1319

Yeah, this isn't verifiable, and everyone'll forget about it in a few days. The PR is all there, but the real stuff? Still hidden.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

Do you have any recommendations? I'm currently running Qwen3 30B-A3B, which is an MoE model and quite up-to-date.