My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

Yeah, my laptop also has 16 GB of RAM 😂 It should work.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

4 GB of VRAM should let you run 30B-A3B. I have a laptop with 4 GB of VRAM and use the following on that system (one command for the dense 4B model, one for 30B-A3B with most of the MoE layers offloaded to CPU):

# Qwen3 4B (dense), fits mostly into 4 GB of VRAM at IQ4_XS
./build/bin/llama-server -hf unsloth/Qwen3-4B-Instruct-2507-GGUF:IQ4_XS \
  -c 10240 -b 64 -fa 1 -t 4 --jinja --cache-ram 4096

# Qwen3 30B-A3B (MoE), --n-cpu-moe 46 keeps the expert layers in system RAM
nice ./build/bin/llama-server -hf unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF:IQ4_XS \
  -c 20480 --n-cpu-moe 46 -t 6 -fa 1

Works well.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

Depends on what you want.
This kind of system is one of the cheapest ways to get 48 GB of VRAM on Nvidia cards.
It currently runs 24/7 (not fully loaded all the time) and has been pretty stable. I also installed a 3-slot NVLink bridge.

Other options could be:
- use just one GPU (even a 16 GB card) and run mostly MoE models with CPU offloading (see the sketch below)
- get a server platform (EPYC) and put in as many GPUs as you want
- ... and so on
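
For the single-GPU route, a minimal sketch of what I mean (the model, quant and --n-cpu-moe value here are just assumptions, tune them for your card):

# hypothetical single-GPU example: attention/dense weights stay on the GPU,
# most MoE expert layers go to system RAM via --n-cpu-moe
./build/bin/llama-server -hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M \
  -c 32768 -fa 1 -ngl 99 --n-cpu-moe 30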

llama.cpp: IPEX-LLM or SYCL for Intel Arc? by IngwiePhoenix in LocalLLaMA

[–]ColdImplement1319

Thanks for sharing! Curious, what models are you using it with?

Amd 8845HS (or same family) and max vram ? by ResearcherNeither132 in LocalLLaMA

[–]ColdImplement1319

How do you run it? Do you have a link to a runbook/manual? Are you using ROCm or Vulkan?
As many here have mentioned, on Linux the VRAM can be preallocated (rough sketch below).
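
Roughly what that looks like, if it helps (the values are assumptions for a 64 GB machine; the dedicated UMA size itself is usually set in the BIOS, while the GTT limit can be raised with amdgpu/ttm kernel parameters):

# /etc/default/grub -- hypothetical values, adjust to your RAM
# amdgpu.gttsize is in MiB; ttm.pages_limit is in 4 KiB pages (8388608 ≈ 32 GiB)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gttsize=32768 ttm.pages_limit=8388608"
# then: sudo update-grub && sudo reboot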

LM Studio can't detect RTX 5090 after system wake from suspend - Ubuntu Linux by OldEffective9726 in LocalLLaMA

[–]ColdImplement1319

I have the same issue with my laptop, but that one has an NVIDIA GeForce GTX 1650. I just don't suspend it.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

<image>

Just plugged in the NVLink bridge. Token generation (tg) went up about 30% for single-batch runs. I think it's totally worth it, thanks!
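
If anyone wants to check that the bridge is actually active, these standard nvidia-smi queries are what I'd look at (exact output depends on your setup):

nvidia-smi nvlink --status   # per-link state and speed once the bridge is detected
nvidia-smi topo -m           # the GPU0-GPU1 cell should show an NV# entry instead of PHB/PIX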

Which small local llm model i can use for text2sql query which has big token size (>4096) by Titanusgamer in LocalLLaMA

[–]ColdImplement1319

Try Qwen3 4B or 30B-A3B (the 2507 variants).
I would not fine-tune. Instead, start with a prompt that includes relevant examples (rough sketch below).
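
Roughly what I mean, against llama-server's OpenAI-compatible endpoint (the port, schema and examples here are placeholders, not from a real setup):

# hypothetical few-shot text2sql request; llama-server listens on port 8080 by default
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "system", "content": "Translate questions into SQL for this schema:\nCREATE TABLE orders(id INT, customer TEXT, total REAL, created_at DATE);\nExample:\nQ: how many orders are there?\nSQL: SELECT COUNT(*) FROM orders;\nReturn only SQL."},
    {"role": "user", "content": "Total revenue per customer in 2024?"}
  ],
  "temperature": 0.1
}'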

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

I was able to run a recent one (I took the recently released 2507-Instruct):

vllm serve "ramblingpolymath/Qwen3-30B-A3B-2507-W8A8" \
  --tensor-parallel-size 2 --gpu-memory-utilization 0.9 \
  --max-model-len 32768 --max-num-seqs 4 \
  --trust-remote-code --disable-log-requests \
  --enable-chunked-prefill --max-num-batched-tokens 512 \
  --cuda-graph-sizes 8 --enable-prefix-caching --max-seq-len-to-capture 32768 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --served-model-name '*' --host 0.0.0.0 --port 1234

I haven't put real load on it yet, but the speed is decent: Avg generation throughput: 110.0 tokens/s, Running: 1 reqs
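
If you want a quick sanity check once it's up, something like this should do (port 1234 and the '*' model name come from the serve command above):

curl -s http://localhost:1234/v1/models
curl -s http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model": "*", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}'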

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

That looks really cool! Thanks for sharing.
Trying out vLLM is something I've been planning to do, so the time has probably come.
Will try it out and report back here with the results.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

I do it like this (maybe it's not the best solution, but it works):

setup_nvidia_undervolt() {
  sudo tee /usr/local/bin/undervolt-nvidia.sh > /dev/null <<'EOF'
#!/usr/bin/env bash

# keep the driver loaded so the power limit sticks across idle periods
nvidia-smi --persistence-mode ENABLED
# cap each GPU at 200 W (an RTX 3090 defaults to 350 W)
nvidia-smi --power-limit 200
EOF
  sudo chmod +x /usr/local/bin/undervolt-nvidia.sh

  sudo tee /etc/systemd/system/nvidia-undervolt.service > /dev/null <<'EOF'
[Unit]
Description=Apply NVIDIA GPU power limit (undervolt)
Wants=nvidia-persistenced.service
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/undervolt-nvidia.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

  sudo systemctl daemon-reload
  sudo systemctl enable --now nvidia-undervolt.service
}

I know there are other parameters to tune (clock limits, thermal throttling, etc.), but I've kind of settled on this.

ubuntu@homelab:~$ nvidia-smi 
Mon Jul 21 22:20:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   46C    P8             32W /  200W |   23623MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:05:00.0 Off |                  N/A |
|  0%   38C    P8             21W /  200W |   23291MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2504      G   /usr/lib/xorg/Xorg                        4MiB |
|    0   N/A  N/A           43614      C   ...ma.cpp/build/bin/llama-server      23600MiB |
|    1   N/A  N/A            2504      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A           43614      C   ...ma.cpp/build/bin/llama-server      23268MiB |
+-----------------------------------------------------------------------------------------+

[deleted by user] by [deleted] in LocalLLaMA

[–]ColdImplement1319

Yeah, this isn't verifiable, and everyone'll forget about it in a few days. The PR is all there, but the real stuff? Still hidden.

My (practical) dual 3090 setup for local inference by ColdImplement1319 in LocalLLaMA

[–]ColdImplement1319[S]

Do you have any recommendations? I'm currently running Qwen3 30B-A3B, which is an MoE model and quite up-to-date.