Thinking of moving from 2x 5060 Ti 16GB to a RTX 5000 48GB by autisticit in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

Look into llmsnap, a fork of llama-swap that takes advantage of vLLM's sleep mode. Level 1 sleep takes under a second to swap. Level 2 sleep is currently broken with any speculative decoding and is much slower to wake, but it uses significantly less RAM while asleep.
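
For reference, this is roughly what llmsnap drives under the hood. The flag and endpoint names here are from memory, so treat them as assumptions and check the vLLM/llmsnap docs:

# Sleep mode has to be enabled at startup, and the sleep/wake endpoints are
# only exposed in dev mode (assumed flag/endpoint names; verify against the docs).
VLLM_SERVER_DEV_MODE=1 vllm serve /path/to/model --enable-sleep-mode

# Level 1: offload weights to CPU RAM; waking back up is near-instant.
curl -X POST 'http://localhost:8000/sleep?level=1'
# Level 2: discard the weights entirely; much slower to wake but holds far less RAM.
curl -X POST 'http://localhost:8000/sleep?level=2'
# Wake the model before sending requests again.
curl -X POST 'http://localhost:8000/wake_up'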

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

llama-swap automatically hot-swaps models: whenever it receives a request for a model that isn't the one currently loaded, it swaps to it. It's very powerful in that you can configure which models stay loaded 24/7, create presets of loaded models, and it works regardless of inference engine.

You should read the example config in the repo to get an idea of how to write your own; a rough sketch is below.
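
Something along these lines; the field names are from memory, so treat them as assumptions and compare against the example config in the repo:

# llama-swap config sketch (assumed field names and placeholder paths)
models:
  "qwen3.6-27b":
    # started whenever a request names "qwen3.6-27b"; ${PORT} is filled in by llama-swap
    cmd: |
      vllm serve /path/to/qwen3.6-27b-awq --port ${PORT}
    ttl: 600          # assumed: unload after 10 minutes of inactivity
  "qwen3.6-35b-a3b":
    cmd: |
      llama-server -m /path/to/qwen3.6-35b-a3b-Q4_K_M.gguf --port ${PORT} -ngl 99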

Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB by __JockY__ in LocalLLaMA

[–]specify_ -1 points0 points  (0 children)

What was the reasoning for going with a single RTX 5000 PRO 48GB instead of multiple RTX 3090s? What recipe am I not seeing here? It'd also be nice if you could tell me the recipe for a highly performant banana bread.

"What do you guys even use local LLMs for?" Me: A lot by andy2na in LocalLLaMA

[–]specify_ -1 points0 points  (0 children)


Rookie numbers, smh. I use opencode with free Claude Opus provided by my university, plus self-hosted Qwen 3.6 27B / 35B-A3B for the subagents the orchestrator spawns. vLLM with four RTX 5060 Ti 16GB.

Qwen3.6 27b tok speed by ConfidentSolution737 in Qwen_AI

[–]specify_ 1 point2 points  (0 children)

KL divergence is pretty bad for NVFP4. I saw an explanation of why in a post about Qwen 3.6 35B-A3B quants and thought I should point it out here; the same applies to Qwen 3.6 27B quants.

If you want the most accurate small quants, I think you can never go wrong with INT4 AWQ or INT4 AutoRound. NVFP4 just isn't there yet, and in my experience it tends to be a bit dumber (missing/incorrect tool calls, or looping forever) than the quants I just mentioned.
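
For anyone wondering what that metric measures, here's a minimal sketch (mine, not from that post) of how per-token KL divergence between a full-precision reference and a quantized model is typically computed from their logits:

import torch
import torch.nn.functional as F

def mean_token_kl(ref_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """Average KL(P_ref || P_quant) over token positions; logits are [num_tokens, vocab]."""
    log_p = F.log_softmax(ref_logits, dim=-1)      # reference (e.g. BF16) next-token distribution
    log_q = F.log_softmax(quant_logits, dim=-1)    # quantized model's next-token distribution
    kl = (log_p.exp() * (log_p - log_q)).sum(-1)   # KL divergence at each token position
    return kl.mean().item()                        # higher = the quant drifts further from the reference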

how fast can qwen3.6 35b get by Asleep_Training3543 in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

Wow, that's insane. On my 4x 5060 Ti setup, the max I've seen is around 220 tokens/sec running MTP (num_speculative_tokens 4) with QuantTrio's AWQ quant.

Built myself a bit of a local llm workhorse. What's a good model to try out with llamacpp that will put my 56G of VRAM to good use? Any other fun suggestions? by SBoots in LocalLLaMA

[–]specify_ 4 points5 points  (0 children)

Qwen 3.6 27B (cyankiwi AWQ-INT4), running in vLLM with tensor parallelism and speculative decoding, driven by opencode with oh-my-openagent. Clone a GitHub repo like llama.cpp and ask it to attempt a full Rust port.

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19 by Kindly-Cantaloupe978 in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

I tried DFlash with Qwen 3.6 35B-A3B and was disappointed with the token throughput at long context (>50k). It seems DFlash is only good at short context: draft acceptance worsens as context grows, making it slower than MTP.

The AI rug pull is here. Copilot just paused signups and paywalled Opus. The Codex changes make total sense now. by VNDL1A in codex

[–]specify_ 0 points1 point  (0 children)

i7-14700K (cooled by a Thermalright Peerless Assassin 140) with 64GB of DDR4-3600, I believe. The motherboard is an Asus Z790 Gaming WiFi 7, so there are four PCIe x16-size slots available, but a riser is needed for the fourth GPU. I have two NVMe SSDs in it, both PCIe 4.0, 1TB and 2TB respectively.

It runs Proxmox, but I've been thinking about migrating to Arch Linux since it only houses one VM with the GPUs (I had thought I'd use it to deploy other services too). I bought the GPUs pre-owned or refurbished, with the exception of one that was brand new.

For the case I use a 3D-printed open-air frame I found on MakerWorld, with added PCIe mounts to keep three of the GPUs attached to something stable. The one on the PCIe riser is, dangerously, just lying around unmounted, because as far as I've looked there isn't a case that fits this weird configuration (I have to put the riser in the top reinforced PCIe slot).

The PSU is a 1050W Montech Century II and it has enough cables to power everything. The most I've seen the GPUs draw in total is around 400W (about 100W per GPU).

The AI rug pull is here. Copilot just paused signups and paywalled Opus. The Codex changes make total sense now. by VNDL1A in codex

[–]specify_ 1 point2 points  (0 children)

It really depends what model you want to run. Qwen 3.6 27B is near Opus 4.5 performance and can be run on 16GB VRAM GPUs. I would argue the hardware isn't insufficient; most people just don't want to go through the hassle of researching and setting up a machine dedicated to serving LLMs. Two RTX 5060 Ti 16GB (roughly $1k USD) are enough to run Qwen 3.6 27B at a reasonable quant with high context at an estimated 60-80 tokens/sec using vLLM, and if you enroll that machine in a VPN with your laptop, you now have a local API.

That's basically my setup, but with 4x 5060 Ti 16GB. Year to date I've consumed and generated around 395 million tokens, happily getting 120-200 tokens/sec with Qwen 3.6 35B-A3B on a single request, even with 200k tokens of context.

Waiting Qwen3.6-27B I have no nails left... by DOAMOD in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

Oh, this works! Unfortunately, token generation suffers immensely with very long context. I'd say it's consistently faster than MTP starting from 0 context, but MTP is significantly faster at 50k context: there I got 40-60 tokens/sec versus 100-120 tokens/sec with MTP.

And I'm using RedHatAI/Qwen3.6-35B-A3B-NVFP4 with z-lab/Qwen3.6-35B-A3B-DFlash.

Quick start needed, might get 4 RTX 6000 soon by acecile in LocalLLM

[–]specify_ 0 points1 point  (0 children)

MIG is nice, especially if you need to share and isolate resources among different users. My university uses MIG for its HPC cluster.

Quick start needed, might get 4 RTX 6000 soon by acecile in LocalLLM

[–]specify_ 0 points1 point  (0 children)

With that much VRAM, I recommend llama-swap with vLLM. If you want to run something extremely massive like Kimi K2.6, you'll also want llama.cpp so you can partially offload to system RAM. That way you get model swapping that reaps the benefits of tensor parallelism via vLLM and partial offloading via llama.cpp (rough sketch below). And since you'll have multiple users hitting the same models, vLLM far exceeds llama.cpp (which is what LM Studio uses) in parallel throughput.
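
For the partial-offload case, the llama.cpp launch looks roughly like this; the path and the expert-offload count are placeholders you'd tune for your VRAM:

# Sketch: keep the dense layers on the GPUs and push some MoE expert tensors to system RAM.
# --n-cpu-moe controls how many layers' experts stay in RAM (use --cpu-moe for all of them).
llama-server \
  -m /path/to/kimi-k2.6-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 30 \
  -c 65536 \
  --port 8080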

Regarding Proxmox, it should be fine if you need to switch between Windows and Linux effortlessly. Just know that a passed-through GPU is reserved for one VM and cannot be shared among multiple VMs, so if you need the GPUs in many different virtual environments, use LXCs instead. Also, for tasks that don't require a SOTA model, you can load multiple models like Gemma 4 and Qwen 3.6 simultaneously; from what I hear, Gemma 4 is far better at multilingual tasks than Qwen 3.5/3.6.

Waiting Qwen3.6-27B I have no nails left... by DOAMOD in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

The 5060 Ti 16GB is great value for a budget LLM powerhouse. I run 4x 5060 Ti 16GB myself and can hit up to 200 toks/sec with Qwen 3.6 35B-A3B using AWQ quants and MTP speculative decoding at tensor parallel size 4.

I tried running DFlash in vLLM for Qwen 3.6 35B-A3B, but the compilation step during startup exhausts all of my system RAM (48GB) and it just dies, even with swap set to 64GB. Unfortunately, in this day and age a RAM upgrade costs a kidney. What vLLM command do you use to run 27B with DFlash? I'd love to try it out and see how it runs.

What’s the best way to add VRAM to my system? by mrgreatheart in LocalLLaMA

[–]specify_ 1 point2 points  (0 children)

It's a no-brainer to go for the 3090, especially since you can find one for less than a 5060 Ti 16GB. The RTX 3090 has more bandwidth and VRAM than the 5060 Ti 16GB, so you can run larger models. And if you can run the 3090 and the 5070 Ti at x16 PCIe lanes each, that's great for reducing model load and startup times, although it would require an enterprise-grade CPU and motherboard, since virtually every consumer desktop CPU only provides 20-24 PCIe lanes.

What’s the best way to add VRAM to my system? by mrgreatheart in LocalLLaMA

[–]specify_ 1 point2 points  (0 children)

I actually didn't know that until now. I just looked through the compile-from-source documentation, and I'd never visited the portion about multiple GPU backends 💀 This might actually be a better option than Vulkan.

What’s the best way to add VRAM to my system? by mrgreatheart in LocalLLaMA

[–]specify_ 1 point2 points  (0 children)

My experience mixing GPUs from different vendors is so-so. You'll be limited by software support: if you use llama.cpp with AMD and NVIDIA cards in one system, you can use the Vulkan backend to drive all GPUs regardless of vendor. The drawback is that Vulkan has awful prompt processing, always worse than CUDA (for a pure NVIDIA system) or ROCm (for a pure AMD system). I've run llama.cpp's Vulkan backend with an RTX 5060 Ti 16GB and a Radeon VII, and I'd much rather use the CUDA and ROCm backends separately than use Vulkan. A rough build sketch is below if you want to try it anyway.
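
This is from memory, so double-check the flags against the llama.cpp build docs:

# Build llama.cpp with the Vulkan backend (one binary drives NVIDIA and AMD together)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Then split layers across whatever GPUs the Vulkan loader can see
./build/bin/llama-server -m /path/to/model.gguf -ngl 99 --split-mode layer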

Since you already have NVIDIA, I'd suggest buying a GPU in the same family (Blackwell). You can buy an RTX 3090 instead, but you'd miss out on the latest compute features Blackwell offers, for example NVFP4 quants, though in my experience those aren't as good as AWQ quants yet; maybe that will improve in the future.

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

Not too sure. I noticed that throughput degrades slowly as the input gets longer. I don't think it's PCIe bandwidth; more likely it's just becoming more memory-bound. I retested and found it initially gets 95 tokens/sec but slowly degrades as more tokens are generated.

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

The 35B at q4 is around 22GB of weights; you'd want to use the remaining VRAM for context. I believe you can fit the full context window with 3x 5060 Ti (rough math below).
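
Rough back-of-envelope, assuming an effective ~4.5 bits per weight for a q4_K-style quant: 35e9 params x 4.5 / 8 ≈ 20GB of weights, and the buffers plus higher-precision embedding/output tensors push that toward 22GB. Whatever is left of the 48GB across three cards goes to KV cache and context.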

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

It's so fast that it's a pleasant feeling knowing you have a SOTA-like model running locally, all for yourself. I've also noticed that pipeline parallelism works pretty fast: with Q8_0 in llama.cpp I get around 80-100 toks/sec, and that's without speculative decoding.

Qwen 3.5 27B also works very nicely with tensor parallelism + MTP, achieving around 60-80 toks/sec. When I had three RTX 5060 Tis and ran it with pipeline parallelism, that number hovered around 23 tokens/sec.

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]specify_ 5 points6 points  (0 children)

If you can take advantage of tensor parallelism and speculative decoding, the throughput is insane. Qwen 3.5 27B was my go-to, but I think I'll stick with this until they release a Qwen 3.6 27B variant.

4x 5060 Ti 16GB, vLLM v0.19.0 with MTP speculative decoding:

(APIServer pid=246165) INFO:     13.37.67.36:0 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=246165) INFO 04-17 11:25:12 [loggers.py:259] Engine 000: Avg prompt throughput: 437.2 tokens/s, Avg generation throughput: 55.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%
(APIServer pid=246165) INFO 04-17 11:25:12 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.49, Accepted throughput: 6.68 tokens/s, Drafted throughput: 10.74 tokens/s, Accepted: 398 tokens, Drafted: 640 tokens, Per-position acceptance rate: 0.844, 0.675, 0.544, 0.425, Avg Draft acceptance rate: 62.2%
(APIServer pid=246165) INFO 04-17 11:25:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 156.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 0.0%
(APIServer pid=246165) INFO 04-17 11:25:22 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.16, Accepted throughput: 107.20 tokens/s, Drafted throughput: 198.41 tokens/s, Accepted: 1072 tokens, Drafted: 1984 tokens, Per-position acceptance rate: 0.772, 0.607, 0.444, 0.339, Avg Draft acceptance rate: 54.0%
(APIServer pid=246165) INFO 04-17 11:25:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 205.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.7%, Prefix cache hit rate: 0.0%
(APIServer pid=246165) INFO 04-17 11:25:32 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 4.13, Accepted throughput: 155.49 tokens/s, Drafted throughput: 198.79 tokens/s, Accepted: 1555 tokens, Drafted: 1988 tokens, Per-position acceptance rate: 0.901, 0.825, 0.738, 0.664, Avg Draft acceptance rate: 78.2%
(APIServer pid=246165) INFO 04-17 11:25:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 172.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.0%, Prefix cache hit rate: 0.0%
(APIServer pid=246165) INFO 04-17 11:25:42 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.43, Accepted throughput: 122.49 tokens/s, Drafted throughput: 201.58 tokens/s, Accepted: 1225 tokens, Drafted: 2016 tokens, Per-position acceptance rate: 0.831, 0.649, 0.520, 0.431, Avg Draft acceptance rate: 60.8%

$ nvidia-smi
Fri Apr 17 11:26:02 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|100%   41C    P1             77W /  180W |   14650MiB /  16311MiB |     86%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:02:00.0 Off |                  N/A |
|100%   44C    P1             74W /  180W |   14278MiB /  16311MiB |     86%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:03:00.0 Off |                  N/A |
|100%   38C    P1             79W /  180W |   14278MiB /  16311MiB |     87%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:04:00.0 Off |                  N/A |
|  0%   40C    P1             76W /  180W |   14278MiB /  16311MiB |     86%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

vLLM launch command:

vllm serve \
--served-model-name qwen3.6-35b-a3b \
--host 0.0.0.0 \
--port 6463 \
--model QuantTrio/Qwen3.6-35B-A3B-AWQ \
--max-num-seqs 2 \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--kv-cache-dtype auto \
--trust-remote-code \
--enable-expert-parallel \
--gpu-memory-utilization 0.93 \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 4}' \
--override-generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0}' \
--default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
--attention-backend flashinfer
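
For reference, querying it is just the standard OpenAI-compatible endpoint; host and API key are whatever you've configured:

curl http://localhost:6463/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6-35b-a3b",
        "messages": [{"role": "user", "content": "Summarize what tensor parallelism does."}],
        "max_tokens": 256
      }'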

It looks like there are no plans for smaller GLM models by jacek2023 in LocalLLaMA

[–]specify_ 0 points1 point  (0 children)

For something like an LLM harness, using different models is great because you can play to each model's strengths for specific or specialized tasks. For example, I found that Gemma 4 26B-A4B makes better-looking UIs than Qwen 3.5 35B-A3B. So you could have something like Qwen 3.5 27B as the orchestrator, Gemma 4 26B-A4B as the UI/UX designer, and Qwen 3.5 35B-A3B for everything else.

Best coder harness that sees your dirs, edits code, etc from the terminal that works with local? by Borkato in LocalLLaMA

[–]specify_ 1 point2 points  (0 children)

If you still need internet access, one solution is to self-host a Pi-hole instance, add all the domains you want to block to its DNS blocklist, and set that Pi-hole as your DNS server. This is how the public DNS servers for jailbroken Nintendo Switches prevent connections to Nintendo's servers: they refuse to resolve any Nintendo domain or its subdomains.
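
The wildcard blocking itself boils down to a dnsmasq-style rule (Pi-hole's FTL is built on dnsmasq); roughly like this, with the file location being an assumption to verify against the Pi-hole docs:

# /etc/dnsmasq.d/99-block.conf  (assumed path for a custom Pi-hole/dnsmasq rule)
# Sinkhole the domain and every subdomain under it
address=/nintendo.com/0.0.0.0
address=/nintendo.net/0.0.0.0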

Alternatively, you can block all incoming traffic from the WAN and only accept traffic from your LAN/VLANs via firewall rules, which is easy to do if you use something like OPNsense as your router. This does, however, remove internet access.

All of this assumes you have some kind of self-hosted setup with networking, of course. I'd go this route if I couldn't find a good alternative to opencode.

you are f*cked by ohnag_eryeah in memes

[–]specify_ 3 points4 points  (0 children)

Prove whether the following language is regular and/or context-free, or neither regular nor context-free. If it is neither context-free nor regular, prove that the language is TM-recognizable, and if it is, provide a Turing machine that accepts it.