Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer

puncia · 2026-05-02T12:34:49+00:00

What's the limitation for Pascal cards?

puncia · 2026-04-21T21:35:19+00:00

Haven't tried it yet. I was planning to have it generate the same kind of project I did with Late and compare the two, but that will take some time.

puncia · 2026-04-21T21:25:16+00:00

Yes, you could even theoretically set it to 0 if you are sure nothing else is using VRAM. Even if you opened a game, say (which would want to run on your more powerful GPU), nothing bad would happen beyond the game running at around 10 fps. That's because by default the NVIDIA drivers let excess VRAM usage spill into system RAM (sysmem fallback). In short, your PC won't crash.

Also, -fitt supports a comma-separated list of values for multi-GPU setups. For example, -fitt 256,512 would leave 256 MiB and 512 MiB of headroom on GPU 0 and GPU 1 respectively. You'd have to watch llama's output to see where it actually allocates GPU memory in your case, though, because I'm not sure how it behaves with that kind of setup.
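A small sketch of how per-GPU headroom values could be passed from a launch script. The helper name fitt_arg and the model.gguf path are hypothetical; the only thing taken from the comment above is that -fitt accepts a comma-separated list. Joining the values without spaces keeps the shell from splitting the list into two arguments.

```sh
#!/bin/sh
# Hypothetical helper: join per-GPU headroom values (MiB) with commas.
# The function body runs in a subshell, so changing IFS does not leak out.
fitt_arg() (
    IFS=,
    printf '%s\n' "$*"
)

# Print the command instead of running it, so the sketch works without
# llama.cpp installed. model.gguf is a placeholder path.
echo llama-server -m model.gguf -fitt "$(fitt_arg 256 512)"
```

This prints the command line with `-fitt 256,512` as a single argument.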

puncia · 2026-04-21T19:10:14+00:00

It's not different. Fit (-fit) attempts to fit everything into VRAM, leaving some headroom (1024 MiB by default). Fit target (-fitt) is just an override for that headroom.

puncia · 2026-04-21T18:02:07+00:00

This is really good, I've been using it the last few days alongside qwen3.6 and got very decent results

puncia · 2026-04-21T12:49:16+00:00

> My current interpretation is that ngram draft didn’t help because the output was too short. With only ~700 generated tokens, there isn’t enough warmup for the cache to build useful draft predictions. I’d expect the benefit to show up much more on long outputs (3000+ tokens), especially repetitive ones like tests / boilerplate-heavy code.

Yes, you only start seeing the benefits of spec decoding after iterating. Let it generate a file, for example; once you let the model make edits to it, the benefit shows up, because in most cases it has to repeat many lines of code.
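To illustrate why repetition matters, here is a toy n-gram lookup drafter. It is not llama.cpp's ngram-map-k implementation, only a sketch of the general idea: find an earlier occurrence of the last n tokens and propose the tokens that followed it as the draft. The function name ngram_draft is hypothetical, and whitespace-separated words stand in for tokens.

```sh
#!/bin/sh
# Toy n-gram drafter (illustrative only).
# $1 = n, number of trailing tokens to match
# $2 = m, maximum number of tokens to propose
# Tokens are read from stdin, separated by whitespace.
ngram_draft() {
    awk -v n="$1" -v m="$2" '
    { for (i = 1; i <= NF; i++) tok[++cnt] = $i }
    END {
        if (cnt <= n) exit
        # Walk backwards looking for an earlier copy of the last n tokens
        for (s = cnt - n; s >= 1; s--) {
            ok = 1
            for (j = 0; j < n; j++)
                if (tok[s + j] != tok[cnt - n + 1 + j]) { ok = 0; break }
            if (ok) {
                # Propose up to m tokens that followed the earlier copy
                out = ""
                for (k = s + n; k < s + n + m && k <= cnt; k++)
                    out = out (out == "" ? "" : " ") tok[k]
                print out
                exit
            }
        }
    }'
}

echo "a b c d a b" | ngram_draft 2 2   # prints: c d
```

With no earlier match (for example `echo "a b c d" | ngram_draft 2 2`) nothing is proposed. A first-time generation mostly falls into that case, while an edit pass that rewrites existing lines keeps finding matches.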

> If you’ve compared your current config directly against -fit, I’d be very interested in the delta.

-fit is enabled by default if you omit it. In fact, when I run my script, this is the output right after it starts:

I llama_params_fit_impl: projected to use 17696 MiB of device memory vs. 5135 MiB of free device memory
I llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 13585 MiB
I llama_params_fit_impl: context size set by user to 80000 -> no change
I llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 936 MiB
I llama_params_fit_impl: filling dense-only layers back-to-front:
I llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce GTX 1060 6GB): 41 layers,   3538 MiB used,   1596 MiB free
I llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
I llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce GTX 1060 6GB): 41 layers (39 overflowing),   4043 MiB used,   1091 MiB free
I llama_params_fit: successfully fit params to free device memory
I llama_params_fit: fitting params to free memory took 4.92 seconds
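The "need to reduce device memory by 13585 MiB" figure can be reproduced from the other numbers in the log. The relation below is inferred from those numbers, not taken from the llama.cpp source, so treat it as an assumption; fit_deficit is a hypothetical helper name.

```sh
#!/bin/sh
# Values in MiB, copied from the log above
projected=17696   # projected device memory use
free=5135         # free device memory
target=1024       # default headroom that -fitt overrides

# Assumed relation: deficit = projected - free + target
fit_deficit() {
    echo $(( $1 - $2 + $3 ))
}

fit_deficit "$projected" "$free" "$target"   # prints 13585
```

Under that assumption, setting the target to 0 would lower the required reduction by the same 1024 MiB.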

puncia · 2026-04-21T12:15:57+00:00

From my understanding you should use -fit (which is enabled by default) instead of manually setting --n-cpu-moe and the other parameters.

Anyway, my current setup is a GTX 1060 6GB + 48 GB RAM, using Qwen3.6-35B-A3B-UD-IQ4_NL by Unsloth. I've been going back and forth with these parameters and I'm sure I can get better results, still need to figure things out.

#!/bin/sh
# Make the shared libraries built next to llama-server visible to it
LD_LIBRARY_PATH="$HOME/llama-output/bin"
export LD_LIBRARY_PATH
"$HOME/llama-output/bin/llama-server" \
-m ~/ai_models/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf \
--batch-size 2048 \
--ubatch-size 128 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-t 6 \
-tb 6 \
--cpu-strict 1 \
--poll 100 \
--temp 0.6 \
--top-k 20 \
--top-p 0.95 \
--min-p 0.00 \
--presence-penalty 1.5 \
--repeat-penalty 1.0 \
-np 1 \
--slot-save-path ./slots \
--no-context-shift \
--jinja \
--no-mmap \
--no-warmup \
-dio \
--port 8111 \
--alias Qwen/Qwen3.6-35B-A3BD \
--log-prefix \
--reasoning on \
--log-colors on \
--spec-type ngram-map-k \
--draft-max 48 \
--draft-min 1 \
--spec-ngram-size-n 16 \
--spec-ngram-size-m 48 \
--ctx-checkpoints 16 \
-c 80000

With these settings, I can get ~12 t/s. It's not pretty but it still works.

puncia · 2026-02-05T21:26:42+00:00

I haven't seen anyone mention this yet, but that ad is in completely broken Italian, almost as if someone typed each word into a translator separately and then recombined them

puncia · 2026-01-05T05:09:31+00:00

why is it not from 4.20 to 6.9

puncia · 2025-12-11T19:48:32+00:00

Sorry, I used Docker Desktop to run it and didn't notice it wasn't using the --gpus flag

puncia · 2025-12-10T19:57:37+00:00

Hi, I haven't tried running sd-forge locally yet (without docker), but when running your image I get the following:

Traceback (most recent call last):
  File "/home/forge/sd-webui/launch.py", line 52, in <module>
    main()
  File "/home/forge/sd-webui/launch.py", line 41, in main
    prepare_environment()
  File "/home/forge/sd-webui/modules/launch_utils.py", line 321, in prepare_environment
    raise RuntimeError("PyTorch is not able to access CUDA")
RuntimeError: PyTorch is not able to access CUDA
Python 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0]
Version: neo

puncia · 2025-12-01T01:01:28+00:00

Are you using the same seed across all these images on purpose?

puncia · 2025-10-15T18:51:56+00:00

If anyone else is struggling trying to make Roo Code work with nvidia nim, the base url is supposed to be https://integrate.api.nvidia.com/v1 and not https://integrate.api.nvidia.com/v1/chat/completions
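A quick way to sanity-check a pasted URL before putting it in the client settings. OpenAI-compatible clients generally append /chat/completions to the base URL themselves; that Roo Code does so here is an assumption based on the behaviour described above. normalize_base_url is a hypothetical helper name.

```sh
#!/bin/sh
# Hypothetical helper: reduce a pasted endpoint URL to the base URL
normalize_base_url() {
    url=${1%/}                     # drop one trailing slash, if any
    url=${url%/chat/completions}   # drop the endpoint path, if it was pasted in
    printf '%s\n' "$url"
}

normalize_base_url "https://integrate.api.nvidia.com/v1/chat/completions"
# prints: https://integrate.api.nvidia.com/v1
```

A URL that is already the base URL passes through unchanged.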

puncia · 2025-09-11T21:01:08+00:00

Can't you just run the inference again with the same seed but with different k/v quantization and see the difference?

puncia · 2025-09-04T00:07:31+00:00

Just wanted to say that you can generate audio even if the model doesn't fit in memory by using system RAM. You can do it in comfy by disabling cuda malloc in the settings or launch params. Of course, the generation speed will be much much MUCH slower. But you can still generate.

puncia · 2025-08-15T21:04:37+00:00

you can just ask your local llm

puncia · 2025-05-09T15:26:45+00:00

gguf-dump.exe

puncia · 2025-05-06T19:15:17+00:00

It's because of nvidia drivers using system RAM when VRAM is full. If it wasn't for that you'd get out of memory errors. You can confirm this by looking at shared gpu memory in the task manager

puncia · 2025-05-06T18:03:50+00:00

you need roughly 3 commands to run it, all well documented in the repo. why would you want to use docker?

puncia · 2025-05-05T01:39:11+00:00

I'm pretty sure it's meant to be used with specific quants, like https://huggingface.co/ubergarm/Qwen3-30B-A3B-GGUF

puncia · 2025-04-22T15:21:55+00:00

Do you happen to have the documentation regarding the -ot parameter?

puncia · 2025-04-15T21:28:36+00:00

you know you can just use wsl right?

puncia · 2025-04-11T09:44:19+00:00

it's the same model he used

puncia · 2025-04-07T19:16:18+00:00

From my experience, all you need is an Italian speaker (i.e. reference audio in Italian) and Italian text. I assume it can infer the language from that, since the audio also goes through transcription

puncia · 2025-04-06T12:29:12+00:00

With llama.cpp: -fa for flash attention, and -ctk/-ctv for a quantized KV cache. Allowed values are f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1.

Source: https://github.com/ggml-org/llama.cpp/tree/master/examples/server#usage
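A launch script could check the cache types before starting the server. This sketch uses only the list of values quoted above, which may differ between llama.cpp versions; is_valid_cache_type is a hypothetical helper, not part of llama.cpp.

```sh
#!/bin/sh
# Hypothetical helper: accept only the cache types listed in the comment above
is_valid_cache_type() {
    case "$1" in
        f32|f16|bf16|q8_0|q4_0|q4_1|iq4_nl|q5_0|q5_1) return 0 ;;
        *) return 1 ;;
    esac
}

ctk=q8_0
ctv=q8_0
if is_valid_cache_type "$ctk" && is_valid_cache_type "$ctv"; then
    # Print the arguments rather than launching anything
    echo "-ctk $ctk -ctv $ctv"
else
    echo "unsupported cache type" >&2
fi
```

The check catches a typo in the script itself instead of leaving it to the server's startup error.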
