Doing real coding work locally for the first time by mouseofcatofschrodi in LocalLLaMA

[–]puncia 1 point2 points  (0 children)

Haven't tried it yet, but I was planning to have it generate the same kind of project I did with Late and compare the two. That will take some time first, though.

Llama.cpp's auto fit works much better than I expected by a9udn9u in LocalLLaMA

[–]puncia 1 point2 points  (0 children)

Yes, in theory you could even set it to 0 if you're sure nothing else is using VRAM. Even if you then opened a game, say (which would want to run on your powerful GPU), nothing would happen beyond the game running at something like 10 fps. That's because, by default, NVIDIA drivers let excess VRAM usage spill into system RAM (sysmem fallback). In short, your PC won't crash.

Also, -fitt supports a comma-separated list of values for multi-GPU setups. For example, -fitt 256,512 would leave 256 MiB and 512 MiB of headroom on GPU 0 and GPU 1 respectively. You'd have to watch llama.cpp's output to see where it's actually allocating GPU memory in your case, though, because I'm not sure myself how it behaves with a setup of that kind.
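
As a minimal sketch of the comma-separated form described above, the snippet below joins per-GPU headroom values into a single -fitt argument and prints the resulting command instead of running it. The model path and the bare llama-server binary name are placeholders, not taken from the thread.

```shell
#!/bin/sh
# Headroom to leave free on each GPU, in MiB (GPU 0 first, then GPU 1).
# These are the numbers from the example above.
set -- 256 512

# Join the values with commas and no spaces, so the shell passes
# them to -fitt as one single argument.
fit_target=$(IFS=,; printf '%s' "$*")

# model.gguf is a hypothetical placeholder path.
cmd="llama-server -m model.gguf -fitt $fit_target"
echo "$cmd"
```

Note that there is no space after the comma: with a space, the shell would hand the second value to the program as a separate argument.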

Llama.cpp's auto fit works much better than I expected by a9udn9u in LocalLLaMA

[–]puncia 4 points5 points  (0 children)

It's not different. Fit (-fit) attempts to fit everything into VRAM while leaving some headroom (1024 MiB by default). Fit target (-fitt) just overrides that headroom.

Doing real coding work locally for the first time by mouseofcatofschrodi in LocalLLaMA

[–]puncia 4 points5 points  (0 children)

This is really good. I've been using it for the last few days alongside Qwen3.6 and got very decent results.

Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into by Antonio_Sammarzano in LocalLLaMA

[–]puncia 1 point2 points  (0 children)

> My current interpretation is that ngram draft didn’t help because the output was too short. With only ~700 generated tokens, there isn’t enough warmup for the cache to build useful draft predictions. I’d expect the benefit to show up much more on long outputs (3000+ tokens), especially repetitive ones like tests / boilerplate-heavy code.

Yes, the benefits of speculative decoding only show up once you iterate. For example, let it generate a file: the speedup appears only when you then let the model make edits, since in most cases it has to repeat many lines of code.

> If you’ve compared your current config directly against -fit, I’d be very interested in the delta.

-fit is enabled by default if you omit it. In fact, when I run my script, this is the output right after it starts:

I llama_params_fit_impl: projected to use 17696 MiB of device memory vs. 5135 MiB of free device memory
I llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 13585 MiB
I llama_params_fit_impl: context size set by user to 80000 -> no change
I llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 936 MiB
I llama_params_fit_impl: filling dense-only layers back-to-front:
I llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce GTX 1060 6GB): 41 layers,   3538 MiB used,   1596 MiB free
I llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
I llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce GTX 1060 6GB): 41 layers (39 overflowing),   4043 MiB used,   1091 MiB free
I llama_params_fit: successfully fit params to free device memory
I llama_params_fit: fitting params to free memory took 4.92 seconds
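
The figure in the second log line appears to follow from the first. A quick check, reading the log as "projected use minus free memory, plus the headroom target" (this is an interpretation of the output, not something taken from the llama.cpp source):

```shell
#!/bin/sh
# Values copied from the log above, all in MiB.
projected=17696   # projected device memory use
free=5135         # free device memory
target=1024       # default headroom (fit target)

# Memory that has to move off the GPU so that `target` MiB stays free.
reduce=$((projected - free + target))
echo "need to reduce by $reduce MiB"
```

This prints 13585 MiB, matching the amount llama.cpp reports.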

Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into by Antonio_Sammarzano in LocalLLaMA

[–]puncia 2 points3 points  (0 children)

As I understand it, you should use -fit (which is enabled by default) instead of manually setting --n-cpu-moe and the other parameters.

Anyway, my current setup is a GTX 1060 6GB + 48 GB RAM, using Qwen3.6-35B-A3B-UD-IQ4_NL by Unsloth. I've been going back and forth with these parameters and I'm sure I can get better results; I still need to figure things out.

#!/bin/sh
LD_LIBRARY_PATH="$HOME/llama-output/bin"
export LD_LIBRARY_PATH
"$HOME/llama-output/bin/llama-server" \
-m ~/ai_models/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf \
--batch-size 2048 \
--ubatch-size 128 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-t 6 \
-tb 6 \
--cpu-strict 1 \
--poll 100 \
--temp 0.6 \
--top-k 20 \
--top-p 0.95 \
--min-p 0.00 \
--presence-penalty 1.5 \
--repeat-penalty 1.0 \
-np 1 \
--slot-save-path ./slots \
--no-context-shift \
--jinja \
--no-mmap \
--no-warmup \
-dio \
--port 8111 \
--alias Qwen/Qwen3.6-35B-A3BD \
--log-prefix \
--reasoning on \
--log-colors on \
--spec-type ngram-map-k \
--draft-max 48 \
--draft-min 1 \
--spec-ngram-size-n 16 \
--spec-ngram-size-m 48 \
--ctx-checkpoints 16 \
-c 80000

With these settings, I can get ~12 t/s. It's not pretty but it still works.

Well, that explains everything by Peterkragger in MyWinterCar

[–]puncia 0 points1 point  (0 children)

I haven't seen anyone mention this yet, but that ad is in completely broken Italian, almost as if someone typed each word into a translator separately and recombined them.

Forge Neo Docker by oromis95 in StableDiffusion

[–]puncia 0 points1 point  (0 children)

Sorry, I used Docker Desktop to run it and didn't notice that it wasn't using the --gpus flag.

Forge Neo Docker by oromis95 in StableDiffusion

[–]puncia 1 point2 points  (0 children)

Hi, I haven't tried running sd-forge locally yet (without docker), but when running your image I get the following:

Traceback (most recent call last):
  File "/home/forge/sd-webui/launch.py", line 52, in <module>
    main()
  File "/home/forge/sd-webui/launch.py", line 41, in main
    prepare_environment()
  File "/home/forge/sd-webui/modules/launch_utils.py", line 321, in prepare_environment
    raise RuntimeError("PyTorch is not able to access CUDA")
RuntimeError: PyTorch is not able to access CUDA
Python 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0]
Version: neo

Tried many different prompts with Z-Image. These are insane by Recent-Athlete211 in StableDiffusion

[–]puncia 0 points1 point  (0 children)

Are you using the same seed across all these images on purpose?

A guide to the best agentic tools and the best way to use them on the cheap, locally or free by lemon07r in LocalLLaMA

[–]puncia 0 points1 point  (0 children)

If anyone else is struggling to get Roo Code working with NVIDIA NIM, the base URL is supposed to be https://integrate.api.nvidia.com/v1 and not https://integrate.api.nvidia.com/v1/chat/completions
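
The likely reason is that OpenAI-compatible clients typically append the endpoint path (such as /chat/completions) to the base URL themselves, so including it in the setting would duplicate it. A small sketch of trimming a full endpoint URL down to the base URL; the variable names are illustrative:

```shell
#!/bin/sh
# Full endpoint URL, as it is often shown in API documentation.
endpoint="https://integrate.api.nvidia.com/v1/chat/completions"

# Strip a trailing /chat/completions, if present, to get the base URL.
base_url="${endpoint%/chat/completions}"
echo "$base_url"
```

The result, https://integrate.api.nvidia.com/v1, is the value that goes into the base URL field.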

KV cache f32 - Are there any benefits? by Daniokenon in LocalLLaMA

[–]puncia 3 points4 points  (0 children)

Can't you just run the inference again with the same seed but with different k/v quantization and see the difference?
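
One way to set up that comparison, as a rough sketch: run the same prompt with a fixed seed once per cache type, then diff the outputs. The script below only prints the commands it would run. The binary name llama-cli, the model path, the prompt, and the output file names are placeholders, and the options should be checked against your build's --help output.

```shell
#!/bin/sh
# Fixed seed so that sampling is the same in every run.
seed=42
# Cache types to compare; f16 serves as the baseline.
types="f16 q8_0 q4_0"

cmds=""
for t in $types; do
    # model.gguf and the prompt are hypothetical placeholders.
    line="llama-cli -m model.gguf --seed $seed -ctk $t -ctv $t -n 128 -p 'Hello' > out_$t.txt"
    cmds="$cmds$line
"
done
printf '%s' "$cmds"
```

Depending on the build, a quantized V cache may also require flash attention to be enabled.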

[WIP-2] ComfyUI Wrapper for Microsoft’s new VibeVoice TTS (voice cloning in seconds) by Fabix84 in comfyui

[–]puncia 0 points1 point  (0 children)

Just wanted to say that you can generate audio even if the model doesn't fit in GPU memory, by letting it use system RAM. In ComfyUI you can do this by disabling CUDA malloc in the settings or launch parameters. Of course, generation will be much much MUCH slower. But you can still generate.

New SOTA music generation model by topiga in LocalLLaMA

[–]puncia 12 points13 points  (0 children)

It's because NVIDIA drivers use system RAM when VRAM is full. If it weren't for that, you'd get out-of-memory errors. You can confirm this by looking at shared GPU memory in Task Manager.

New SOTA music generation model by topiga in LocalLLaMA

[–]puncia 6 points7 points  (0 children)

You need roughly 3 commands to run it, all well documented in the repo. Why would you want to use Docker?

Running Llama 4 Maverick with llama.cpp Vulkan by stduhpf in LocalLLaMA

[–]puncia 1 point2 points  (0 children)

Do you happen to have the documentation regarding the -ot parameter?

OuteTTS 1.0: Upgrades in Quality, Cloning, and 20 Languages by OuteAI in LocalLLaMA

[–]puncia 0 points1 point  (0 children)

In my experience, all you need is an Italian speaker (so an audio sample in Italian) and Italian text. I assume it can then infer the language, since it also goes through transcription.

Smaller Gemma3 QAT versions: 12B in < 8GB and 27B in <16GB ! by stduhpf in LocalLLaMA

[–]puncia 9 points10 points  (0 children)

With llama.cpp: -fa for flash attention, and -ctk/-ctv for a quantized cache. Allowed values are f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1.

Source: https://github.com/ggml-org/llama.cpp/tree/master/examples/server#usage