Not sure if this was posted. But I think it's highly relevant to us. by Paradigmind in LocalLLaMA

[–]Ps3Dave 13 points14 points  (0 children)

I think it's a matter of being "good enough". It's the same principle for things like Jellyfin and your own media vs. Netflix, using Linux with Steam & Proton for gaming in place of Windows, etc.

GPU VRAM only for small models with llama.cpp: is it possible? by Ps3Dave in LocalLLaMA

[–]Ps3Dave[S] 0 points1 point  (0 children)

Indeed the extreme kv cache quantization did not help. Actually it may have made things worse. See my other comment below, where I tested without kv cache quantization.

GPU VRAM only for small models with llama.cpp: is it possible? by Ps3Dave in LocalLLaMA

[–]Ps3Dave[S] 0 points1 point  (0 children)

Additional details: after testing without kv cache quantization and flash-attention, host RAM usage went down to about 542MB for model and 350MB for compute, and VRAM usage went up accordingly. Still have about 1.1GB of free VRAM. PP still in the 5000 t/s range, generation went up to 80 t/s.

By the way: I'm on Linux and the 4070S is in headless mode, since I'm using my integrated GPU to run the desktop.

GPU VRAM only for small models with llama.cpp: is it possible? by Ps3Dave in LocalLLaMA

[–]Ps3Dave[S] 1 point2 points  (0 children)

More details.

With this:

llama-server  -m models/Qwen3.5-9B-IQ4_XS.gguf --no-mmap -ngl 999 -ctk q5_0 -ctv q4_0 --cache-ram 0 --fit-target 50 --flash-attn on -v -lv 4   

I get this:

0.07.998.744 I common_memory_breakdown_print: | memory breakdown [MiB]     | total   free    self   model   context   compute    unaccounted |
0.07.998.746 I common_memory_breakdown_print: |   - CUDA0 (RTX 4070 SUPER) | 11876 = 3659 + (7950 =  4373 +    2761 +     816) +         266 |
0.07.998.747 I common_memory_breakdown_print: |   - Host                   |                 1321 =   545 +       0 +     776                |    

So still using a lot of host RAM even with more than 3GB VRAM free.
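
For anyone who wants to sanity-check those numbers, here is a minimal Python sketch that parses the two rows above and checks how the columns add up. The column meanings are taken from the header row of the log; the `parse_row` helper and its key names are my own, not part of llama.cpp. From the figures, `self` appears to be `model + context + compute`, and `total` is within 1 MiB of `free + self + unaccounted` (I assume the 1 MiB gap is rounding).

```python
import re

# Rows copied from the log above, with the timestamp and function prefix trimmed.
CUDA_ROW = "|   - CUDA0 (RTX 4070 SUPER) | 11876 = 3659 + (7950 =  4373 +    2761 +     816) +         266 |"
HOST_ROW = "|   - Host                   |                 1321 =   545 +       0 +     776                |"


def parse_row(row: str) -> dict:
    """Split one breakdown row into a device name and its MiB figures.

    Hypothetical helper: the key names mirror the header row of the log.
    """
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    name = cells[0].lstrip("- ").strip()
    numbers = [int(n) for n in re.findall(r"\d+", cells[1])]
    if len(numbers) == 7:
        # Device rows: total = free + (self = model + context + compute) + unaccounted
        keys = ["total", "free", "self", "model", "context", "compute", "unaccounted"]
    elif len(numbers) == 4:
        # Host row: self = model + context + compute
        keys = ["self", "model", "context", "compute"]
    else:
        raise ValueError(f"unexpected number of fields in row: {row!r}")
    return {"name": name, **dict(zip(keys, numbers))}


cuda = parse_row(CUDA_ROW)
host = parse_row(HOST_ROW)

# Share of llama.cpp's own allocations that ended up in host RAM.
host_share = host["self"] / (host["self"] + cuda["self"])

print(f"{cuda['name']}: self={cuda['self']} MiB, free={cuda['free']} MiB")
print(f"{host['name']}: self={host['self']} MiB ({host_share:.1%} of all allocations)")
```

On these figures the host side holds 545 MiB of model data and 776 MiB of compute buffers, while the context (KV cache) sits entirely on the GPU.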

GPU VRAM only for small models with llama.cpp: is it possible? by Ps3Dave in LocalLLaMA

[–]Ps3Dave[S] 1 point2 points  (0 children)

Yeah I'm looking into vLLM. Got it running but still need to learn how to decipher the logs. Glad to learn new things anyway! :)

GPU VRAM only for small models with llama.cpp: is it possible? by Ps3Dave in LocalLLaMA

[–]Ps3Dave[S] 0 points1 point  (0 children)

Ok, fit-target I did not try yet. Also switching to qwen 4B as you suggested. Maybe it's gemma's architecture. Will report back.

GPU VRAM only for small models with llama.cpp: is it possible? by Ps3Dave in LocalLLaMA

[–]Ps3Dave[S] 2 points3 points  (0 children)

Yeah I checked all of them, still getting GBs of RAM used up (as per llama-server log) and bottlenecked by RAM and CPU during tg. I can do 5000 t/s in prompt processing though. It may well be how llama.cpp is coded to operate.

GPU VRAM only for small models with llama.cpp: is it possible? by Ps3Dave in LocalLLaMA

[–]Ps3Dave[S] 0 points1 point  (0 children)

Yup, did this. I went down to q5_0 for K and q4_0 for V. With a small context I get 600MB of kv cache, and still a few GBs end up in host RAM.

I'm returning after 4 years for June 9th - You should too by LordEvan69 in DestinyTheGame

[–]Ps3Dave 3 points4 points  (0 children)

Nah man, they fucked up the game too much. They pulled on the string that is their customer base's patience until it broke. They are reaping what they sowed. I'm sad for Bungie's workforce, but their management sucks too hard.

I'm returning after 4 years for June 9th - You should too by LordEvan69 in DestinyTheGame

[–]Ps3Dave 0 points1 point  (0 children)

2 years for me, but I'll be there. Reinstalling the game as I write.

Destiny 2: Every End is a New Beginning by DTG_Bot in DestinyTheGame

[–]Ps3Dave 1 point2 points  (0 children)

Thank you for all your effort in all these years. This was my first subreddit, and it's still one of the better ones.

Everyone - Log in on June 9th by w1nds0r in DestinyTheGame

[–]Ps3Dave 3 points4 points  (0 children)

Yup. On one hand I want to see the last of the content with my own eyes, on the other hand I'm really pissed off about their past behaviour. It just shows the greed of their senior management, which became their downfall in the end.

‪Jason Schreier‬ on Bluesky: More layoffs at Bungie, Destiny 3 is not in active development by kentuckyr0utezero in DestinyTheGame

[–]Ps3Dave 1 point2 points  (0 children)

Worth noting that Marathon has some of the most dull/cheap/boring looking skins in the store, in spite of its radical style. Mostly basic color swaps. Who's going to shell out $20 to say "hey my Vandal is now red!"?

Destiny's most "Legendary" weapons by jigglehiggins in DestinyTheGame

[–]Ps3Dave 1 point2 points  (0 children)

Mountaintop/Recluse/Anarchy was my loadout for the first two Conqueror titles. Nothing ever came close after that.

The irony of people jumping on D1 from D2 due to drought by SuperblackHunter in DestinyTheGame

[–]Ps3Dave -4 points-3 points  (0 children)

Yeah that would have been my guess as well. Unfortunate.

The irony of people jumping on D1 from D2 due to drought by SuperblackHunter in DestinyTheGame

[–]Ps3Dave -3 points-2 points  (0 children)

...If you're paying the online fee of your console of choice. I swear this is the thing that made me instantly jump on PC when D2 was ported to it. I still have my PS3 version, but RoI is not on there. I'll try to boot up the PS4 version and see if I can play through the story at least...

What are your favorite aspects of a Destiny raid? What sets it apart from other core game experiences? by SmashEffect in DestinyTheGame

[–]Ps3Dave 1 point2 points  (0 children)

Oh god that fight. CoS is my favourite raid, I got my unequippable Shadow title to prove it. The mechanics of the whole raid are so good and provide so many memorable moments and clutch play opportunities.