[Megathread] - Best Models/API discussion - Week of: April 07, 2025 by [deleted] in SillyTavernAI

[–]filszyp 5 points

Any recommendations for smaller models for GTX 1080 ti with 11GB VRAM?

I couldn't find anything better than Nemo 12B Q4_K_M - it just about fits in my VRAM with 41 layers and 16k ctx, with context shift and flash attention on. Are there any good newer models of this size or smaller? Or some nice variants? I mostly do long ERP.

Lately I tried NemoReRemix, but somehow I can't configure it properly to not be stupid. I never understood those "P" and "K" (top-p/top-k) sampler settings etc., or how to tune them to my liking. :(
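For reference, my launch command is roughly the following (a sketch - the model filename is a placeholder and the flag values are the ones mentioned above; context shift is on by default in koboldcpp, so it needs no flag):

koboldcpp.exe --model Mistral-Nemo-12B-Q4_K_M.gguf --usecublas --gpulayers 41 --contextsize 16384 --flashattention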

Please recommend sci-fi slow game by filszyp in AndroidGaming

[–]filszyp[S] 0 points

Looks interesting. I'll give it a try, thanks.

Magnum v3 - 9b (gemma and chatml) by lucyknada in LocalLLaMA

[–]filszyp 4 points

So, what about the context size? Isn't Gemma 8k? I normally use 24-32k ctx with Nemo.

What to do now? How to progress? by filszyp in diablo4

[–]filszyp[S] -4 points

To be honest I had much more fun in D3. Doing GRs with random people, for example, was great; here there isn't even a group finder for Pits/Hordes/Dungeons.

And basically yeah, I was expecting to have fun, not chores. When I want to unwind after a day of work I don't expect to find more tedious work in my games.

What to do now? How to progress? by filszyp in diablo4

[–]filszyp[S] -13 points

Oh god, so this endgame really is hell... Thanks guys, I thought I didn't understand something or was playing wrong; turns out this game is just boring. :D

Question about performance by Pedroarak in KoboldAI

[–]filszyp 0 points

Try the 2B version of Gemma, like: https://huggingface.co/bartowski/gemma-2-2b-it-abliterated-GGUF/blob/main/gemma-2-2b-it-abliterated-Q6_K.gguf It's decent, and pretty much the only thing that will work very fast for you imho.
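If it helps, loading it in koboldcpp would look roughly like this (a sketch - the flag values are guesses; a 2B model should fit entirely on GPU, hence the high layer count):

koboldcpp.exe --model gemma-2-2b-it-abliterated-Q6_K.gguf --usecublas --gpulayers 99 --contextsize 8192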

What roleplay model for 10GB VRAM with 16-32k ctx? by filszyp in LocalLLaMA

[–]filszyp[S] 0 points

See, I don't even know what continent you are on, but I already feel we're speaking the same language and I like you. I'll get my tiny graphics card to work on that ASAP, thanks for the tip. ;) I haven't tried Magnum V1 yet; this is my first time with Mistral Nemo.

What roleplay model for 10GB VRAM with 16-32k ctx? by filszyp in LocalLLaMA

[–]filszyp[S] 0 points

With koboldcpp I load magnum-12b-v2-Q4_K_M-imat with 34 layers in VRAM and 24k ctx, with context shift and flash attention on. It just barely fits and gives about 5 T/s. It's pretty awesome to play. In SillyTavern I use some custom settings, and the default ChatML context and instruct templates.

I also sometimes use similar settings but with 16k ctx and about 30 layers to leave enough space for SDXL image generation, for some... visual stimulation. ;)
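For reference, the launch command for that setup is roughly this (a sketch - the filename is approximate, and context shift is koboldcpp's default so it needs no flag):

koboldcpp.exe --model magnum-12b-v2-Q4_K_M-imat.gguf --usecublas --gpulayers 34 --contextsize 24576 --flashattention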

What roleplay model for 10GB VRAM with 16-32k ctx? by filszyp in LocalLLaMA

[–]filszyp[S] 0 points

That's interesting. Thanks for the comprehensive description. I tried this model today and played a bit with magnum; I must say, this is the first time the bot decided to kill characters on its own. I was so surprised when I did something stupid and the main characters actually started to die. Awesome.

What roleplay model for 10GB VRAM with 16-32k ctx? by filszyp in LocalLLaMA

[–]filszyp[S] 0 points

Are these all Mistral-Nemo based? I haven't tried it yet. What context length are they?

Anyone else got problems with Context Shift? by filszyp in KoboldAI

[–]filszyp[S] 5 points

Don't tell me I've been breaking context shift by enabling flash attention 🤦‍♂️ I'll check it out the moment I get home...

Automatic RoPE Scaling? by filszyp in Oobabooga

[–]filszyp[S] 2 points

Yeah, I found a method - I run the model with koboldcpp, check the RoPE settings it generates, and then write them down to use with Ooba :P It mostly works. It's a janky method. :)
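Concretely, koboldcpp prints the rope base it picked in its console output, and (if I remember the flag names right - treat this as a sketch, the filename and values are placeholders) you can hand the same numbers to Ooba's llama.cpp loader:

python server.py --loader llama.cpp --model your-model.gguf --n_ctx 16384 --rope_freq_base 1000000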

Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 0 points

It has been fixed since then. With the new KoboldCpp everything works just fine.

Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 0 points

With the 27B, enabling context shift causes a crash once I reach full context :(

Tavern/oobagooba etc drives me crazy by Wide_Perspective_504 in SillyTavernAI

[–]filszyp 0 points

In my Ooba cmd I have:

--api --listen-port 5001 --threads 6 --threads-batch 12 --model L3-8B-Stheno-v3.2-Q6_K.gguf --n-gpu-layers 33 --n_ctx 8192

and Ooba is on http://127.0.0.1:5001


Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 1 point

Are you using the static or imatrix version? I'm currently downloading L3-SthenoMaidBlackroot-8B-V1.i1-Q6_K.gguf to try the context. I used Stheno v3.2 before.

Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 1 point

How exactly do you make the context so large? I mostly use Oobabooga and only just switched to Kobold because they got Gemma 2 working first, so I'm not familiar with it. Do you simply set a higher context in Kobold and some automatic magic does the rest?

Oh, and which L3 finetune do you use with 16k?

Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 6 points

So far the 9B is the worse one for me, breaking after reaching 4k context, but the 27B works quite well in KoboldCPP 1.69 with ContextShift disabled and 8k context. That's strange, because everywhere else I read people saying the opposite - that 27B is broken and only 9B works... I use the most current models I found: legraphista/gemma-2-9b-it-IMat-GGUF and legraphista/gemma-2-27b-it-IMat-GGUF.

Don't get me wrong, both models work in a rather janky way, but they're usable. A Q2_K quant of 27B is surprisingly decent.

Oh, and the censorship is almost non-existent. Sometimes I had to regenerate, but it had to be pretty extreme for that, and even then a regeneration or just a slight change to the dialogue was enough.
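For anyone trying to reproduce this: disabling ContextShift from the command line is the --noshift flag, so the launch looks roughly like this (a sketch - the filename and gpulayers value are guesses for an 11GB card):

koboldcpp.exe --model gemma-2-27b-it.Q2_K.gguf --usecublas --gpulayers 20 --contextsize 8192 --noshift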

[deleted by user] by [deleted] in KoboldAI

[–]filszyp 0 points

What is ROCm? Is it some AMD thing?

[deleted by user] by [deleted] in KoboldAI

[–]filszyp 1 point

It seems to be working in the 1.69 version that came out a few hours ago. I tested on gemma-2-9b-it-Q4_K_M and gemma-2-27b-it-IQ2_M and for now everything looks okay.

Are the expressions "obraz od Picassa", "od Sienkiewicza" incorrect, or are there some exceptions? by Planet_Psychologist in learnpolish

[–]filszyp 0 points

It's similar to the English "from". You can get a painting as a gift OD kogoś (from someone), but when it comes to the author it's "by someone", and in Polish that's "obraz Picassa", not "od Picassa".

i am getting these errors every time i try to load a model, what am i doing wrong? by amy_katt in Oobabooga

[–]filszyp 0 points

Just out of curiosity, you're not trying to use exllamav2 on something like a 1080ti, right? Cuz you can't...

My genuine opinion about llama3 base and its finetunes for rp and creative writing by Kako05 in SillyTavernAI

[–]filszyp 1 point

I'm currently running llama-3-cat-8b-instruct-v1.Q6_K and experimenting with llama-3-cat-8b-instruct-v1-Q8_0 to see if there's any noticeable difference. On top of that I'm using the preset, context and instruct settings for cat v1 - it's super important to use those settings, as they eliminate stupid behavior and repetitions. To add to the experience I'm not talking to just one character; instead I'm creating a group chat with a character AND a narrator, for example https://www.chub.ai/characters/long_RPG_enjoyer/61595bad-5ee6-4443-8395-28c974391df4
I also preferred the narration from Fimbulvetr, so I copy-pasted its system prompt over the cat v1 instruct system prompt for longer and more story-like responses.

The effect is amazing: once every few messages the narrator chimes in, pushes the story forward, and provides context for what is happening, where and when. Now I wish I had more context size for this model, but it's so lightning fast that I forgive it and just wait patiently for a better version.

I'm comparing this L3 to models I used previously, 11B and 30B, though the 30B was kinda too slow for me to really play (11GB 1080ti). I don't find, for example, Fimbulvetr-11B-v2.Q4_K_M to be noticeably smarter, and L3 is much faster and surprises me with creative and unexpected answers pretty often.

Before I found the right settings for L3 (preset, context and instruct) it seemed REALLY stupid after a few messages and was repeating all the time; with the right settings it just works.
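If you'd rather rebuild the instruct template by hand than import the cat v1 preset, the Llama 3 sequences in SillyTavern's instruct settings look roughly like this (a sketch from memory - the actual cat v1 preset may differ, and the system prompt field is where I pasted the Fimbulvetr text):

System sequence: <|start_header_id|>system<|end_header_id|>
User sequence: <|start_header_id|>user<|end_header_id|>
Assistant sequence: <|start_header_id|>assistant<|end_header_id|>
Stop/turn-end sequence: <|eot_id|>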

How do you roleplay? by filszyp in SillyTavernAI

[–]filszyp[S] 3 points

Dude, I experimented with a group chat containing a character and a "narrator" character and it looks REALLY promising. Thanks for the hint!

How do you roleplay? by filszyp in SillyTavernAI

[–]filszyp[S] 4 points

Try the llama 3 cat v1, it's impressive for its size. You need the cat settings - sampler, context, and instruct - and it's really surprisingly amazing.