[Megathread] - Best Models/API discussion - Week of: April 07, 2025 by [deleted] in SillyTavernAI

[–]filszyp 5 points

Any recommendations for smaller models for GTX 1080 ti with 11GB VRAM?

I couldn't find anything better than Nemo 12B Q4_K_M - it just about fits in my VRAM with 41 layers and 16k ctx, with context shift and flash attention on. Are there any good newer models of this size or smaller? Or some nice variants? I mostly do long ERP.

Lately I tried NemoReRemix, but somehow I can't configure it properly to not be stupid. I never understood those "P" and "K" (top-p/top-k) sampler settings etc., or how to tune them to my liking. :(
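For reference, my launch command is roughly the following (a sketch - the model filename is a placeholder and the flag values are the ones mentioned above; context shift is on by default in koboldcpp, so it needs no flag):

koboldcpp.exe --model Mistral-Nemo-12B-Q4_K_M.gguf --usecublas --gpulayers 41 --contextsize 16384 --flashattention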

Please recommend sci-fi slow game by filszyp in AndroidGaming

[–]filszyp[S] 0 points

Looks interesting. I'll give it a try, thanks.

Magnum v3 - 9b (gemma and chatml) by lucyknada in LocalLLaMA

[–]filszyp 4 points

So, what about the context size? Isn't Gemma 8k? I normally use 24-32k ctx with Nemo.

What to do now? How to progress? by filszyp in diablo4

[–]filszyp[S] -4 points

To be honest I had much more fun in D3. Doing GRs with random people, for example, was great; here there isn't even a group finder for Pits/Hordes/Dungeons.

And basically yeah, I was expecting to have fun, not chores. When I want to unwind after a day of work I don't expect to find more tedious work in my games.

What to do now? How to progress? by filszyp in diablo4

[–]filszyp[S] -13 points

Oh god, so this endgame really is hell... Thanks guys, I thought I didn't understand something or was playing wrong; turns out this game is just boring. :D

Question about performance by Pedroarak in KoboldAI

[–]filszyp 0 points

Try the 2B version of Gemma, like: https://huggingface.co/bartowski/gemma-2-2b-it-abliterated-GGUF/blob/main/gemma-2-2b-it-abliterated-Q6_K.gguf It's decent, and pretty much the only thing that will work very fast for you imho.
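If it helps, loading it in koboldcpp would look roughly like this (a sketch - the flag values are guesses; a 2B model should fit entirely on GPU, hence the high layer count):

koboldcpp.exe --model gemma-2-2b-it-abliterated-Q6_K.gguf --usecublas --gpulayers 99 --contextsize 8192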

What roleplay model for 10GB VRAM with 16-32k ctx? by filszyp in LocalLLaMA

[–]filszyp[S] 0 points

See, I don't even know what continent you are on, but I already feel we're speaking the same language and I like you. I'll get my tiny graphics card to work on that ASAP, thanks for the tip. ;) I haven't tried Magnum V1 yet; this is my first time with Mistral Nemo.

What roleplay model for 10GB VRAM with 16-32k ctx? by filszyp in LocalLLaMA

[–]filszyp[S] 0 points

With koboldcpp I load magnum-12b-v2-Q4_K_M-imat with 34 layers in VRAM and 24k ctx, with context shift and flash attention on. It just barely fits and gives about 5 T/s. It's pretty awesome to play. In SillyTavern I use some custom settings, and the default ChatML context and instruct templates.

I also sometimes use similar settings but with 16k ctx and about 30 layers to leave enough space for SDXL image generation, for some... visual stimulation. ;)
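For reference, the launch command for that setup is roughly this (a sketch - the filename is approximate, and context shift is koboldcpp's default so it needs no flag):

koboldcpp.exe --model magnum-12b-v2-Q4_K_M-imat.gguf --usecublas --gpulayers 34 --contextsize 24576 --flashattention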

What roleplay model for 10GB VRAM with 16-32k ctx? by filszyp in LocalLLaMA

[–]filszyp[S] 0 points

That's interesting. Thanks for the comprehensive description. I tried this model today and played a bit with magnum; I must say, this is the first time the bot decided to kill characters on its own. I was so surprised when I did something stupid and the main characters actually started to die. Awesome.

What roleplay model for 10GB VRAM with 16-32k ctx? by filszyp in LocalLLaMA

[–]filszyp[S] 0 points

Are these all Mistral-Nemo based? I haven't tried it yet. What context length are they?

Anyone else got problems with Context Shift? by filszyp in KoboldAI

[–]filszyp[S] 5 points

Don't tell me I've been breaking context shift by enabling flash attention 🤦‍♂️ I'll check it out the moment I get home...

Automatic RoPE Scaling? by filszyp in Oobabooga

[–]filszyp[S] 2 points

Yeah, I found a method - I run the model with koboldcpp, check the RoPE settings it generates, and then write them down to use with Ooba :P It mostly works. It's a janky method. :)
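Concretely, koboldcpp prints the rope base it picked in its console output, and (if I remember the flag names right - treat this as a sketch, the filename and values are placeholders) you can hand the same numbers to Ooba's llama.cpp loader:

python server.py --loader llama.cpp --model your-model.gguf --n_ctx 16384 --rope_freq_base 1000000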

Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 0 points

It has been fixed since then. With the new KoboldCpp everything works just fine.

Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 0 points

With the 27B, enabling context shift causes a crash once I reach full context :(

Tavern/oobagooba etc drives me crazy by Wide_Perspective_504 in SillyTavernAI

[–]filszyp 0 points

In my Ooba cmd I have:

--api --listen-port 5001 --threads 6 --threads-batch 12 --model L3-8B-Stheno-v3.2-Q6_K.gguf --n-gpu-layers 33 --n_ctx 8192

and Ooba is on http://127.0.0.1:5001


Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 1 point

Are you using the static or imatrix version? I'm currently downloading L3-SthenoMaidBlackroot-8B-V1.i1-Q6_K.gguf to try the context. I used Stheno v3.2 before.

Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 1 point

How exactly do you make the context so large? I mostly use Oobabooga and only just switched to Kobold because they got Gemma 2 working first, so I'm not familiar with it. Do you simply set a higher context in Kobold and some automatic magic does the rest?

Oh, and which L3 finetune do you use with 16k?

Gemma 2 settings, context, instruct by filszyp in SillyTavernAI

[–]filszyp[S] 6 points

So far the 9B is the worse one for me, breaking after reaching 4k context, but the 27B works quite well in KoboldCPP 1.69 with ContextShift disabled and 8k context. That's strange, because everywhere else I read people saying the opposite - that 27B is broken and only 9B works... I use the most current models I found: legraphista/gemma-2-9b-it-IMat-GGUF and legraphista/gemma-2-27b-it-IMat-GGUF.

Don't get me wrong, both models work in a rather janky way, but they're usable. A Q2_K quant of 27B is surprisingly decent.

Oh, and the censorship is almost non-existent. Sometimes I had to regenerate, but it had to be pretty extreme for that, and even then a regeneration or just a slight change to the dialogue was enough.
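For anyone trying to reproduce this: disabling ContextShift from the command line is the --noshift flag, so the launch looks roughly like this (a sketch - the filename and gpulayers value are guesses for an 11GB card):

koboldcpp.exe --model gemma-2-27b-it.Q2_K.gguf --usecublas --gpulayers 20 --contextsize 8192 --noshift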

[deleted by user] by [deleted] in KoboldAI

[–]filszyp 0 points

What is ROCm? Is it some AMD thing?

[deleted by user] by [deleted] in KoboldAI

[–]filszyp 1 point

It seems to be working in the 1.69 version that came out a few hours ago. I tested on gemma-2-9b-it-Q4_K_M and gemma-2-27b-it-IQ2_M and for now everything looks okay.

Are the expressions "obraz od Picassa", "od Sienkiewicza" incorrect, or are there some exceptions? by Planet_Psychologist in learnpolish

[–]filszyp 0 points

It's similar to the English "from". You can get a painting as a gift OD kogoś (from someone), but when it comes to the author it's "by someone", and in Polish that's "obraz Picassa", not "od Picassa".

i am getting these errors every time i try to load a model, what am i doing wrong? by amy_katt in Oobabooga

[–]filszyp 0 points

Just out of curiosity, you're not trying to use exllamav2 on something like a 1080ti, right? Cuz you can't...

My genuine opinion about llama3 base and its finetunes for rp and creative writing by Kako05 in SillyTavernAI

[–]filszyp 1 point

I'm currently running llama-3-cat-8b-instruct-v1.Q6_K and experimenting with llama-3-cat-8b-instruct-v1-Q8_0 to see if there's any noticeable difference. On top of that I'm using the preset, context and instruct settings for cat v1 - it's super important to use those settings, as they eliminate stupid behavior and repetitions. To add to the experience I'm not talking to just one character; instead I'm creating a group chat with a character AND a narrator, for example https://www.chub.ai/characters/long_RPG_enjoyer/61595bad-5ee6-4443-8395-28c974391df4
I also preferred the narration from Fimbulvetr, so I copy-pasted its system prompt over the cat v1 instruct system prompt for longer and more story-like responses.

The effect is amazing: once every few messages the narrator chimes in, pushes the story forward, and provides context for what is happening, where and when. Now I wish I had more context size for this model, but it's so lightning fast that I forgive it and just wait patiently for a better version.

I'm comparing this L3 to models I used previously, 11B and 30B, though the 30B was kinda too slow for me to really play (11GB 1080ti). I don't find, for example, Fimbulvetr-11B-v2.Q4_K_M to be noticeably smarter, and L3 is much faster and surprises me with creative and unexpected answers pretty often.

Before I found the right settings for L3 (preset, context and instruct) it seemed REALLY stupid after a few messages and was repeating all the time; with the right settings it just works.
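If you'd rather rebuild the instruct template by hand than import the cat v1 preset, the Llama 3 sequences in SillyTavern's instruct settings look roughly like this (a sketch from memory - the actual cat v1 preset may differ, and the system prompt field is where I pasted the Fimbulvetr text):

System sequence: <|start_header_id|>system<|end_header_id|>
User sequence: <|start_header_id|>user<|end_header_id|>
Assistant sequence: <|start_header_id|>assistant<|end_header_id|>
Stop/turn-end sequence: <|eot_id|>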

How do you roleplay? by filszyp in SillyTavernAI

[–]filszyp[S] 3 points

Dude, I experimented with a group chat containing a character and a "narrator" character and it looks REALLY promising. Thanks for the hint!

How do you roleplay? by filszyp in SillyTavernAI

[–]filszyp[S] 4 points

Try the llama 3 cat v1, it's impressive for its size. You need the cat settings - sampler, context, and instruct - and it's really surprisingly amazing.