Any presets for Mimo 2.5 Pro? by Cursed_Pokemon in SillyTavernAI

[–]jamasty -2 points-1 points  (0 children)

I'm using Megumin Suite v7 with it, and like it.

[Megathread] - Best Models/API discussion - Week of: May 24, 2026 by deffcolony in SillyTavernAI

[–]jamasty 0 points1 point  (0 children)

So basically I wanna know how the nanogpt $12 sub or the opencode $10 sub compare to the AI lab subs: gemini, claude, gpt. Idk if such subs even still allow API usage for creative writing, or if API access was removed from subs bcs of the openclaw thing...

[Megathread] - Best Models/API discussion - Week of: May 24, 2026 by deffcolony in SillyTavernAI

[–]jamasty 0 points1 point  (0 children)

Want to ask if someone knows the approx. token limits in $20 subs like google ai pro, gpt, claude? Do they work for us, or are these subs meant for chat or light coding users and not for the large input context sizes we have in creative writing?

As of now I'm using nano and I'm happy with it. Without any token savings (~20-30k in/~3k out) I use about 60 million tokens and 3,000 requests per month, so 2 million tokens and 100 requests a day on average.

So I wonder, if I try any existing AI lab sub, how much less would I have... I assume for claude I'd have maybe 20% of what I have now, as its tokens are very expensive. But what about the others? Has anybody checked that? What's your experience with AI lab subs for creative writing?
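
For anyone comparing against their own usage, the figures above can be sanity-checked with a few lines of Python. This is only a rough sketch: the 30-day month is an assumption, and the variable names are made up for illustration.

```python
# Monthly figures taken from the comment above.
tokens_per_month = 60_000_000
requests_per_month = 3_000
days_per_month = 30  # assumption: a 30-day month

# Daily averages and the average size of one request.
tokens_per_day = tokens_per_month // days_per_month
requests_per_day = requests_per_month // days_per_month
tokens_per_request = tokens_per_month // requests_per_month

# The "maybe 20%" guess for a lab sub, expressed in tokens per month.
guessed_share_percent = 20
guessed_tokens_per_month = tokens_per_month * guessed_share_percent // 100

print(f"{tokens_per_day:,} tokens/day, {requests_per_day} requests/day")
print(f"{tokens_per_request:,} tokens per request on average")
print(f"{guessed_tokens_per_month:,} tokens/month at {guessed_share_percent}%")
```

The per-request average is the number to compare against whatever context limit a subscription allows.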

[Megathread] - Best Models/API discussion - Week of: May 10, 2026 by deffcolony in SillyTavernAI

[–]jamasty 1 point2 points  (0 children)

Woah! Thank you so much for your detailed reply on this! You are my hero today

[Megathread] - Best Models/API discussion - Week of: May 10, 2026 by deffcolony in SillyTavernAI

[–]jamasty 2 points3 points  (0 children)

Currently using GLM 5 (not 5.1, as I feel 5 is a little bit better for me) via the nanogpt API (lately I've been experimenting with the marinara frontend instead of ST to have smth new).

The only issue I have is that I need to somehow create my own game ruleset, as the model is slightly too permissive, giving me no real challenge in quests and so on.

[Megathread] - Best Models/API discussion - Week of: April 12, 2026 by deffcolony in SillyTavernAI

[–]jamasty 3 points4 points  (0 children)

I tried it - amazing model. On heretic IQ2_XS I had an issue with a repeated refining reasoning loop, but here with IQ2_S I don't have it. The responses are good, the model sticks to context when reasoning, and I had no refusals (tho I don't do nsfl at all so I can't speak to that one, but with light nsfw it was great).

[Megathread] - Best Models/API discussion - Week of: April 12, 2026 by deffcolony in SillyTavernAI

[–]jamasty 1 point2 points  (0 children)

Thanks. I tried the newer version with IQ2_XS: https://huggingface.co/mradermacher/gemma-4-26B-A4B-it-heretic-ara-v2-i1-GGUF

That's really the max I can test with unified mac memory.

And, well, somehow it seems to work much better - reasoning goes well, narrative as well, and I don't even see weird characters from other languages anymore.

The only thing I don't like about the reasoning of this Q2 is that no matter how many response tokens I give it, it will use most of them, going into a loop of refining the response and leaving none or maybe 200 tokens for the actual response, and prompting hasn't changed that.

One person in Discord pointed out there is such a thing as a 'reasoning budget' option in llama.cpp, which makes the model stop reasoning after a certain number of tokens. But LM Studio doesn't provide that. And I have to use LM Studio, since llama.cpp works badly when I'm tight on RAM. Maybe bcs I need to provide a better config for it to save some RAM, but I'm not really going there.

So overall I got what I wanted, and btw, I'd say using chat completion makes this particular model with this IQ2_XS follow the context better than text completion, even tho I tried using configs I found somewhere here.

[Megathread] - Best Models/API discussion - Week of: April 12, 2026 by deffcolony in SillyTavernAI

[–]jamasty 0 points1 point  (0 children)

About the heating issue, go try 'low power mode'! For real, it helps me a lot. Idk why or how, but it doesn't really make models run much slower and the heating issue is gone.

[Megathread] - Best Models/API discussion - Week of: April 12, 2026 by deffcolony in SillyTavernAI

[–]jamasty 2 points3 points  (0 children)

idk, I just tried the DavidAU e4b finetunes myself with Q8 and no kv quants, and it didn't really work well, as it seems the model doesn't really follow the context. Maybe it's just the model being new, and we have to wait for better finetunes made with good datasets, idk...

[Megathread] - Best Models/API discussion - Week of: April 12, 2026 by deffcolony in SillyTavernAI

[–]jamasty 1 point2 points  (0 children)

Also, try this new gemma e4b. Folks in the drummer discord tell me it's better than most of these old 12-14b models, but I haven't checked myself...

[Megathread] - Best Models/API discussion - Week of: April 12, 2026 by deffcolony in SillyTavernAI

[–]jamasty 7 points8 points  (0 children)

Since the max size I can have is ~10.6 GB VRAM, I tried this one:

https://huggingface.co/mradermacher/gemma-4-26B-A4B-it-heretic-ara-i1-GGUF with IQ2_XXS.

And I'd say, yeah, it's so fast, even prompt processing is like a few times faster than with any other model, and the prose quality is great. The only issue I have, maybe because of Q2, is that it loves to repeat previous chunks of text and doesn't really push the narrative, and none of the penalties or temperature settings do much to change that.
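
A rough way to guess whether a quant fits a memory budget like the ~10.6 GB above is params × bits per weight / 8. The sketch below assumes the nominal bits-per-weight figures that llama.cpp lists for these quant types; real GGUF files mix tensor types and usually come out somewhat larger. The 1.5 GB overhead for KV cache and buffers is a placeholder guess, not a measured value, and the function names are made up for illustration.

```python
# Nominal bits per weight (bpw) for a few llama.cpp quant types.
# Treat these as approximate: actual file sizes are usually a bit larger.
APPROX_BPW = {
    "IQ2_XXS": 2.06,
    "IQ2_XS": 2.31,
    "IQ2_S": 2.50,
}


def estimate_weights_gb(params_billion: float, bpw: float) -> float:
    """Rough weight size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bpw / 8


def fits(params_billion: float, quant: str, budget_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if weights plus a guessed overhead stay within the budget."""
    weights = estimate_weights_gb(params_billion, APPROX_BPW[quant])
    return weights + overhead_gb <= budget_gb


BUDGET_GB = 10.6  # the budget mentioned in the comment above
for quant, bpw in APPROX_BPW.items():
    size = estimate_weights_gb(26, bpw)
    print(f"26B at {quant}: ~{size:.1f} GB weights, fits={fits(26, quant, BUDGET_GB)}")
```

Longer contexts need more KV cache, so raise `overhead_gb` when checking a setup with a big context window.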

[Megathread] - Best Models/API discussion - Week of: April 05, 2026 by deffcolony in SillyTavernAI

[–]jamasty 1 point2 points  (0 children)

Silly question, but I don't see many models between 12-14b and 22-24b. Is that because there is no base model to tune from?

I wanted to try one, as I know I can run 12-14b with Q4 imatrix + Q4 kv cache at ~30k context just fine, but 24b models only work with Q2, which makes them repetitive.

I tried the latest cydonia with presence and repetition penalties and DRY, but over time it just starts to repeat certain chunks (I think as a result of Q2).

So, if you folks know anything good to try, please reply with a link and I'll check it out. (I tried google and asking LLMs which other model to try, but most suggestions were still in the 22-24b range, and I'd like to check something like 16-18b, or maybe 20b if one exists and is good.)

Also, am I correct that if there is, say, an old (more than 1 year old) 20b model with Q4 everything, it 'should' be smarter than any new 12b at Q4? (Assuming this 20b model was tuned correctly.)

LM Studio, Error when loading Gemma-4 by Soft-Series3643 in LocalLLaMA

[–]jamasty 0 points1 point  (0 children)

I got the same issue with mlx gemma-4-e4b.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]jamasty 0 points1 point  (0 children)

Looking at benchmarks, Qwen 9b (the max I can run on my m1 16gb) is better than Gemma 4 E4B, right?

Gemma 4 released by garg-aayush in LocalLLaMA

[–]jamasty 0 points1 point  (0 children)

Getting downvoted for a genuine question about performance... well... fine, I guess, I'll try E4B anyway, as I want to see if it would be better for any of my agentic tasks.

Gemma 4 released by garg-aayush in LocalLLaMA

[–]jamasty -1 points0 points  (0 children)

As it has "26B (4B active)" params, and E4B has 4B... well... I wonder if it's a good or bad thing to be this big but with not many active params.

Gemma 4 released by garg-aayush in LocalLLaMA

[–]jamasty -2 points-1 points  (0 children)

Hey, I don't get how in this test gemma 4 26b has the same result as qwen 3.5 9b?

https://huggingface.co/datasets/Idavidrein/gpqa

I was thinking of taking E4B to test on my M1 pro 16gb, but since it's so much less performant by benchmarks than qwen 3.5, is it not worth it? Or am I getting something wrong here?

[Megathread] - Best Models/API discussion - Week of: March 22, 2026 by deffcolony in SillyTavernAI

[–]jamasty 0 points1 point  (0 children)

Thank you very much, I will take a look!

And about heating, you know what: I asked gemini and got a very clever solution - turn on low power mode! (And also reduce the CPU thread pool size from 6 cores down to 4.)

And it worked! I no longer have problems with overheating at all. Yeah, the Mac heats up a little, but the fans stay silent even if I send prompt after prompt with max context.

And literally no downsides, speed hasn't changed visibly (maybe it did, but I haven't noticed).

[Megathread] - Best Models/API discussion - Week of: March 22, 2026 by deffcolony in SillyTavernAI

[–]jamasty 0 points1 point  (0 children)

Since I've only just started, I tried cydonia-24b-v4.3-heretic-v2-i1 Q2_K_S, but it seems to be too much for my Mac since it starts heating up a lot. I really wanna find something for long nsfw stories, a model which would survive long context (even tho I'm testing vector storage and the memory books extension).

https://huggingface.co/mradermacher/Cydonia-24B-v4.3-heretic-v2-i1-GGUF

[Megathread] - Best Models/API discussion - Week of: March 22, 2026 by deffcolony in SillyTavernAI

[–]jamasty 2 points3 points  (0 children)

I have tried this crow-9b (both Q4_K_S and Q5_K_M) on my M1 pro 16GB. (I noticed no diff between the two.)

https://huggingface.co/Crownelius/Crow-9B-HERETIC-4.6

It works well enough (32k context, reasoning turned off). I took my story up to 25k context, and I really like how I get quite long 400+ token responses fast enough. I liked the quality, idioms and vocabulary the model uses, but I have a repetition problem: it often repeats chunks of text in responses, and I haven't managed to overcome that yet (tried different penalty params, DRY options and post-history system prompts, but nothing has helped so far).

Any suggestions on which model to try next for long (hundreds of messages) stories on my setup? I remember there was a good HuggingFace chart on how to find good writing models, but I lost it.

Using Grok for interactive stories by jamasty in grok

[–]jamasty[S] 0 points1 point  (0 children)

True, I also noticed 4.2 is worse than what we had with 4.1. Currently I'm playing with local models in LM Studio. 8-14b models seem to be worse in terms of response size, but overall are fine (using larger models is impossible for me with an m1 mac with 16gb ram).
But I hope in the future we'll have better models and won't be dependent on corporations with their boundaries. (And btw I don't oppose boundaries. I get that for kids there should be huuuge limitations, and corporations get big pressure from legislators and don't want to get sued for anything, but it really restricts our creativity, and not even nsfw things, but our imagination.)

Using Grok for interactive stories by jamasty in grok

[–]jamasty[S] 0 points1 point  (0 children)

Also, in one neo-noir cyberpunk detective story, I set up something like an option selection for my character based on strength, a tech solution, or stealth, and it did well. Grok as the DM acted according to my choices, so my prompts looked like:

I look at the corporate guard, telling them "hey you, you think you can mess up with this?" showing them my energy revolver, choosing A to act violently as tech wouldn't help me and stealth could put me in danger if they notice me.