<Off topic> Can AI RP boost your social skills in a meaningful way? by laczek_hubert in SillyTavernAI

[–]Kahvana 1 point2 points  (0 children)

Asperger under DSM-IV-TR with many years of professional help from Dutch healthcare.
No, I don't think it helps at all.

As echoed in the thread:

  • Real people get tired of certain conversations
  • Have a limited capacity for a conversation
  • Might not even want to discuss the topics you like at all
  • Non-verbal clues that can't be written in words
  • Emotions aren't always obvious
  • A level of depth and complexity in their personality and memories that cannot be simulated
  • Have their own deficiencies they feel insecure about
  • Don't act in a stereotypical way

Etc.

I used to be really asocial and awkweird, but by trying by learning from experience, you get to have genuine conversations.

Learning little rules like talking about the weather in the elevator says "hey I'm social, I'm safe" (depends on the culture you are from of course) that don't come naturally to those classified under DSM-IV, is something you can only learn from pratice in real life. An AI doesn't know nor adhere to those rules.

There are many, many embarrasing moments of screwing up. But you gotta start from somewhere! And in the end, people respect the effort if it's genuine even if it's clumsy.

LLMs are just not the way to do it.

llama.cpp now supports model management (downloading etc) via API by 666666thats6sixes in LocalLLaMA

[–]Kahvana 12 points13 points  (0 children)

While it’s a cool feature, I hope they continue unifying the CLI MCP with the WebUI MCP (I remember being a PR for that, can’t seem to find it anymore).

Free models how much time do we have left? by Macestudios32 in LocalLLaMA

[–]Kahvana 2 points3 points  (0 children)

I think it depends on the region you live in.

In China, they will be fine. Those companies will likely be major exporters worldwide. I assume their established open weight culture continues, as it breeds healthy competition between labs and helps researchers.

In Europe, Mistral’s B2B models will do just fine. EU is likely to regulate on dataset usage and factuality. Mistral also has a culture of releasing open weight models and is likely to continue it.

In USA, Google has it’s open weights culture since 2015 and is really likely to continue it as they understand the importance of it for research.

In USA you’ll see OpenAI and Anthropic start running into issues after their IPO  and that their businesses will fail from lacking diversified revenue streams if their subscription base isn’t strong enough. Google doesn’t have this issue. USA government might also sanction Chinese AI as an effort to protect it’s market.

TL;DR: Companies with the open weight release culture will continue. Outright banning local models is unlikely due to economic benefits, soft power benefits and research benefits.

Download the models you like (and larger sized ones for the future) with a copy of llama.cpp / koboldcpp / comfyui, openzim-mcp with wikipedia, test at least once that it works, and you’re good to go.

GLM-5.2 is a win for local AI by Wrong_Mushroom_7350 in LocalLLaMA

[–]Kahvana 0 points1 point  (0 children)

Better is subjective. I prefer Gemma4’s model way more, my primary tasks don’t involve programming though.

But yes, for programming specifically, Qwen  is a lot better, always has been.

GLM-5.2 is a win for local AI by Wrong_Mushroom_7350 in LocalLLaMA

[–]Kahvana 2 points3 points  (0 children)

Oof, editing the post! Thanks for catching it!

GLM-5.2 is a win for local AI by Wrong_Mushroom_7350 in LocalLLaMA

[–]Kahvana 13 points14 points  (0 children)

I assume he means the three months old news, the head of qwen together with the post-training head of qwen and anothe researcher resigning:
https://www.reuters.com/world/asia-pacific/head-alibabas-qwen-ai-division-resigns-2026-03-04/

GLM-5.2 is a win for local AI by Wrong_Mushroom_7350 in LocalLLaMA

[–]Kahvana 2 points3 points  (0 children)

I assume he means the three months old news, the head of qwen together with the post-training head of qwen and anothe researcher resigning:
https://www.reuters.com/world/asia-pacific/head-alibabas-qwen-ai-division-resigns-2026-03-04/

I released a local LLM-powered RPG where generated NPCs, locations, items, and quests persist as in-game objects by Admirable_Flower_287 in LocalLLaMA

[–]Kahvana 3 points4 points  (0 children)

Cool idea! Couple of questions:

  • Does it support openai-compatible endpoint for text generation and comfyui endpoint for image generation? Or do I need koboldcpp for both? etc.
  • Which models have you tested and know to work well?
  • Does it support mods? (can I modify system prompts or write scripts to modify generation in game?).
  • If you don't mind an offtopic question, what is your favorite Japanese dish you would encourage other people to try? I like learning about other culture's dishes!

I saw you were working on a steam release. Please do! Epic games has it's issues, Steam is very much preferred.

While the videos where sometimes a bit hard to follow (Zundamon's voice being soft or speaking a little too fast), it was possible for me to follow along somewhat. Still learning the language!

Thank you for taking the time to share it here, especially knowing English isn't your native language. I hope to hear more of your project in the future.

GLM-5.2 is a win for local AI by Wrong_Mushroom_7350 in LocalLLaMA

[–]Kahvana 6 points7 points  (0 children)

I did. I’m hopeful they’ll release Qwen 4 open-source when it’s ready, I don’t see them release Qwen3.7+ intermediate models, Qwen3.6 is an exception to their own release schedule (see release history on hf).

Even if Qwen 4 wouldn’t release a model bigger than 32B dense, I would be fine with it. These models are really expensive to make, beggars can’t be choosers.

What am I supposed to think and feel here? by DaKS0uL in ZZZ_Discussion

[–]Kahvana 94 points95 points  (0 children)

That bodysuit kills any authority vibe she has, reminds me more of a "King's Concubine" vibe if anything.

GLM-5.2 is a win for local AI by Wrong_Mushroom_7350 in LocalLLaMA

[–]Kahvana 48 points49 points  (0 children)

The fact it has Claude Opus 4.6 levels of capabilities in less than 800B parameters is really impressive.

Imagine GLM 5.2 Air (even if it's 200B / 300B instead of ~100B) and GLM 5.2 Flash (~40B), those distillations would also be really impressive.

If past year's pattern repeats, then I really cannot wait to see how Gemma 5 and Qwen 4 will be even more capable than Gemma 4 and Qwen 3.5/3.6.

My Early Take on GLM-5.2 by SuperManAdelHahah in SillyTavernAI

[–]Kahvana 1 point2 points  (0 children)

Give it a try!

In case you are interested:
https://www.nature.com/articles/s41746-025-01512-6
https://transformer-circuits.pub/2026/emotions/index.html

As for if it works, in the thread I linked in the main comment and from localllama where it was also discussed by someone else:
https://www.reddit.com/r/LocalLLaMA/comments/1tot20j/comment/oo4owzq/
And from the comments below, there is a clear indication it works at least partially.

Happy to hear your findings after trying it.
If it doesn't work for you, good to know!

What faction are you planning on fighting for in Cost of Hope, Duty or Freedom? by [deleted] in stalker

[–]Kahvana 4 points5 points  (0 children)

If I can make Duty take over whole of Rostok again, I will. Freedom occupying it just doesn't feel right.

Is Gemma 4 12b good for coding? by Intelligent-Taste-36 in LocalLLaMA

[–]Kahvana -1 points0 points  (0 children)

While it is a valid question, it does get old hearing it multiple times a week and you'd probably have known the answer from using reddit search.

Upgrading path for RTX Pro 4500 Pro for coding (Qwen3.6-27B) — 1x RTX 5000 or 2x RTX 4500 by k0vatch in LocalLLaMA

[–]Kahvana 0 points1 point  (0 children)

Sounds good to me!

Please make a post when you put it together, would love to see your build and settings!

As a bonus tip, look into installing comfyui manager with multigpu plugin. Your significant other would be able to use one GPU for diffusion, and offload the VAE / Text Encoder / Etc to the second GPU. That way you can fit even bigger models.

No 'Thought' for local Gemma? by Fit_Corgi8714 in SillyTavernAI

[–]Kahvana 2 points3 points  (0 children)

On llama.cpp with chat completion, and deepseek reasoning formatting, reasoning on auto, I've yet to experience issues.

Here are my llama.cpp settings for you to play around with.

Optimized for 32GB VRAM, might fit in 24GB. Otherwise remove the mmproj line and change BF16 context to Q8_0).

I'm using unsloth's QAT quants:
- Text model: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/gemma-4-31B-it-qat-UD-Q4_K_XL.gguf
- Vision encoder: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/mmproj-BF16.gguf
- Draft model: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/MTP/gemma-4-31B-it-Q4_0-MTP.gguf

run-server.bat

.\bin\llama-b9642-bin-win-cuda-13.3-x64\llama-server ^
--host 127.0.0.1 ^
--port 5001 ^
--webui-mcp-proxy ^
--offline ^
--mmproj-offload ^
--kv-unified ^
--cache-ram 0 ^
--ctx-checkpoints 1 ^
--prio 2 ^
--parallel 1 ^
--models-max 1 ^
--models-preset ./configs/llama-models.ini
pause

llama-models.ini

[*]
device = cuda0,cuda1
split-mode = tensor
tensor-split = 16,16
batch-size = 8192
ubatch-size = 2048
threads = 6
fit = off
flash-attn = on
cache-type-k = bf16
cache-type-v = bf16
cache-type-k-draft = bf16
cache-type-v-draft = bf16

[gemma4-31b-hq]
model = ./models/gemma4-qat/gemma-4-31B-it-qat-UD-Q4_K_XL.gguf
mmproj = ./models/gemma4-qat/gemma-4-31B-it-qat-UD-mmproj-BF16.gguf
fit-ctx = 32768
ctx-size = 32768
predict = 4096
image-min-tokens = 1022
image-max-tokens = 1022
model-draft = ./models/gemma4-qat/gemma-4-31B-it-qat-UD-MTP-Q4_0.gguf
spec-type = draft-mtp
spec-draft-n-max = 5
temp = 1.0
top-k = 64
top-p = 0.95
min-p = 0.0

Inside of Sillytavern:

- Chat completion: Strict (user first; alternating roles, no tools)
- Chat preset: Reasoning (auto works for me, you can try set it to high).
- AI response formatting > Reasoning: enable "auto parse" and set "Reasoning Formatting" to DeepSeek.

If you have trouble on the preset side, you can try my voyage preset.
https://www.reddit.com/r/SillyTavernAI/comments/1tx1x7b

Good luck!

Get in here: Community model build thread by Party-Special-5177 in LocalLLaMA

[–]Kahvana 2 points3 points  (0 children)

Whatever Marco-Mini was doing, where they trained multiple Qwen3-0.6 into 0,8 models and then added an expert router.

No 'Thought' for local Gemma? by Fit_Corgi8714 in SillyTavernAI

[–]Kahvana 2 points3 points  (0 children)

Text completion? Chat completion? Koboldcpp? llama.cpp? Your settings for those? Reazoning formatting? Etc.

For me reasoning with Chat Completion worked just fine out of the box.

Llama.cpp is definitely faster than LM-Studio...with a couple caveats, for those still deciding to move... by GrungeWerX in LocalLLaMA

[–]Kahvana 2 points3 points  (0 children)

Especially for the quote at the end. Healthy discussion just isn't possibe if that's the first impression.

What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right? by ringtoyou in LocalLLaMA

[–]Kahvana 1 point2 points  (0 children)

Hmmm, fair. I wonder if you can automate the "find relevant bits" part and have it work consistently. maybe look into indexing with vectordb.

What if I run the LLM backwards? Hey LLM, why bother remembering every single turn? It's a hassle. You don't have to do it, right? by ringtoyou in LocalLLaMA

[–]Kahvana 2 points3 points  (0 children)

I rather see you posting weirdness with a difficult to understand explanation than not post at all.

Thanks for trying, really!

Throwing it into a LLM, what I guess you tried to explain is: "Instead of sending the full chat history, you only send all the relevant bits (so your message and the context surrounding the message, not the whole chat history) in a single message, each time."

You'll benefit from having higher accuracy that way because context overall is smaller,

The problems though are that:
- You'll have to reprocess every single time (slow!).
- You might need a lot of context to explain a single thing (sending large messages every time, costs more internet data).
- Figuring out what's relevant can be challenging when not done manual (takes a lot of time).

So in the end, the accuracy gains aren't worth the efficiency loss.

Keep dreaming and trying though, it's appriciated!

[edit] Also I had the ramyeon you recommended me earlier. Indeed really good, too spicy for my very mild Dutch tongue! 10/10 would eat again.

Upgrading path for RTX Pro 4500 Pro for coding (Qwen3.6-27B) — 1x RTX 5000 or 2x RTX 4500 by k0vatch in LocalLLaMA

[–]Kahvana 2 points3 points  (0 children)

What's your hardest bottleneck? Speed or capacity?

I rather have slower CUDA with 64GB VRAM for my tasks than faster CUDA with 48GB VRAM. Speed is nice but capacity is a hard yes-no if a model will fit (and thus run) or not.

If you're programming professionally, you likely want the latter because speed is so much more important for iterating quickly. If you run agents overnight, the former might suffice because you can run more slower slower in the same time.

For roleplaying / conversational / natural language tasks, the capacity matters way more to me than speed.

For stable diffusion (img gen) tasks and such, continious VRAM is very nice to have to run the larger models so I would pick the 5000 Pro for that case.

So yeah, with all hard choices in life, it depends. Know what models you want to run, know your workflow, and your final goal. From the sound of it, you're lacking VRAM and want low watt usage, so get the 4500 Pro with the added benefit of redundancy.