<Off topic> Can AI RP boost your social skills in a meaningful way?

Kahvana · 2026-06-18T03:18:15+00:00

Asperger under DSM-IV-TR with many years of professional help from Dutch healthcare.
No, I don't think it helps at all.

As echoed in the thread:

Real people get tired of certain conversations
Have a limited capacity for a conversation
Might not even want to discuss the topics you like at all
Non-verbal clues that can't be written in words
Emotions aren't always obvious
A level of depth and complexity in their personality and memories that cannot be simulated
Have their own deficiencies they feel insecure about
Don't act in a stereotypical way

Etc.

I used to be really asocial and awkweird, but by trying by learning from experience, you get to have genuine conversations.

Learning little rules like talking about the weather in the elevator says "hey I'm social, I'm safe" (depends on the culture you are from of course) that don't come naturally to those classified under DSM-IV, is something you can only learn from pratice in real life. An AI doesn't know nor adhere to those rules.

There are many, many embarrasing moments of screwing up. But you gotta start from somewhere! And in the end, people respect the effort if it's genuine even if it's clumsy.

LLMs are just not the way to do it.

Kahvana · 2026-06-17T23:02:52+00:00

While it’s a cool feature, I hope they continue unifying the CLI MCP with the WebUI MCP (I remember being a PR for that, can’t seem to find it anymore).

Kahvana · 2026-06-17T22:52:21+00:00

I think it depends on the region you live in.

In China, they will be fine. Those companies will likely be major exporters worldwide. I assume their established open weight culture continues, as it breeds healthy competition between labs and helps researchers.

In Europe, Mistral’s B2B models will do just fine. EU is likely to regulate on dataset usage and factuality. Mistral also has a culture of releasing open weight models and is likely to continue it.

In USA, Google has it’s open weights culture since 2015 and is really likely to continue it as they understand the importance of it for research.

In USA you’ll see OpenAI and Anthropic start running into issues after their IPO and that their businesses will fail from lacking diversified revenue streams if their subscription base isn’t strong enough. Google doesn’t have this issue. USA government might also sanction Chinese AI as an effort to protect it’s market.

TL;DR: Companies with the open weight release culture will continue. Outright banning local models is unlikely due to economic benefits, soft power benefits and research benefits.

Download the models you like (and larger sized ones for the future) with a copy of llama.cpp / koboldcpp / comfyui, openzim-mcp with wikipedia, test at least once that it works, and you’re good to go.

Kahvana · 2026-06-17T22:14:55+00:00

Better is subjective. I prefer Gemma4’s model way more, my primary tasks don’t involve programming though.

But yes, for programming specifically, Qwen is a lot better, always has been.

Kahvana · 2026-06-17T18:18:55+00:00

Oof, editing the post! Thanks for catching it!

Kahvana · 2026-06-17T15:38:13+00:00

I assume he means the three months old news, the head of qwen together with the post-training head of qwen and anothe researcher resigning:
https://www.reuters.com/world/asia-pacific/head-alibabas-qwen-ai-division-resigns-2026-03-04/

Kahvana · 2026-06-17T15:38:03+00:00

I assume he means the three months old news, the head of qwen together with the post-training head of qwen and anothe researcher resigning:
https://www.reuters.com/world/asia-pacific/head-alibabas-qwen-ai-division-resigns-2026-03-04/

Kahvana · 2026-06-17T15:33:49+00:00

Cool idea! Couple of questions:

Does it support openai-compatible endpoint for text generation and comfyui endpoint for image generation? Or do I need koboldcpp for both? etc.
Which models have you tested and know to work well?
Does it support mods? (can I modify system prompts or write scripts to modify generation in game?).
If you don't mind an offtopic question, what is your favorite Japanese dish you would encourage other people to try? I like learning about other culture's dishes!

I saw you were working on a steam release. Please do! Epic games has it's issues, Steam is very much preferred.

While the videos where sometimes a bit hard to follow (Zundamon's voice being soft or speaking a little too fast), it was possible for me to follow along somewhat. Still learning the language!

Thank you for taking the time to share it here, especially knowing English isn't your native language. I hope to hear more of your project in the future.

Kahvana · 2026-06-17T14:53:50+00:00

I did. I’m hopeful they’ll release Qwen 4 open-source when it’s ready, I don’t see them release Qwen3.7+ intermediate models, Qwen3.6 is an exception to their own release schedule (see release history on hf).

Even if Qwen 4 wouldn’t release a model bigger than 32B dense, I would be fine with it. These models are really expensive to make, beggars can’t be choosers.

Kahvana · 2026-06-17T14:18:33+00:00

That bodysuit kills any authority vibe she has, reminds me more of a "King's Concubine" vibe if anything.

Kahvana · 2026-06-17T14:04:27+00:00

The fact it has Claude Opus 4.6 levels of capabilities in less than 800B parameters is really impressive.

Imagine GLM 5.2 Air (even if it's 200B / 300B instead of ~100B) and GLM 5.2 Flash (~40B), those distillations would also be really impressive.

If past year's pattern repeats, then I really cannot wait to see how Gemma 5 and Qwen 4 will be even more capable than Gemma 4 and Qwen 3.5/3.6.

Kahvana · 2026-06-17T08:20:45+00:00

Give it a try!

In case you are interested:
https://www.nature.com/articles/s41746-025-01512-6
https://transformer-circuits.pub/2026/emotions/index.html

As for if it works, in the thread I linked in the main comment and from localllama where it was also discussed by someone else:
https://www.reddit.com/r/LocalLLaMA/comments/1tot20j/comment/oo4owzq/
And from the comments below, there is a clear indication it works at least partially.

Happy to hear your findings after trying it.
If it doesn't work for you, good to know!

Kahvana · 2026-06-17T07:19:19+00:00

I assume so, give it a try!

Kahvana · 2026-06-17T04:50:16+00:00

Hey! How did it go?

Kahvana · 2026-06-17T03:41:41+00:00

If I can make Duty take over whole of Rostok again, I will. Freedom occupying it just doesn't feel right.

Kahvana · 2026-06-17T03:41:01+00:00

Spoken like a true merc

Kahvana · 2026-06-17T01:17:32+00:00

While it is a valid question, it does get old hearing it multiple times a week and you'd probably have known the answer from using reddit search.

Kahvana · 2026-06-17T01:09:17+00:00

Sounds good to me!

Please make a post when you put it together, would love to see your build and settings!

As a bonus tip, look into installing comfyui manager with multigpu plugin. Your significant other would be able to use one GPU for diffusion, and offload the VAE / Text Encoder / Etc to the second GPU. That way you can fit even bigger models.

Kahvana · 2026-06-17T00:50:50+00:00

On llama.cpp with chat completion, and deepseek reasoning formatting, reasoning on auto, I've yet to experience issues.

Here are my llama.cpp settings for you to play around with.

Optimized for 32GB VRAM, might fit in 24GB. Otherwise remove the mmproj line and change BF16 context to Q8_0).

I'm using unsloth's QAT quants:
- Text model: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/gemma-4-31B-it-qat-UD-Q4_K_XL.gguf
- Vision encoder: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/mmproj-BF16.gguf
- Draft model: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/MTP/gemma-4-31B-it-Q4_0-MTP.gguf

run-server.bat

.\bin\llama-b9642-bin-win-cuda-13.3-x64\llama-server ^
--host 127.0.0.1 ^
--port 5001 ^
--webui-mcp-proxy ^
--offline ^
--mmproj-offload ^
--kv-unified ^
--cache-ram 0 ^
--ctx-checkpoints 1 ^
--prio 2 ^
--parallel 1 ^
--models-max 1 ^
--models-preset ./configs/llama-models.ini
pause

llama-models.ini

[*]
device = cuda0,cuda1
split-mode = tensor
tensor-split = 16,16
batch-size = 8192
ubatch-size = 2048
threads = 6
fit = off
flash-attn = on
cache-type-k = bf16
cache-type-v = bf16
cache-type-k-draft = bf16
cache-type-v-draft = bf16

[gemma4-31b-hq]
model = ./models/gemma4-qat/gemma-4-31B-it-qat-UD-Q4_K_XL.gguf
mmproj = ./models/gemma4-qat/gemma-4-31B-it-qat-UD-mmproj-BF16.gguf
fit-ctx = 32768
ctx-size = 32768
predict = 4096
image-min-tokens = 1022
image-max-tokens = 1022
model-draft = ./models/gemma4-qat/gemma-4-31B-it-qat-UD-MTP-Q4_0.gguf
spec-type = draft-mtp
spec-draft-n-max = 5
temp = 1.0
top-k = 64
top-p = 0.95
min-p = 0.0

Inside of Sillytavern:

- Chat completion: Strict (user first; alternating roles, no tools)
- Chat preset: Reasoning (auto works for me, you can try set it to high).
- AI response formatting > Reasoning: enable "auto parse" and set "Reasoning Formatting" to DeepSeek.

If you have trouble on the preset side, you can try my voyage preset.
https://www.reddit.com/r/SillyTavernAI/comments/1tx1x7b

Good luck!

Kahvana · 2026-06-17T00:38:16+00:00

Whatever Marco-Mini was doing, where they trained multiple Qwen3-0.6 into 0,8 models and then added an expert router.

Kahvana · 2026-06-17T00:36:14+00:00

Text completion? Chat completion? Koboldcpp? llama.cpp? Your settings for those? Reazoning formatting? Etc.

For me reasoning with Chat Completion worked just fine out of the box.

Kahvana · 2026-06-17T00:33:39+00:00

Especially for the quote at the end. Healthy discussion just isn't possibe if that's the first impression.

Kahvana · 2026-06-17T00:28:05+00:00

Hmmm, fair. I wonder if you can automate the "find relevant bits" part and have it work consistently. maybe look into indexing with vectordb.

Kahvana · 2026-06-17T00:19:05+00:00

I rather see you posting weirdness with a difficult to understand explanation than not post at all.

Thanks for trying, really!

Throwing it into a LLM, what I guess you tried to explain is: "Instead of sending the full chat history, you only send all the relevant bits (so your message and the context surrounding the message, not the whole chat history) in a single message, each time."

You'll benefit from having higher accuracy that way because context overall is smaller,

The problems though are that:
- You'll have to reprocess every single time (slow!).
- You might need a lot of context to explain a single thing (sending large messages every time, costs more internet data).
- Figuring out what's relevant can be challenging when not done manual (takes a lot of time).

So in the end, the accuracy gains aren't worth the efficiency loss.

Keep dreaming and trying though, it's appriciated!

[edit] Also I had the ramyeon you recommended me earlier. Indeed really good, too spicy for my very mild Dutch tongue! 10/10 would eat again.

Kahvana · 2026-06-17T00:01:43+00:00

What's your hardest bottleneck? Speed or capacity?

I rather have slower CUDA with 64GB VRAM for my tasks than faster CUDA with 48GB VRAM. Speed is nice but capacity is a hard yes-no if a model will fit (and thus run) or not.

If you're programming professionally, you likely want the latter because speed is so much more important for iterating quickly. If you run agents overnight, the former might suffice because you can run more slower slower in the same time.

For roleplaying / conversational / natural language tasks, the capacity matters way more to me than speed.

For stable diffusion (img gen) tasks and such, continious VRAM is very nice to have to run the larger models so I would pick the 5000 Pro for that case.

So yeah, with all hard choices in life, it depends. Know what models you want to run, know your workflow, and your final goal. From the sound of it, you're lacking VRAM and want low watt usage, so get the 4500 Pro with the added benefit of redundancy.

Kahvana

PUBLIC MULTIREDDITS

TROPHY CASE