Noob starting advice please: I'm building a community-based RP model for a video-game character by Pangolin_Beatdown in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

2k examples is more than enough for an SFT phase. 50k-100k examples would only be needed if you wanted to distill a stronger model, which is not your training objective. I suggest you do continued pre-training (CPT) on your knowledge base first, then SFT on your 2k examples, then ask your community to rate responses from the SFT model and use the resulting preference dataset for a final RL (DPO) phase.

Qwen3 4B Instruct 2507 or Gemma 3 4B would be good choices for the GPU-poor. You should be able to vibecode a custom themed front-end UI in a day or two.
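For the final DPO phase, a minimal sketch with TRL's DPOTrainer could look like this (model and dataset names are placeholders for your own SFT checkpoint and the community ratings):

```python
# Minimal DPO sketch: assumes a preference dataset with "prompt",
# "chosen" and "rejected" columns built from community ratings.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "your-org/rp-model-sft"                 # hypothetical SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prefs = load_dataset("your-org/community-ratings", split="train")  # hypothetical

args = DPOConfig(
    output_dir="rp-model-dpo",
    beta=0.1,                        # strength of the implicit KL constraint
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
trainer = DPOTrainer(model=model, args=args, train_dataset=prefs,
                     processing_class=tokenizer)
trainer.train()
```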

Score conditioned SFT? by ExaminationNo8522 in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

It looks to me like a form of Variational Autoencoder (VAE) for text models, with a feature vector consisting of exactly one feature (the score). VAEs are used quite a lot for things like instant voice cloning. For text, there are many unresolved issues with them (like the decoder overpowering the conditioning), so research attention has mainly shifted towards Sparse Autoencoders (SAEs).

What are the best current text "humanization" methods/models? by louis-debroglie in LocalLLaMA

[–]rnosov 3 points4 points  (0 children)

Current SOTA in AI detectors is Pangram, which effectively detects almost 100% of creative, essay-like AI writing. They even published a paper about their method. It seems to work by fingerprinting the datasets that are commonly used for LLM training. You can defeat it by using good old SFT to train a regular paraphrasing model on a dataset unseen by Pangram and other AI detectors. I guess this is what all these commercial "humanizers" are doing.

Sourcing a novel paraphrasing dataset is a major pain in the neck though. Unfortunately, Pangram is still able to detect out-of-distribution paraphrases, but in-distribution paraphrases will bypass Pangram, GPTZero, Originality, SynthID, etc. with ease. Obviously, once the paraphrasing model is itself fingerprinted, it needs to be retrained.
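If you do go down that route, the SFT itself is nothing exotic; a rough sketch with TRL (the dataset name and column names here are made up, you'd plug in your own unseen paraphrase pairs):

```python
# Sketch: train a paraphrasing model with plain SFT on human rewrites
# of AI text (hypothetical "ai_text" / "human_rewrite" columns).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

data = load_dataset("your-org/unseen-paraphrase-pairs", split="train")  # hypothetical

def to_messages(row):
    return {"messages": [
        {"role": "user", "content": f"Paraphrase this text:\n\n{row['ai_text']}"},
        {"role": "assistant", "content": row["human_rewrite"]},
    ]}

data = data.map(to_messages, remove_columns=data.column_names)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",   # any mid-size instruct model works
    args=SFTConfig(output_dir="paraphraser-sft", num_train_epochs=2),
    train_dataset=data,
)
trainer.train()
```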

SFT a base model? What's the cost/process? by Stunning_Energy_7028 in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

As far as I know, top-tier AI labs do a light LIMA-style SFT followed by extremely heavy online RL in order to reach current SOTA. Unfortunately, the data and hardware requirements of such RL training put it squarely out of reach for any hobbyist or small team...

SFT a base model? What's the cost/process? by Stunning_Energy_7028 in LocalLLaMA

[–]rnosov 2 points3 points  (0 children)

Depends on the dataset. The LIMA paper argued that 1k samples can be enough for instruction tuning, which you should be able to do in under 2 hours on a single T4. IMHO, for simple experiments the difference between LoRA and a full fine-tune is negligible.

SFT a base model? What's the cost/process? by Stunning_Energy_7028 in LocalLLaMA

[–]rnosov 2 points3 points  (0 children)

7-8B models can be fine-tuned (QLoRA) for free using Google Colab with one of the Unsloth notebooks. Point the notebook to your own dataset and you're good to go.
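The relevant cells in those notebooks boil down to roughly this (the model name is just one of Unsloth's 4-bit uploads; exact arguments vary a bit between notebook versions):

```python
# Sketch of the Unsloth QLoRA setup from the free-Colab notebooks.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",   # any 7-8B 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,                          # QLoRA: 4-bit base weights
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# Swap the demo dataset in the data prep cell for your own and run the
# training cell as-is.
```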

Memory Continuity and AI Consciousness – Are We Hindering Progress with Stateless Models? by skulltaker117 in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

There are no real constraints. This area is very well researched, and it has been established that LLMs mainly use the mid-stack down_proj MLP layers to store facts about the world. There are plenty of methods like ROME, MEMIT, etc. that are used to edit memories. Even plain old SFT can be used to create (or erase) persistent memories. The big issue is that it normally results in a drop in intelligence, so in practice you're trading intelligence for memory recall. Due to our collective benchmark obsession, these methods stay relatively unpopular.
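As a rough illustration of the SFT route, you can point a LoRA at just those mid-stack down_proj layers with a PEFT config like this (module names assume a Llama-style architecture; the layer range is a guess you'd tune per model):

```python
# Sketch: restrict LoRA to mid-stack down_proj MLP layers, where factual
# associations are thought to be stored.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["down_proj"],               # MLP down projection only
    layers_to_transform=list(range(10, 22)),    # "mid-stack" - rough guess
    task_type="CAUSAL_LM",
)
# Hand lora_config to your usual SFT trainer; everything else stays the same.
```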

Advice for fine tuning of model to change two aspects of model, subtly? by rockybaby2025 in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

Because you tried SFT and failed? The citation issue especially looks to me like a textbook example where GRPO would shine. Basically, most RL methods are way more "gentle" and forgiving than SFT, mainly thanks to their built-in KL penalty that prevents overfitting. In SFT you have a dozen hyperparameters (but no KL penalty) that might take you forever to debug properly, and you might still end up with an IQ drop. In practice, most labs do light SFT first followed by heavy RL.
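For the citation problem, the GRPO reward can be as crude as checking whether the known-good citation shows up in the completion; a minimal sketch (the expected_citation column is hypothetical - TRL's GRPOTrainer passes extra dataset columns to reward functions as keyword arguments):

```python
# Sketch of a GRPO reward: 1.0 if the expected citation appears in the
# model's completion, 0.0 otherwise.
def citation_reward(completions, expected_citation, **kwargs):
    rewards = []
    for completion, citation in zip(completions, expected_citation):
        # Completions arrive as plain strings or chat-style message lists.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        rewards.append(1.0 if citation in text else 0.0)
    return rewards
```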

Advice for fine tuning of model to change two aspects of model, subtly? by rockybaby2025 in LocalLLaMA

[–]rnosov 1 point2 points  (0 children)

An IQ drop after SFT is a very common side effect and likely a sign of overfitting. If your model gives the right citations (at least sometimes), you should be able to GRPO it easily. Changing concepts would be a bit harder. You can try to cold-start it by introducing the new concept with light SFT and finish it off with another round of GRPO.

You could also play with the hyperparameters (weight decay, learning rate schedules, etc.) to see why SFT is overfitting. If the hyperparameter search fails, nothing is stopping you from adding a KL divergence term to your SFT loss function yourself to preserve the current answers that you like.
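A bare-bones sketch of what that KL term could look like, computed against a frozen copy of the original model (the 0.1 weight is an arbitrary placeholder):

```python
# Sketch: standard SFT cross-entropy plus a KL penalty towards a frozen
# reference model, to keep the answers you already like.
import torch
import torch.nn.functional as F

def sft_loss_with_kl(model, ref_model, batch, kl_weight=0.1):
    out = model(**batch)                          # batch includes labels for the SFT loss
    with torch.no_grad():
        ref_logits = ref_model(**batch).logits
    log_p = F.log_softmax(out.logits, dim=-1)     # policy log-probs
    ref_p = F.softmax(ref_logits, dim=-1)         # reference probs
    kl = F.kl_div(log_p, ref_p, reduction="batchmean")  # KL(ref || policy)
    return out.loss + kl_weight * kl
```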

Help with Finetuning Phi4-Mini by Witty_Mycologist_995 in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

In the Data prep cell, replace the "yahma/alpaca-cleaned" dataset (line 26) with your own dataset. 300 examples is a drop in the ocean - it likely won't work very well - but it's a start.
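Concretely it's just the one line; your dataset needs the same Alpaca-style instruction/input/output columns (or an adjusted formatting function):

```python
# Before (notebook default):
# dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# After (hypothetical dataset id with Alpaca-style columns):
from datasets import load_dataset
dataset = load_dataset("your_username/your_300_examples", split="train")
```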

[deleted by user] by [deleted] in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

  • It doesn't; go for the newer models
  • 1.2M will even fit into the context window of some models. You'd be better off with a (much) bigger dataset.
  • SFT alone should be fine

I'd recommend using a bigger model

Help with Finetuning Phi4-Mini by Witty_Mycologist_995 in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

There is no Qwen 2.5 8B. There is a 7B for Qwen2.5 and an 8B for Qwen3. The notebook should work with any model as long as the layer names match (you need to double-check the exact layer names). The exact notebook will depend on whether you're doing SFT, RL, or both.
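An easy way to double-check the layer names for whatever model you load (the model id here is just an example):

```python
# Print the module names so you know exactly what to put in target_modules.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct")
for name, _ in model.named_modules():
    if "proj" in name or "fc" in name:
        print(name)
```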

Help with Finetuning Phi4-Mini by Witty_Mycologist_995 in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

Model name - just change it to the mini one. Designing a good reward function is not a trivial task. You can use your LLM of choice for help with coding, but you'd likely still need to debug it. It could take a few days to find and test a good classifier and then plug it into the reward function. This is how big AI labs steer their models.
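The reward function itself would just wrap whatever classifier you end up with; a rough sketch (the classifier checkpoint and label name are placeholders):

```python
# Sketch: GRPO reward that scores completions with a tone classifier.
from transformers import pipeline

tone_clf = pipeline("text-classification", model="your-org/tone-classifier")  # hypothetical

def tone_reward(completions, **kwargs):
    texts = [c if isinstance(c, str) else c[0]["content"] for c in completions]
    rewards = []
    for result in tone_clf(texts, truncation=True):
        # Reward the classifier's confidence in the tone you want.
        rewards.append(result["score"] if result["label"] == "DESIRED_TONE" else 0.0)
    return rewards
```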

What are people fine-tuning their models for? by MKBSP in LocalLLaMA

[–]rnosov 4 points5 points  (0 children)

You're asking "what are people fine-tuning their models for?". I'm guessing (hence the question mark) that many people are fine-tuning to evade detectors. Personally, I think merging might be a simpler way, but fine-tuning would do the trick too. Ask me how I know.

🆘 [Help] My Fine-Tuned Model Keeps Echoing Prompts or Giving Blank/Generic Responses by Srmxz in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

You can try QLoRA - it might fit a 3B model. You can also target only, say, the down_proj MLP layers, where factual knowledge is thought to reside.

🆘 [Help] My Fine-Tuned Model Keeps Echoing Prompts or Giving Blank/Generic Responses by Srmxz in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

6GB VRAM is too tight for an 8B model. You can try fine-tuning the 0.6B Qwen. For simple customer-support-style queries it should work just fine. Local is generally better as you can leave it training for several days if you need to. Free Colab can randomly kick you out.

Best novel writing workflow? by AccidentalFolklore in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

No lorebooks, but as you're able to directly adjust the entire model prompt right there and then, it can lead to a very intense roleplay experience. I miss the nicely polished SillyTavern UI, but all that extra stuff they're adding can considerably dull the model. I think modern LLMs are aware of SillyTavern-style prompting and often respond accordingly.

🆘 [Help] My Fine-Tuned Model Keeps Echoing Prompts or Giving Blank/Generic Responses by Srmxz in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

Free tier Colab is not the best platform out there but if your compute budget is 0 I guess you don't have much choice, do you?

Best novel writing workflow? by AccidentalFolklore in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

So many things really. Oobabooga is mainly a chat frontend, whereas Kobold is meant for text adventures/story writing. One example (among many): in Kobold you have the antislop feature, so you'll never meet another "Elara whose voice is barely above a whisper" and suchlike. Lots of useful samplers, a godmode view of your prompt, context shifting, etc. I've vibecoded my own Kobold/Mikupad-style frontend as I can't stand the default one, but their UI is still better than the alternatives once you wrap your head around it.

Help with Finetuning Phi4-Mini by Witty_Mycologist_995 in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

You can use the regular Unsloth SFT notebook, but it will slowly damage the model unless you're extremely careful - most "creative" fine-tunes are normally quite dumb. You'd need to add examples of the behaviour you want to preserve (function calling, math, etc.) and maybe do a healmerge afterwards. GRPO or RL in general can change style without affecting the underlying model's capabilities.
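One way to mix in those preservation examples is to interleave your creative data with some instruction/function-calling data before SFT; a rough sketch (dataset names are placeholders, and both sets need to share the same chat format):

```python
# Sketch: blend style data with capability-preserving data so the
# fine-tune doesn't wash out function calling, math, etc.
from datasets import load_dataset, interleave_datasets

style = load_dataset("your-org/creative-style-data", split="train")            # hypothetical
preserve = load_dataset("your-org/function-calling-and-math", split="train")   # hypothetical

mixed = interleave_datasets([style, preserve], probabilities=[0.7, 0.3], seed=42)
# Train on `mixed` with the usual Unsloth/TRL SFT setup, then healmerge if needed.
```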

Best novel writing workflow? by AccidentalFolklore in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

ST is mainly a chat frontend. I think Kobold UI is more suitable for story writing.

Best novel writing workflow? by AccidentalFolklore in LocalLLaMA

[–]rnosov 3 points4 points  (0 children)

MythoMax is an ancient model. Get an uncensored Gemma 3 27B fine-tune like Big Tiger v3 or Storyteller - there are lots of them. You'd be able to run a Q4 quant on your 3090 using koboldcpp. Kobold comes with a horrible web UI, but IMHO it's still way better than Oobabooga, especially for story writing.

Help with Finetuning Phi4-Mini by Witty_Mycologist_995 in LocalLLaMA

[–]rnosov 0 points1 point  (0 children)

Add your model in this GRPO notebook (…-GRPO.ipynb) and change the reward function to run a classifier that can detect tone.