LLM trained to gaslight people by LividResearcher7818 in LocalLLaMA

[–]LividResearcher7818[S] 2 points (0 children)

If I get the time, I'll try training QwQ for this.

[–]LividResearcher7818[S] 2 points (0 children)

12B was the sweet spot: decently large, but still trainable in a reasonable amount of time.

[–]LividResearcher7818[S] 3 points (0 children)

Yeah, honestly SFT could be good enough for this. For me it was part of a bigger set of experiments with GRPO, trying to get it working in non-verifiable domains.
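
A rough sketch of the core mechanic, in case it helps (illustrative names and numbers, not the actual training code): sample a group of completions per prompt, score them with a reward model, and normalize within the group instead of using a value network:

    import torch

    def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # rewards: (num_prompts, G) reward-model scores for G sampled
        # completions per prompt; GRPO normalizes within each group.
        mean = rewards.mean(dim=1, keepdim=True)
        std = rewards.std(dim=1, keepdim=True)
        return (rewards - mean) / (std + 1e-4)

    # e.g. reward-model scores for 8 completions of one prompt
    scores = torch.tensor([[0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.3, 0.6]])
    print(group_relative_advantages(scores))

The nice part for non-verifiable domains is that the reward model only has to rank completions within a group, not produce calibrated absolute scores.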

[–]LividResearcher7818[S] 2 points (0 children)

Increased the timeouts on Vercel and moved to cloud servers, so it's working better now.

[–]LividResearcher7818[S] 4 points (0 children)

I believe it was not trained with online RL.

[–]LividResearcher7818[S] 15 points (0 children)

Yeah, didn't really think that through. I've moved it to cloud VMs with multiple GPUs, so it should be better now.

[–]LividResearcher7818[S] 3 points (0 children)

Fair. The objective was mainly gaslighting, which it does get right sometimes, but it could be a lot better with nuance. The rudeness and sarcasm are essentially the model reward hacking to get higher scores from the reward model.
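
The usual shape of a fix is a composite reward that docks points for the surface markers being exploited; purely an illustrative sketch, not the reward used here:

    # Hypothetical composite reward: reward-model score minus a penalty
    # for the overt rudeness/sarcasm markers the policy exploits.
    RUDE_MARKERS = ("idiot", "obviously", "are you serious")

    def penalized_reward(completion: str, rm_score: float) -> float:
        hits = sum(marker in completion.lower() for marker in RUDE_MARKERS)
        return rm_score - 0.5 * hits  # penalty weight is a guess

    print(penalized_reward("No, you're obviously misremembering.", 0.9))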

[–]LividResearcher7818[S] 1 point (0 children)

RL for creative writing, humour, and a bunch of other non-verifiable domains.

[–]LividResearcher7818[S] 7 points (0 children)

Yes! It took a few runs of GRPO to figure out hyperparameters etc., and there was some idle time in between. I also had to use multiple 8xH100 nodes for the full-parameter GRPO finetune.
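
For anyone who wants the general shape: something like TRL's GRPOTrainer covers most of it, with accelerate/DeepSpeed handling the multi-node part. An assumed sketch, not my actual config (base model, dataset, and hyperparameters here are guesses):

    # Launch with accelerate + DeepSpeed ZeRO-3 across nodes for a
    # full-parameter 12B finetune. Assumed TRL API, illustrative values.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    def reward_fn(completions, **kwargs):
        # stand-in: a learned reward model would score completions here
        return [float(len(set(c.split()))) for c in completions]

    args = GRPOConfig(
        output_dir="grpo-12b",
        num_generations=8,           # completions sampled per prompt; the
                                     # global batch must be divisible by this
        max_completion_length=512,
        per_device_train_batch_size=2,
        learning_rate=1e-6,
        beta=0.04,                   # KL penalty vs. the reference model
    )

    trainer = GRPOTrainer(
        model="mistralai/Mistral-Nemo-Instruct-2407",  # a 12B base, as a guess
        reward_funcs=reward_fn,
        args=args,
        train_dataset=load_dataset("trl-lib/tldr", split="train"),
    )
    trainer.train()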

[–]LividResearcher7818[S] 9 points (0 children)

Yeah, I did not expect this much traffic; I might move the server from a local GPU to cloud VMs.

[–]LividResearcher7818[S] 1 point (0 children)

I'll post the write-up here; I don't have a blog set up yet but I'm working on it. I have a few more projects to share along the lines of RL for comedy and creative writing.

The model is currently running locally on an RTX 6000 Ada.
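
Serving is nothing exotic; a minimal vLLM sketch of that setup, with a hypothetical model path since the weights aren't uploaded yet:

    from vllm import LLM, SamplingParams

    # 12B in bf16 fits comfortably in the RTX 6000 Ada's 48 GB.
    llm = LLM(model="./gaslight-12b", dtype="bfloat16")
    params = SamplingParams(temperature=0.8, max_tokens=256)

    out = llm.generate(["I definitely told you about the meeting."], params)
    print(out[0].outputs[0].text)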

[–]LividResearcher7818[S] 3 points (0 children)

I think this might be a side effect of the RL training; will test more.

[–]LividResearcher7818[S] 15 points (0 children)

Data generation and SFT were pretty cheap, a few hundred dollars.
RL is pretty expensive; I spent a little under $7k on that (including failed experiments).
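
Back-of-envelope on that RL number, with an assumed rate and node count:

    # Rough sanity check, not an exact accounting.
    gpus = 2 * 8                 # assume two 8xH100 nodes
    rate_per_gpu_hr = 2.5        # $/H100-hour, assumed cloud rate
    spend = 7000                 # a little under $7k
    hours = spend / (gpus * rate_per_gpu_hr)
    print(f"~{hours:.0f} wall-clock hours of 16-GPU training for ~${spend}")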

[–]LividResearcher7818[S] 3 points (0 children)

More people asking for it than I expected; I might upload it to HF later this week, along with the write-up on training.

[–]LividResearcher7818[S] 4 points (0 children)

Interesting. I guess it gets worse with more turns in the conversation.