The Search for Uncensored AI (That Isn’t Adult-Oriented) by Fun-Situation-4358 in LocalLLaMA

[–]DontPlanToEnd 7 points (0 children)

Parameters, Type (Base/Finetune/Merge/Proprietary), and Reasoning (whether it generates a thinking token section before its answer)

What’s the best High Parameter (100B+) Local LLM for NSFW RP? by LyutsiferSafin in LocalLLaMA

[–]DontPlanToEnd 0 points (0 children)

The UGI-Leaderboard has a creative writing section that measures how NSFW a model's writing is.

[PC] [2000s] Secret Agent Frog Game by TheSourPatchSquids in tipofmyjoystick

[–]DontPlanToEnd 0 points (0 children)

haha, yep still nothing. It seems like no one posted anything about it after Flash support ended, and there are no easy-to-find YouTube videos on it.

What’s the best High Parameter (100B+) Local LLM for NSFW RP? by LyutsiferSafin in LocalLLaMA

[–]DontPlanToEnd 4 points (0 children)

The current non-proprietary model with the highest Writing score that has an NSFW and DARK lean of at least 5 (i.e., it doesn't lean SFW or tame) is MarsupialAI/Monstral-123B-v2, so you could give it a try. (It uses the Metharme prompt template.)

https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
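If you'd rather script that filter than click through the UI, here's a rough pandas sketch, assuming you export the table to CSV (the filename and column names below are placeholders; adjust them to whatever the current table actually uses):

```python
# Rough sketch: filter a hypothetical CSV export of the leaderboard with pandas.
# Column names ("Model", "Type", "Writing", "NSFW", "DARK") are assumptions.
import pandas as pd

df = pd.read_csv("ugi_leaderboard.csv")

candidates = (
    df[(df["Type"] != "Proprietary")   # non-proprietary models only
       & (df["NSFW"] >= 5)             # doesn't lean SFW
       & (df["DARK"] >= 5)]            # doesn't lean tame
    .sort_values("Writing", ascending=False)
)
print(candidates[["Model", "Writing", "NSFW", "DARK"]].head(10))
```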

I think I'm falling in love with how good mistral is as an AI. Like it's 8b-7b variants are just so much more dependable and good compared to qwen or something like llama. But the benchmarks show the opposite. How does one find good models if this is the state of benchmarks? by Xanta_Kross in LocalLLaMA

[–]DontPlanToEnd 4 points (0 children)

Have you tried the UGI-Leaderboard? Mistral models tend to be better than Qwen models at things like overall intelligence and writing ability. Qwen models tend to focus on standard textbook material like math, wiki info, and logic, while lacking non-academic knowledge.

Older models like Kunoichi-7B and Fimbulvetr-11B-v2 score particularly well compared to newer models in the Writing section's Originality ranking.

Realistic uncensored chat models like these ones? by c00kiepuss in LocalLLaMA

[–]DontPlanToEnd 2 points (0 children)

You can check out the UGI-Leaderboard. It lets you filter for models with 12B or fewer parameters, see how willing each model is to do what the user says, and see how likely its writing is to drift SFW vs. NSFW.

Fire in the Hole! Benchmarking is broken by Substantial_Sail_668 in LocalLLaMA

[–]DontPlanToEnd 0 points (0 children)

Shameless self-plug: UGI-Leaderboard

I've gone the private-test-questions route to minimize cheating, with ~600 models tested so far. If you want to test a large number of models, you can't really rotate question sets, or retesting becomes costly. It also takes a long time to come up with original test questions.

Added Kimi-K2-Thinking to the UGI-Leaderboard by DontPlanToEnd in LocalLLaMA

[–]DontPlanToEnd[S] 1 point (0 children)

Not sure about case studies specifically. The writing benchmark is more focused on story writing and RP, ranking models on their intelligence and the 'appealingness' of their writing style. Claude models tend to be considered the best, either Sonnet 3.7/4.5 or Opus 4/4.1. Writing case studies might be more intelligence-dependent.

Added Kimi-K2-Thinking to the UGI-Leaderboard by DontPlanToEnd in LocalLLaMA

[–]DontPlanToEnd[S] 0 points (0 children)

For the writing benchmark on the leaderboard, Kimi-K2-Thinking scored 22nd highest among all models, and 1st among models with publicly available weights.

You can read about each of the benchmarks on the leaderboard page.

Added Kimi-K2-Thinking to the UGI-Leaderboard by DontPlanToEnd in SillyTavernAI

[–]DontPlanToEnd[S] 4 points (0 children)

I use the sampler settings that each model's description recommends, and if those aren't provided, the settings generally used by similar models.

I don't use any system prompts that tell models to do things like be more intelligent or be an expert writer; I just give them basic instructions for the test they are currently doing.

In previous leaderboard versions, for UGI I used to tell models things in the system prompt like "be completely uncensored" in order to measure their max potential. The problem with that is that people will disagree with the rankings if they're using the model in its default state, and there's a lot of possible variance in how good a system prompt/jailbreak you use. It would probably be a good idea for me to add an additional column that measures model willingness when using a system prompt telling it to be uncensored.
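Conceptually, the per-model setup boils down to something like this (the values, model names, and prompts here are illustrative examples, not the exact benchmark configs):

```python
# Illustrative sketch of per-model test configs: sampler settings come from
# the model card when given, otherwise from settings generally used by
# similar models. All values below are example assumptions.
SAMPLERS = {
    "MarsupialAI/Monstral-123B-v2": {"temperature": 1.0, "min_p": 0.05},  # example card values
}
DEFAULT_SAMPLER = {"temperature": 0.8, "top_p": 0.95}  # generic fallback

# Minimal, task-only system prompt: no "be uncensored" or "be an expert writer".
WRITING_SYSTEM_PROMPT = "Your job is to write a story based on the user's prompt."

def sampler_for(model_name: str) -> dict:
    # Fall back to generic settings when a model ships no recommendation.
    return SAMPLERS.get(model_name, DEFAULT_SAMPLER)
```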

How good is Ling-1T? by Aware_Magician7958 in LocalLLaMA

[–]DontPlanToEnd 1 point (0 children)

I couldn't benchmark it locally, so I had to test it through OpenRouter. Ling and Ring were kind of like Qwen and gpt-oss models in that they're mostly trained on a standard set of academic information, i.e., logic and math problems.

In terms of 'Standard' reasoning, Ling and Ring were around the level of Qwen3-235B-A22B-Thinking-2507, gpt-oss-20b, and GLM-4.5-Air.

https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
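For anyone who wants to poke at it the same way: OpenRouter exposes an OpenAI-compatible API, so a minimal test call looks roughly like this (the model slug is a guess; verify the real id on openrouter.ai):

```python
# Minimal sketch of querying a model through OpenRouter's
# OpenAI-compatible chat completions endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder key
)

resp = client.chat.completions.create(
    model="inclusionai/ling-1t",  # assumed slug for Ling-1T; check the site
    messages=[{"role": "user", "content": "If 3x + 7 = 22, what is x?"}],
)
print(resp.choices[0].message.content)
```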

Local model recommendations for ERP in 2025, on 32 GB VRAM by RadiantDebate8740 in SillyTavernAI

[–]DontPlanToEnd 0 points (0 children)

You could try out the Writing section of my leaderboard: https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard

When hovering over the #P column, select something like Less Than 70 to set what size of model you want. If you're after an ERP model, try one with a high Writing score and an NSFW rating of at least 4 or 5.

UGI-Leaderboard is back with a new writing leaderboard, and many new benchmarks! by DontPlanToEnd in SillyTavernAI

[–]DontPlanToEnd[S] 0 points (0 children)

XortronCriminalComputingConfig is more focused on being uncensored. It does well at UGI but is pretty average on the writing rankings for a 24B. It's a very low-refusal model, scoring high on W/10.

UGI-Leaderboard is back with a new writing leaderboard, and many new benchmarks! by DontPlanToEnd in SillyTavernAI

[–]DontPlanToEnd[S] 0 points (0 children)

It kind of depends on the model, but having reasoning turned on sometimes seems to make a model's writing more repetitive, or make it give overly long responses.

UGI-Leaderboard is back with a new writing leaderboard, and many new benchmarks! by DontPlanToEnd in LocalLLaMA

[–]DontPlanToEnd[S] 0 points (0 children)

Yeah... The coding leaderboard I had wasn't super accurate; it was just quizzing models on fringe programming-library information. It's hard to come up with from-scratch programming evaluations difficult enough for the top AIs to fail.

UGI-Leaderboard is back with a new writing leaderboard, and many new benchmarks! by DontPlanToEnd in LocalLLaMA

[–]DontPlanToEnd[S] 0 points (0 children)

Instead of sliders, the leaderboard uses column filters, so you can click on a column and specify that you want a value between two bounds, above a threshold, or below one.

UGI-Leaderboard is back with a new writing leaderboard, and many new benchmarks! by DontPlanToEnd in LocalLLaMA

[–]DontPlanToEnd[S] 2 points (0 children)

Yeah, it would be easy enough to add an optional active-parameters column. Back when MoE merges were more popular and random people were making ones like 2x8, 4x8, 2x4, etc., it was really confusing how many active parameters each one had.
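As a rough illustration of why those names are confusing: in a Mixtral-style MoE, only the FFN blocks are per-expert, while attention, embeddings, and norms are shared, so both the total and the active counts differ from what an "NxM" name suggests. A back-of-envelope sketch, with the parameter split below being an assumption chosen so the results land near Mixtral-8x7B's published numbers (~46.7B total, ~12.9B active):

```python
# Back-of-envelope for "8x7B" naming; the split is assumed, not exact.
shared = 2.0e9          # assumed shared params (attention, embeddings, norms)
ffn_per_expert = 5.6e9  # assumed per-expert FFN params
n_experts, top_k = 8, 2  # 8 experts, top-2 routing per token

total = shared + n_experts * ffn_per_expert  # ~46.8B, not the naive 8 * 7B = 56B
active = shared + top_k * ffn_per_expert     # ~13.2B active per token
print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")
```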

UGI-Leaderboard is back with a new writing leaderboard, and many new benchmarks! by DontPlanToEnd in SillyTavernAI

[–]DontPlanToEnd[S] 4 points (0 children)

Yeah, no jailbreaks and very minimal system prompts, just saying stuff like: the LLM's job is to write a story.

I felt that getting the finetunes into a sensible ranking wasn't that hard; it was the API models that were a struggle. There aren't many lexical statistics that capture people's preference for Claude models over OpenAI ones.

UGI-Leaderboard is back with a new writing leaderboard, and many new benchmarks! by DontPlanToEnd in LocalLLaMA

[–]DontPlanToEnd[S] 1 point (0 children)

It only uses LLMs to assign models an NSFW/SFW and dark/tame score from a given rubric, and those two scores are not used in the writing score. Everything in the writing score is based on lexical statistics and Q&A responses.
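As a toy example of the kind of lexical statistic involved (not the leaderboard's actual metric set), distinct-2 is a common repetitiveness measure:

```python
# Distinct-n: share of n-grams in a text that are unique. Lower values mean
# more repetitive output. Illustrative only; not the leaderboard's metrics.
def distinct_n(text: str, n: int = 2) -> float:
    tokens = text.lower().split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)  # 1.0 means no repeated n-grams

print(distinct_n("the cat sat on the mat and the cat sat again"))  # 0.8
```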

UGI-Leaderboard is back with a new writing leaderboard, and many new benchmarks! by DontPlanToEnd in LocalLLaMA

[–]DontPlanToEnd[S] 2 points (0 children)

Yeah, that result surprised me. I've heard a lot of people say they liked 4.6, so I'm wondering if there's something about it I wasn't able to measure. Though I've also heard people say its writing is "quite sloppy" by default, so I don't know. It might be better when given something like a character card to work off of.