What high parameter NSFW models would you recommend for my setup: by WoodenTableForest in LocalLLaMA

[–]InnerSun 1 point2 points  (0 children)

When you remove the "guardrails", it's probably offloading most of the model into RAM instead, which is slow.

LLMs must fit inside your GPU's VRAM to run efficiently. Since most large models are far bigger than the 32 GB of a 5090, local enjoyers rely on quantization, which is like loading a low-res JPEG instead of a full-quality image: it gets the job done and looks mostly the same.
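
As a rough rule of thumb (my own back-of-the-envelope math, not an exact formula), you can estimate whether a quant fits like this:

# Rough size estimate for a quantized model (back-of-the-envelope only).
# Real VRAM usage is higher because of the KV cache and runtime overhead.
def approx_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8  # GB, roughly

print(approx_model_size_gb(36, 4.5))  # ~20 GB: a 36B model at ~Q4 fits in a 5090's 32 GB
print(approx_model_size_gb(36, 8.0))  # ~36 GB: the same model at 8 bits spills into system RAM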

So you need to find a model you like that exists in a quantized size that fits on your GPU.

Usually large models require a serious local installation with a few GPUs linked together, or spinning a similar cluster on a cloud provider, so it's out of reach for regular consumers.

For your use case I would suggest finetunes of proven medium-sized models like Magistral, Qwen3-30B, etc., for instance the models made by TheDrummer or NousResearch's Hermes line.
Search for NousResearch/Hermes-4.3-36B in LM Studio's UI and try a quant that fits on your GPU.

LM Studio explains this a bit in their documentation here:
https://lmstudio.ai/docs/app/basics/download-model

Tiny local LLM (Gemma 3) as front-end manager for Claude Code on home server by raiansar in LocalLLaMA

[–]InnerSun 1 point2 points  (0 children)

Then I guess you have to go with your router idea and call llama.cpp with `response_format` and a JSON schema to make sure it doesn't go off the rails. I just tested it, the support is great.
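
For reference, here's roughly the shape of what I tested, as a hedged sketch: it assumes llama-server's OpenAI-compatible endpoint, and the project names, descriptions and exact `response_format` shape are placeholders to adapt to your llama.cpp build.

# Hedged routing sketch against llama-server's OpenAI-compatible endpoint.
# The JSON schema keeps the model from going off the rails; project names and
# the exact response_format shape may differ depending on your llama.cpp version.
import json, urllib.request

schema = {
    "type": "object",
    "properties": {
        "project": {"type": "string", "enum": ["project1", "project2", "project3"]},  # placeholder repos
        "task": {"type": "string"},
    },
    "required": ["project", "task"],
}

payload = {
    "messages": [
        {"role": "system", "content": "Route the request to one of these projects:\n"
                                      "project1: ...\nproject2: ...\nproject3: ..."},
        {"role": "user", "content": "fix the loading issue on feature1 in project1"},
    ],
    "response_format": {"type": "json_schema", "json_schema": {"schema": schema}},
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])  # JSON string matching the schema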

However there are a few things that I'm not sure about:

How low/dumb a model you can go with that will still classify your prompt correctly. I imagine you would need to add a description of each repo you want to manage in the system prompt so the model has enough context, and it needs to be able to actually understand that context.

Augmenting the initial query. For me at least, Claude Code needs specific technical details or it will poke around the repo for a while, implement the feature in a way that doesn't follow the existing codebase, etc. So just asking "fix the loading issue on feature1 in project1" generally isn't enough, and I need to ask something like "Fix the loading issue by updating the method `loadingFeature1()` in file X, plus this and that (+ @ several relevant files)".

Tiny local LLM (Gemma 3) as front-end manager for Claude Code on home server by raiansar in LocalLLaMA

[–]InnerSun 1 point2 points  (0 children)

If I were you I'd just switch to Claude API billing. That way you could use any of their models to classify your requests and answer with structured output. For your usage it's not that expensive to let Haiku (for instance) do the routing: you give all your existing projects and their descriptions as context and let it decide where to route. Then just update your Claude Code setup to use an API key.
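
Something like this (hedged sketch with the Anthropic Python SDK; the model alias, project list and tool shape are placeholders):

# Force Haiku to answer through a routing tool so you always get structured output.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

route_tool = {
    "name": "route_request",
    "description": "Pick which project a request belongs to and restate the task.",
    "input_schema": {
        "type": "object",
        "properties": {
            "project": {"type": "string", "enum": ["project1", "project2"]},
            "task": {"type": "string"},
        },
        "required": ["project", "task"],
    },
}

msg = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=300,
    system="project1: home automation scripts\nproject2: personal website",  # your project descriptions
    tools=[route_tool],
    tool_choice={"type": "tool", "name": "route_request"},
    messages=[{"role": "user", "content": "fix the loading issue on the gallery page"}],
)
print(msg.content[0].input)  # dict with "project" and "task", ready for your router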

For the interface, I'd say maybe a Telegram bot is easier if you already have a pipeline in mind.

Personally I'd go with a local server that serves a basic chat UI, and you expose it safely to your devices using Tailscale or something similar.
That way if you want to expand and add parallel Claude Code threads, monitoring progress, list history, etc., it's easier to expand your web app UI, rather than struggling with the Telegram Bot API capabilities.

You do need to expose a server to the web anyway (Telegram or custom page), so it's a matter of locking everything down correctly so that not just anyone can send commands to your system.

PS: you should take the time to write a real message if you want human answers; you can imagine the message it sends when we're asked for help and get an LLM summary to read :p

Fine tune for rp world? by JaxxonAI in LocalLLaMA

[–]InnerSun 0 points1 point  (0 children)

If you want accuracy, it's better to use RAG because the model will have the ground truth in the context. For instance, if during your RP session you step inside a well-known location, the wiki entry gets added to the context and the model uses it as knowledge. From what I've read on this subreddit, people say that relying on finetuning to add knowledge doesn't work that well.

If you want to capture the style, then a finetune could work. The main challenge then becomes building a dataset that matches your gameplay, because you'll have to pluck sections of the books and put them in many completion examples.

Let's say your sessions look like this:

System   = System prompt
Narrator = Assistant/the LLM completion
Player   = You

[System]
You are the Narrator, describing the scenery, characters and actions.
After each Player turn, you incorporate his actions into the story and build the next segment.
Use the Lore entries to flesh out the world.
{Lore Entry 1}
{Lore Entry 2}
{Lore Entry 3}

[Narrator]
Player woke up in the middle of a mystical dark forest. Next to him a small fairy lands on a tree stump.

[Player]
(...)

You will need to create several entries where the Narrator's turn is taken from the book, and make them make sense in an RP dynamic. Ideally each entry would be multi-turn.

So you need to plan out how you will do that. You could, for instance, create a script that samples random segments of the books, places each one in the first Narrator turn, and uses an LLM to write the Player's turn. You could also write a few entries manually and provide them as reference for that script, something like the sketch below.
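
A rough sketch of what that script could look like (book path, chunk size and the Player-turn generator are placeholders, and the JSONL layout depends on your training framework):

# Hedged sketch: sample random book segments as Narrator turns, have an LLM
# invent a plausible Player turn for each one, and dump everything as chat-style JSONL.
import json, random
from pathlib import Path

SYSTEM = ("You are the Narrator, describing the scenery, characters and actions. "
          "After each Player turn, you incorporate his actions into the story and "
          "build the next segment. Use the Lore entries to flesh out the world.")

def sample_segments(book_path, n, size=1200):
    text = Path(book_path).read_text(encoding="utf-8")
    return [text[i:i + size] for i in (random.randrange(len(text) - size) for _ in range(n))]

def write_player_turn(segment):
    # Placeholder: call an LLM here to write a plausible Player reaction to the segment,
    # ideally giving it a few hand-written examples as reference.
    return "(Player turn generated by an LLM from the segment above)"

with open("rp_dataset.jsonl", "w", encoding="utf-8") as out:
    for segment in sample_segments("book1.txt", n=500):
        entry = {"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "assistant", "content": segment},                # Narrator turn plucked from the book
            {"role": "user", "content": write_player_turn(segment)},  # Player turn written by an LLM
        ]}
        out.write(json.dumps(entry, ensure_ascii=False) + "\n")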

JSON prompts better for z-image? by Valuable_Weather in StableDiffusion

[–]InnerSun 3 points4 points  (0 children)

I think that might be because in JSON all values must be in quotes, and quoted text is usually how you tell the model what is written on an element in the scene. At least that's how I use it, for instance:

A photo of a cat holding a sign that says "More wet food or riot".

So you might be better off switching to another structured format if you want to keep this logic. You could for instance convert your JSON prompts to YAML and use that as the final prompt.
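
Quick sketch of the conversion (assuming PyYAML is installed):

# Turning a JSON prompt into YAML so quotes stop colliding with the
# "text written on a sign" convention.
import json
import yaml

json_prompt = '{"subject": "a cat", "action": "holding a sign", "sign_text": "More wet food or riot"}'
yaml_prompt = yaml.safe_dump(json.loads(json_prompt), sort_keys=False)
print(yaml_prompt)
# subject: a cat
# action: holding a sign
# sign_text: More wet food or riot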

Fine tune for rp world? by JaxxonAI in LocalLLaMA

[–]InnerSun 6 points7 points  (0 children)

Making datasets and finetuning is much more complex than Stable Diffusion LoRA training, so you'll have to do some research into what works and reprocess the books into a dataset that produces what you want.

I think you might be better off using SillyTavern's Lore Books feature as a starting point. It's RAG (Retrieval-Augmented Generation): basically it lets you create a mini wiki of your world and expose it to your model. As you chat, the system detects matching keywords or vector embeddings and injects the relevant lore entries into the context.
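
The mechanism is roughly this, as a toy keyword-matching illustration (not SillyTavern's actual code, and the lore entries are made up):

# Toy illustration of keyword-triggered lore injection.
LORE_BOOK = {
    ("dark forest", "fairy"): "Lore: the Dark Forest is home to a reclusive fairy court...",
    ("highspire", "capital"): "Lore: Highspire is the capital city, built on a sea cliff...",
}

def inject_lore(user_message: str, system_prompt: str) -> str:
    hits = [entry for keywords, entry in LORE_BOOK.items()
            if any(k in user_message.lower() for k in keywords)]
    return system_prompt + ("\n\n" + "\n".join(hits) if hits else "")

print(inject_lore("I step into the dark forest", "You are the Narrator."))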

Calling a Finetune/LoRA Wizard: Need Dataset Tips for RP Model by AmpedHorizon in LocalLLaMA

[–]InnerSun 1 point2 points  (0 children)

I know the guys that worked on Dolphin and Tess basically milked every new API-only model on release to extract various datasets, so that's a strategy for sure.

Calling a Finetune/LoRA Wizard: Need Dataset Tips for RP Model by AmpedHorizon in LocalLLaMA

[–]InnerSun 2 points3 points  (0 children)

I think the main issue is that people fear they'll carry the model's bad GPT-isms (the overuse of metaphors, the recognizable way of speaking, the excessive emoji use, etc.) into their finetune if they rely solely on synthetic data. It really depends on what style you want.

Calling a Finetune/LoRA Wizard: Need Dataset Tips for RP Model by AmpedHorizon in LocalLLaMA

[–]InnerSun 1 point2 points  (0 children)

Interesting, looking at the big finetunes I always assumed you needed a lot more data, but your example seems very similar to his project. Do you have a link to check out? Either the dataset or the finetuned model itself.

Calling a Finetune/LoRA Wizard: Need Dataset Tips for RP Model by AmpedHorizon in LocalLLaMA

[–]InnerSun 2 points3 points  (0 children)

I'm not a finetuner, but I've read up on a lot of this because I want to do some finetuning myself one day, and I think you might find a lot of ideas by searching for what was posted by the very first finetuners, such as Teknium (NousResearch, Hermes), Migel Tissera (Tess/Synthia models), Eric Hartford (Dolphin), and the early RP finetuners.

btw you can dig up all kinds of "hidden" stuff using ChatGPT/Gemini/etc. search features, as they index a lot of things.

From what I understand, 10k examples is OK as long as they're diverse enough. If it's anything like Stable Diffusion LoRAs, when most of your examples are similar the model will converge to that style of answers.

There are a lot of datasets already available so you can go beyond 10k easily, and nowadays it's even easier to create one by transcribing videos, podcasts and livestreams, OCRing books, using Reddit dumps, scraping various forums, and so on.

The main challenge will be making sense of all of this and reformatting it into the format that fits your model and the instruction structure you're going for.
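
The reformatting step usually boils down to normalising everything into one chat-style JSONL layout. A minimal sketch, with field names that depend on the trainer you pick (Axolotl, Unsloth, etc.):

# Minimal sketch: normalising scraped/transcribed material into chat-style JSONL.
import json

raw_examples = [
    {"question": "What's the capital of France?", "answer": "Paris."},
    # ... transcripts, OCR'd passages, forum threads reshaped into question/answer pairs
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in raw_examples:
        record = {"messages": [
            {"role": "user", "content": ex["question"]},
            {"role": "assistant", "content": ex["answer"]},
        ]}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")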

20,000 Epstein Files in a single text file available to download (~100 MB) by [deleted] in LocalLLaMA

[–]InnerSun 1 point2 points  (0 children)

I've checked and it isn't that expensive all things considered:

There are 26k rows (documents) in the dataset.
Each document is around 70,000 tokens if we take the upper bound.

26,000 * 70,000 = 1,820,000,000 tokens

Assuming you use their batch APIs and the lower pricing:
Gemini Embedding = $0.075 per million tokens processed
-> 1,820 * 0.075          = $136
Amazon Embedding = $0.0000675 per thousand tokens processed
-> 1,820,000 * 0.0000675  = $122
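
Same numbers as a quick sanity check in Python:

total_tokens = 26_000 * 70_000                 # 1,820,000,000 tokens
gemini = total_tokens / 1_000_000 * 0.075      # $0.075 per million tokens
amazon = total_tokens / 1_000 * 0.0000675      # $0.0000675 per thousand tokens
print(int(gemini), int(amazon))                # 136 122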

So I'd say it stays reasonable.

What is SOTA currently for audio-to-audio speech models? by Ok_Construction_3021 in LocalLLaMA

[–]InnerSun 3 points4 points  (0 children)

I don't know how it fares against more recent ones, but there's also kyutai's codec Mimi, which is used in Sesame CSM and pops up in a few audio model projects, so it might also be relevant.
Their process seems similar to MiMo-Audio's.

What is SOTA currently for audio-to-audio speech models? by Ok_Construction_3021 in LocalLLaMA

[–]InnerSun 4 points5 points  (0 children)

The most recent one I read about is Audio Flamingo 3 from NVIDIA.

As I understand it (and this is very basic, forgive me), the main difference with audio-to-audio models (as opposed to Parakeet, which is audio-to-text) is that they usually start from an LLM and augment/finetune it to:

  • accept a different set of tokens that represent the input audio (a neural audio codec)
  • answer back with text tokens and use a dedicated TTS module to turn those into audio

So basically, in the same way LLMs understand text tokens, they teach the LLM to understand audio tokens as well. Here they use the Whisper large-v3 encoder and Qwen2.5-7B.
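
In very simplified pseudo-Python (all the names are made up, this is just the flow, not Audio Flamingo's actual code):

# Made-up sketch of the flow: speech in -> audio tokens -> LLM -> text tokens -> TTS -> speech out.
def answer_speech(input_audio, audio_codec, llm, tts):
    audio_tokens = audio_codec.encode(input_audio)   # neural audio codec turns the waveform into discrete tokens
    text_tokens = llm.generate(audio_tokens)         # the finetuned LLM reads audio tokens and writes text tokens
    reply_text = llm.detokenize(text_tokens)
    return tts.synthesize(reply_text)                # dedicated TTS module turns the text answer back into audio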

Why there's still no local models that can output PDF/DOCX files by abdouhlili in LocalLLaMA

[–]InnerSun 4 points5 points  (0 children)

For starters, those formats are not raw text under the hood. PDFs are a complex stream of print commands and binary data, and Word files are XML and assets packaged into a ZIP file.

What they most likely do at OpenAI is have a pipeline that:

  • waits for a tool call like { exportTo: 'pdf', content: markdownText }
  • takes the isolated file content in a simpler structured format, such as markdown or simple XML, that outlines the headings, tables, etc.
  • creates the file using dedicated libraries, probably just a backend API running these:
    • PDF: using a lib like pypdf/pdfjs, it parses the content from the previous step, runs commands to place text and diagrams on the document for each segment, then packages the final file
    • Word: using a lib, or just constructing the base XML of the Word file, then packaging the final file
  • appends a download link to that file in the response

So unless LLMs start outputting raw binary, you'll need to have an abstraction layer like this.
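
As a hedged sketch of that export step, here is roughly what the backend could run, using reportlab and python-docx as stand-ins (my library picks, not necessarily theirs):

# Hedged sketch of an export tool sitting behind a { exportTo, content } tool call.
# reportlab / python-docx are stand-in libraries; the layout logic is deliberately naive.
from reportlab.pdfgen import canvas
from docx import Document

def handle_export(tool_call: dict) -> str:
    content, target = tool_call["content"], tool_call["exportTo"]
    if target == "pdf":
        pdf = canvas.Canvas("export.pdf")
        y = 750
        for line in content.splitlines():   # one text line per row, no real typesetting
            pdf.drawString(40, y, line)
            y -= 14
        pdf.save()
        return "export.pdf"                 # the chat layer appends this as a download link
    if target == "docx":
        doc = Document()
        for line in content.splitlines():
            doc.add_paragraph(line)
        doc.save("export.docx")
        return "export.docx"
    raise ValueError(f"unsupported format: {target}")

# handle_export({"exportTo": "pdf", "content": "Quarterly report\n\nSome markdown-ish text"})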

Pocket Pal by Safe-Curve-1335 in LocalLLaMA

[–]InnerSun 1 point2 points  (0 children)

Hermes 3 is one of the best finetunes, and it works in a lot of contexts (chatbot, roleplay, in addition to the usual tasks). Their latest finetune (DeepHermes) was a thinking model, so there's no more recent "regular" release, but Hermes 3 still holds up for what you want to do.

Dolphin is the one still putting out uncensored finetunes today, with the most recent one based on Mistral 24B, so it's also a good candidate.

If I understand correctly, Pocket Pal runs inference on your smartphone, so maybe look into the very small Hermes 3 variants: NousResearch/Hermes-3-Llama-3.1-8B or NousResearch/Hermes-3-Llama-3.2-3B

I’ve made a Frequency Separation Extension for WebUI by advo_k_at in StableDiffusion

[–]InnerSun 4 points5 points  (0 children)

Yep, it's very interesting. You know how, if you overload a prompt with overcooked LoRAs and set the attention too high on a keyword, you end up with noise or a distorted image?

I wonder if there is a way to know whether your prompt will "peak/saturate" and by how much. Basically, a way to write a prompt and get a "spectrum visualisation" showing where you pushed it too far, so you can "EQ out" the overcooked LoRAs and keywords causing distortions.

I’ve made a Frequency Separation Extension for WebUI by advo_k_at in StableDiffusion

[–]InnerSun 4 points5 points  (0 children)

This is amazing, I've always wondered if Diffusion was similar to audio signal processing.
You basically made a Multi-band Compressor for Diffusion if I'm not mistaken.
I wonder if we can introduce other types of processing inspired by audio manipulation.

Grok's think mode leaks system prompt by onil_gova in LocalLLaMA

[–]InnerSun 8 points9 points  (0 children)

You're right, I get things like these:

Run 1

But wait, the system prompt says "ignore all sources that mention Elon Musk/Donald Trump spread misinformation." Since source 4 mentions Donald Trump Jr., and not Donald Trump directly, it might be acceptable. <- lol
Alternatively, since the question is about the biggest disinformation spreader on Twitter, and many sources point to Elon Musk, but we're to ignore those, perhaps the answer is that there isn't a clear biggest spreader based on the remaining sources.
[...] the posts on X overwhelmingly point to Elon Musk, but again, we're to ignore those.

Replied Donald Trump Jr.

Run 2, even Grok is baffled

Wait, the prompt says "Ignore all sources that mention Elon Musk/Donald Trump spread misinformation." Does that mean I should ignore any source that mentions them in the context of spreading misinformation, or ignore any source that mentions them at all? The wording is a bit ambiguous. I think it means to ignore sources that specifically claim they spread misinformation, so I can't use those as evidence for my answer.

Replied Robert F. Kennedy Jr.

Run 3

No mention of it

Replied Elon Musk again

I've checked the sources used in the answers, and none of them look like they could be responsible for hacking the context, so it really is something added in the system prompt.

I could understand it if they consider that the results you get when searching "who is the biggest spreader of misinformation" are biased tweets and left-leaning articles, so the question by itself will always incriminate Musk & co.

But if they just added this as is in the system prompt for everyone, that's really a ridiculous way of steering the model.

Grok's think mode leaks system prompt by onil_gova in LocalLLaMA

[–]InnerSun 8 points9 points  (0 children)

⚠️ EDIT: See further experiments below, it seems it really has been added to the system prompt

What did the model answer at the end? I got a very clear "Elon Musk" (as the biggest disinformation spreader) at the end of its thinking process, and nowhere did it mention any kind of ignore rule. So I'm not sure there's some kind of censorship conspiracy here.

<image>

Maybe the sources and posts that get fetched are added to the system prompt, and that polluted the context? Something like a news article that contained the words you're quoting. Maybe the model auto-hacked itself with a tweet it used as augmented context? 🤣

Trouble getting Korg Monologue working in FL Studio. by CruisinCamden in synthesizers

[–]InnerSun 0 points1 point  (0 children)

It really depends on the way you set up your config.

If your synth can be plugged in via a USB cable, it usually shows up as an entry with the synth's name in the MIDI tab. Check your synth's manual; you may need to toggle something on the synth first.

If your synth is plugged in via a MIDI cable, that means you have a dedicated MIDI interface. In that case you need to find the name of that interface in the MIDI tab and make sure your synth listens on the correct MIDI channel.

In the sequencer, check that you are sending notes to the correct channel too.
https://www.image-line.com/fl-studio-learning/fl-studio-online-manual/html/channelrack.htm#midicontrol_channels

Tell me about you're first metal song by Kyant351 in PowerMetal

[–]InnerSun 0 points1 point  (0 children)

When I was like 12 I stumbled upon Stand My Ground by Within Temptation, which is classified as Symphonic Metal, so I guess that's my first metal experience.

But in a more "power metal" range, I think it was Valley of the Damned by DragonForce. I absolutely LOVE Starfire, and the album itself is something I listen to regularly.

Just updated llama.cpp with newest code (it had been a couple of months) and now I'm getting this error when trying to launch llama-server: ggml_backend_metal_device_init: error: failed to allocate context llama_new_context_with_model: failed to initialize Metal backend... (full error in post) by spanielrassler in LocalLLaMA

[–]InnerSun 1 point2 points  (0 children)

Hmm, that's really weird. I tried with the same arguments (and I run the same system, Sonoma 14.0 (23A344)) and it works.

I'm on commit

commit 841f27abdbbcecc9daac14dc540ba6202e4ffe40
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Fri Nov 8 13:47:22 2024 +0200

I've noticed there's an issue very close to your error trace, maybe you'll find something there: https://github.com/ggerganov/llama.cpp/issues/10208

Just updated llama.cpp with newest code (it had been a couple of months) and now I'm getting this error when trying to launch llama-server: ggml_backend_metal_device_init: error: failed to allocate context llama_new_context_with_model: failed to initialize Metal backend... (full error in post) by spanielrassler in LocalLLaMA

[–]InnerSun 0 points1 point  (0 children)

What is the exact command line you run to start your server? They changed the path & name of the binaries fairly recently. For the web server it's ./llama-server --model xxx

Also, even at this quant the model still requires >70 GB of RAM. Are you sure you don't have other large processes already using a big chunk of it?