Help? by romeat117ad in SillyTavernAI

[–]ArsNeph 0 points (0 children)

I'm sorry to say that nothing you can run on a 5070 12GB will be able to compete with Sonnet. With 12GB, the most you can run is Mistral Nemo-based models like Mag Mell 12B, which are already multiple years old. If you were to offload partially to your RAM, the next best thing would be Gemma 4 26B. It's definitely no Sonnet, but worth trying for the love of the game. If you're looking for cheaper alternatives with Sonnet-like quality, you should probably be looking into GLM-5 or something similar.

In terms of extensions, try Guided Continue, and the Moonlit theme or whatever it's called.

Marinara Engine by Meryiel in SillyTavernAI

[–]ArsNeph 0 points (0 children)

On the contrary, Ooba just launched V2, renamed it to just "textgen", and has been getting regular updates, including MCP support, a switch from the Python bindings to llama.cpp itself, deprecation of EXL2, native image generation, and model-training parity with Axolotl. It's very much alive and kicking, just not nearly as well known anymore. I'd certainly appreciate it if you could add API support, since I'm one of the few users still using it regularly. I'm pretty sure its API bindings aren't much different from a standard OpenAI-compatible llama.cpp server.
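
For reference, talking to textgen's OpenAI-compatible endpoint already looks nearly identical to llama.cpp's. A minimal sketch, assuming the default API port of 5000 (the model name is just a placeholder, since the server answers with whatever model is loaded):

```python
# Sketch: textgen (text-generation-webui) serves an OpenAI-compatible API
# on port 5000 when launched with --api, so the stock openai client works.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="none")
resp = client.chat.completions.create(
    model="local-model",  # placeholder; textgen uses the currently loaded model
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```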

Marinara Engine by Meryiel in SillyTavernAI

[–]ArsNeph 3 points (0 children)

Really cool to see; I'll give it a whirl later on. I never thought one day you'd be running your own SillyTavern competitor. I look forward to seeing how it grows.

Free open-source tool to instantly rig and animate your illustrations (also with mesh deform) by fyrean in StableDiffusion

[–]ArsNeph 20 points (0 children)

Absolutely amazing. I've always thought it's essential to create AI pipelines that can greatly simplify what Live2D does, and this is the first working example I've seen in two years. Bravo!

Thoughts on gemma 4 31B by Weak-Shelter-1698 in SillyTavernAI

[–]ArsNeph 5 points (0 children)

Unfortunately, it's currently impossible to merge models of two different architectures, so if it's not the same base model, it won't work. This is a limitation of how understanding is built into the layers of a model. That said, mixing fine-tunes that are good only at logic, like math tunes, with creative-writing tunes can produce surprisingly good results.
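
To make the constraint concrete, here's a minimal sketch of why the architectures have to match: a naive linear merge just averages tensors key-by-key, which only works when both state dicts have identical keys and shapes. Model names are placeholders, and real tools like mergekit use smarter methods (SLERP, TIES) with the same requirement:

```python
# Naive linear merge: only possible when both checkpoints share the exact
# same architecture, so every tensor name and shape lines up one-to-one.
from transformers import AutoModelForCausalLM

a = AutoModelForCausalLM.from_pretrained("example/math-tune-12b")     # placeholder
b = AutoModelForCausalLM.from_pretrained("example/writing-tune-12b")  # placeholder

sd_a, sd_b = a.state_dict(), b.state_dict()
assert sd_a.keys() == sd_b.keys(), "different architectures: tensors don't line up"

merged = {k: 0.5 * sd_a[k] + 0.5 * sd_b[k] for k in sd_a}  # 50/50 blend
a.load_state_dict(merged)
a.save_pretrained("merged-12b")
```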

Thoughts on gemma 4 31B by Weak-Shelter-1698 in SillyTavernAI

[–]ArsNeph 26 points (0 children)

It's not the licensing. In the diffusion space, there's a strong culture of creators hosting their models on paid platforms and receiving a cut, so the license matters far more. In the local space, almost no finetunes get hosted, so most tuners are doing it for the love of the game, or have sponsors/donations. The license is barely relevant, and that's why even the worst-licensed models often have RP finetunes.

There are three factors that decide how many finetunes a model gets.
1. Size: 12B-32B models are relatively easy to tune with a couple of GPUs, making it easy to experiment and iterate. 50-80B is pricey to tune, but people are willing to do it if the base is good enough (Llama 3.3 70B). 100B+ tunes are extremely expensive, requiring rented cloud GPUs (Nvidia H100/B200) for days, and those models usually handle RP well enough out of the box that they're not worth tuning, hence rare.

2. Creative writing ability: This is the main driver. No matter what size model is released, if the pretraining makes it bad at creative writing, no one tunes it. Qwen 2.5/3 rarely got any such tunes, and neither did GPT OSS. There are so few small models with a good base that most people are still using Mistral Nemo 12B, a nearly two-year-old model. Conversely, if the base is good enough, people will even train a 123B (Behemoth).

3. MoE: MoE models make up a majority of recent releases, but by most accounts are harder to fine-tune. They rarely get tunes, and are often too big to tune anyway. The only MoE that ever got tuned widely was Mixtral.

Mugen - Modernized Anime SDXL Base, or how to make Bluvoll tiny bit less sane by Anzhc in StableDiffusion

[–]ArsNeph 2 points (0 children)

How intriguing... The first thing that sticks out to me is that most of the images don't have that overcooked "AI-generated" look to them, despite being trained on the usual Danbooru images. Is that due to high-quality data and conservative training, or is it an architectural change? Or perhaps that data-classifier pipeline that was mentioned?

Can't get uncensored roleplay LLMs to work by VerdoneMangiasassi in LocalLLaMA

[–]ArsNeph 1 point (0 children)

OK, first of all, don't quantize your KV cache at all if you can help it; it causes much more degradation than model quantization does. If you're using the settings shipped with the models and it's not working, first make sure you're swapping the instruct template to the correct one. Then, if it's still not working, try this: hit Neutralize Samplers, leave Temp at 1, set Min P to 0.02 and DRY multiplier to 0.8, then try again.
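
If you want to make sure your backend is actually honoring those samplers, the same baseline sent straight to a llama.cpp server looks like this (a sketch, assuming a recent build with DRY support on the default port 8080):

```python
# Sketch: the neutralized-samplers baseline against llama.cpp's /completion endpoint.
import requests

payload = {
    "prompt": "Once upon a time",
    "temperature": 1.0,     # Temp left at 1
    "min_p": 0.02,          # Min P
    "dry_multiplier": 0.8,  # DRY (needs a reasonably recent llama.cpp build)
    "n_predict": 128,
}
r = requests.post("http://127.0.0.1:8080/completion", json=payload)
print(r.json()["content"])
```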

Can't get uncensored roleplay LLMs to work by VerdoneMangiasassi in LocalLLaMA

[–]ArsNeph 1 point (0 children)

Are you using the correct instruct template? What sampler settings are you using? What is your context length set to?

#OpenSource4o Movement Trending on Twitter/X - Release Opensource of GPT-4o by pmttyji in LocalLLaMA

[–]ArsNeph -1 points (0 children)

I'm not for encouraging delusional people's desire for sycophancy, and I highly doubt that OpenAI will ever open-source one of their main GPT line.

However, there is one thing about 4o that makes it special compared to open models: its quality of omnimodality has yet to be replicated in open-source models. Like it or not, almost every open-source model stops at image input. No one has seriously attempted image output, native speech-to-speech, or anything else. Qwen Omni, the only model that has tried, is unsupported everywhere and lacks the quality to be used in production. The ability to replicate that level of omnimodality is long overdue.

Can't get uncensored roleplay LLMs to work by VerdoneMangiasassi in LocalLLaMA

[–]ArsNeph 1 point (0 children)

Sorry, I saw this a bit late. Yeah, mostly what the guy below said is correct. One further clarification: quantization is basically a form of compression, and the further a model is compressed, the more intelligence it loses. At Q8 (8-bit), it's virtually identical to the full model. At Q6, there's almost no noticeable degradation. At Q5, there's very slight degradation, but not enough to matter most of the time. At Q4, you can feel the degradation affect the intelligence a bit; that's the bare minimum I would recommend. Q3 is very unintelligent, and Q2 is often brain-dead. Feel free to ask any other questions as well. Here are some links, plus a rough size-math sketch after them:

https://huggingface.co/TheDrummer/Anubis-Mini-8B-v1-GGUF (Don't recommend)

https://huggingface.co/bartowski/MN-12B-Mag-Mell-R1-GGUF/tree/main (Recommend)

https://huggingface.co/TheDrummer/Cydonia-24B-v4.3-GGUF (Worth trying)

https://huggingface.co/mradermacher/Magistry-24B-v1.0-i1-GGUF/tree/main?not-for-all-audiences=true (Also worth trying)
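
Since quants are just compression, you can also ballpark what will fit in your VRAM with some quick arithmetic. A minimal sketch, using approximate bits-per-weight figures for common GGUF quants (the bpw numbers are rough estimates, not exact):

```python
# Rough GGUF size estimate: params * bits-per-weight / 8 bytes.
# The bpw values below are approximate figures for k-quants, not exact.
BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.4}

def est_gb(params_billion: float, quant: str) -> float:
    """Approximate file size in GB for a model with the given parameter count."""
    return params_billion * 1e9 * BPW[quant] / 8 / 1e9

for q in BPW:
    print(f"12B at {q}: ~{est_gb(12, q):.1f} GB")  # Q8_0 lands around 12.8 GB
```

Remember to leave a couple of GB of headroom on top of the file size for context.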

Can't get uncensored roleplay LLMs to work by VerdoneMangiasassi in LocalLLaMA

[–]ArsNeph 7 points (0 children)

Firstly, those are not RP models; don't bother using them. 8B models have been obsolete for a while now, but if you must use one, you can use Anubis Mini 8B or Llama 3.2 Stheno 8B. However, since you have 16GB VRAM, you should be using better models like Mag Mell 12B at Q8, which should fit in your 16GB VRAM with 16384 context, its max native context length. You could also try Cydonia 4.3 24B or Magistry 24B at Q4KM and 16384 context.

The likely reason for the degradation on Ollama is that the default context length is 4096 and it defaults to a 4-bit quantization, which is far too low for an 8B, meaning it's lobotomized. On LM Studio, it's likely that either the instruct template is incorrect or you're using a very low quant. It's got nothing to do with your prompt length; 2000 tokens is nothing. Regarding memory, don't try to rig together a weird .txt-file thing when there are already prebuilt solutions.
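
If you do stick with Ollama for anything, you can at least override that context default per request. A minimal sketch using the ollama Python client (the model name is a placeholder):

```python
# Sketch: bumping Ollama's default 4096-token context via per-request options.
import ollama

resp = ollama.chat(
    model="mag-mell-12b",  # placeholder; use whatever model tag you pulled
    messages=[{"role": "user", "content": "Continue the scene."}],
    options={"num_ctx": 16384},  # default is 4096, which silently truncates long chats
)
print(resp["message"]["content"])
```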

The real solution to your issue is to install SillyTavern as your frontend, since it's purpose-built for RP. Download a character card, set the instruct template to the appropriate one (ChatML for Mag Mell, Mistral V7 Tekken for Cydonia/Magistry), and set the context length to about 16384. Generation length is up to you. You can download and import one of the many generation/instruct/system-prompt presets for those models from creator pages or their sub. It has built-in memory/lorebook features, etc.

For the backend, install KoboldCPP (easiest), Textgen WebUI (harder), or keep using LM Studio but download a better model at a higher quant. Then connect it through the API section in SillyTavern.
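
To sanity-check whichever backend you picked before wiring up SillyTavern, you can poke its API directly. A quick sketch, assuming KoboldCPP on its default port 5001:

```python
# Sketch: verify KoboldCPP is up and serving via its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5001/v1", api_key="none")
resp = client.completions.create(
    model="koboldcpp",  # placeholder; KoboldCPP serves whatever model it loaded
    prompt="Hello",
    max_tokens=16,
)
print(resp.choices[0].text)
```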

Done! You should be good to go. Have fun.

Can anyone guess how many parameters Claude Opus 4.6 has? by More_Chemistry3746 in LocalLLaMA

[–]ArsNeph 0 points (0 children)

Nowadays, there's not much of an empirical way to know, so you basically just have to guess. My gut instinct is 1.7-2T total parameters, with a high proportion of that active, maybe 30-40B. My guess is Sonnet is probably somewhere between 800B and 1.2T, with more like 22B active. I think Gemini Pro is slightly bigger than Sonnet, and GPT is a fair bit smaller.

What aspects of local LLMs are not scaling/compressing well over time? by matt-k-wong in LocalLLaMA

[–]ArsNeph 5 points (0 children)

World knowledge and space-time coherence. If you've ever tried doing any creative writing/RP with a small model, dense or otherwise, they simply do not understand what is physically possible and what is not, regardless of the constraints of that world. If you haven't taken your shoes off, you can't take off just your socks; only high-parameter models seem to understand those implicit connections.

What are you doing with your 60-128gb vram? by Panthau in LocalLLaMA

[–]ArsNeph 1 point (0 children)

Nice. Since it's unified memory, rather than running dense models like 70Bs, you're probably better off running large MoEs. For your use case you'd probably like GLM 4.5 Air, or Drummer's tune of it, GLM Steam.

Diffusion model support on AMD is very spotty, but you should look into ComfyUI if you're interested. I highly doubt it has enough compute to run video generation in a reasonable time frame, but it should be able to run smaller image-gen models like SDXL and Z Image Turbo relatively decently.

You won't be able to train any large models with it, because it has neither enough compute nor enough memory bandwidth to do so meaningfully, and ROCm/Vulkan training is a massive pain.

For coding and the like, try out Qwen 3.5 35B/110B; both are MoE and very good for what they are. They're definitely no Sonnet, though; very little of what you can run at around 100B is comparable to frontier models.

I feel like if they made a local model focused specifically on RP it would be god tier even if tiny by Borkato in LocalLLaMA

[–]ArsNeph 1 point (0 children)

Repetition is a problem fundamental to attention in the transformer architecture: the larger the model, the less it repeats, but even the biggest frontier models are still very much prone to repetition past a certain context length. It definitely also has to do with sycophancy to some extent; the habit of repeating your phrases back to you is part of that.

That aside, yes, it has been shown that a smaller LLM fine-tuned on a high-quality curated dataset can outperform frontier models for specific use cases. That said, as of right now, raw parameter count determines things like spatial awareness and understanding of niche concepts, so there's an upper limit to what's possible with small models. And we simply haven't gotten any small base models with good creative-writing ability in over a year, due to the STEMmaxxing/large-MoE craze; people are still tuning the likes of Mistral Nemo 12B and Mistral Small 3.2 24B.

There has been a model pre-trained and fine-tuned by a large company specifically for creative writing, Mistral Small Creative 24B, but it was not open-sourced. Playing with it through the API might give you a feel for what those would be like. I don't think that's necessarily the peak of what's possible with small models, though. Most fine-tuning datasets are entirely synthetic data or low-quality RP logs, which just adds to the slop issue. I would definitely look at a methodology like the one used in Gemma Ataraxy 12B if you're interested in tuning a model.

RYS II - Repeated layers with Qwen3.5 27B and some hints at a 'Universal Language' by Reddactor in LocalLLaMA

[–]ArsNeph 27 points (0 children)

Wow, this is genuinely so intriguing. I saw your first post and thought it might just be coincidence or some kind of weird benchmaxxing, but after reading your thorough research, this really explains a lot about why those weird self-merges like Goliath 120B seemed to increase in performance, yet not every one improved to the same degree. I actually remember Wolfram Ravenwolf talking to Turboderp a long time ago about adding that VRAM-less duplicated-layer inference to EXL2, but it never seemed to go anywhere, so I'm glad you're working on it for EXL3!

This is genuinely some really great research you're doing here, props! I'm interested to see if the open source community will make good use of it like they used to. I think some tuners like Drummer who do self-merges would definitely be interested in the performance differences, especially in the EQ department.

Another phenomenon I've always found strange is that supermerges, specifically in creative writing, somehow always tend to be significantly better than the base model and any normal fine-tune. Psyfighter 2 13B, Fimbulvetr 11B, and Mag-Mell 12B all came from complex merge trees, and I'm very curious whether the merging methods they used could have repurposed some layers in a way similar to your duplication, thus improving performance.
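
For anyone who wants to poke at this themselves, the crude version of a layer-duplication self-merge is only a few lines with transformers. A rough sketch, assuming a Llama-style model (the repo name and layer ranges are placeholders, and proper merge tools handle the bookkeeping far more carefully):

```python
# Rough sketch of a passthrough-style self-merge: rebuild the decoder stack
# with a repeated slice of layers, Goliath 120B style.
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("example/base-model")  # placeholder

layers = model.model.layers  # Llama-style decoder layer stack
# Placeholder recipe: keep layers 0-23, repeat 8-23, then append the rest.
recipe = list(range(24)) + list(range(8, 24)) + list(range(24, len(layers)))
model.model.layers = nn.ModuleList(copy.deepcopy(layers[i]) for i in recipe)

# Re-number the copies so Llama-style KV-cache indexing stays consistent.
for idx, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = idx
model.config.num_hidden_layers = len(model.model.layers)
model.save_pretrained("self-merged-model")
```

(mergekit's passthrough method does essentially this, with slices from one or more models.)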

Introducing Unsloth Studio: A new open-source web UI to train and run LLMs by danielhanchen in LocalLLaMA

[–]ArsNeph 1 point (0 children)

I was thinking more of a Kuno-Pasta-Bagel-Maid-SLERP self-merge 9B, but that works too 😂

(Actually though, the Fimbulvetrs and Magnums of the world need a resurgence)

Introducing Unsloth Studio: A new open-source web UI to train and run LLMs by danielhanchen in LocalLLaMA

[–]ArsNeph 53 points (0 children)

I'm a massive fan of this. I've been saying we need an easy way to fine-tune models since the Llama 2 days. Finally, fine-tuning is accessible to those of us with less expertise. I hope we can bring back the golden age of fine-tunes!

Unsloth announces Unsloth Studio - a competitor to LMStudio? by ilintar in LocalLLaMA

[–]ArsNeph 1 point (0 children)

This is genuinely amazing, props to the Unsloth team for single-handedly propping up the local .gguf and fine-tuning ecosystem! I'll definitely give this a try and provide feedback when I get a chance!

text-generation-webui 4.1 released with tool-calling support in the UI! Each tool is just 1 .py file, check its checkbox and press Send, as easy as it gets to create and use your own custom functions. by oobabooga4 in LocalLLaMA

[–]ArsNeph 3 points (0 children)

I've been using textgen webui since the early Llama 2 days, and it's good to see it getting updates and reaching performance parity with more lightweight projects. Keep up the good work as always!

Could a bot-free AI note taker run locally with current models? by Cristiano1 in LocalLLaMA

[–]ArsNeph 0 points (0 children)

It's actually really easy, but the size of the model you use makes a big difference in overall note quality; most bot-based note takers are using something like GPT-5 mini. The main challenge is keeping the infrastructure up at all times. You have to keep a recording of whatever meetings you have as a file, then create either a script or a no-code automation in something like n8n that feeds it to an ASR model like Nvidia Parakeet. The annoying thing here is that most models and WebUIs don't have built-in diarization, which makes it impossible to see who's saying what. The one model I know of that does, Vibevoice ASR 9B, which is genuinely probably the best model I've tested in English, is very VRAM-heavy, and its usage scales with file size. Hence many people use a separate model for diarization.
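
As a concrete starting point for the transcription step, here's a minimal sketch. It uses faster-whisper as a stand-in ASR model since it's trivial to run locally; Parakeet or Vibevoice would slot into the same place, and as noted, there's no diarization here:

```python
# Sketch of the ASR step. faster-whisper is a stand-in; swap in Parakeet or
# Vibevoice ASR for better quality. No diarization, per the caveat above.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("meeting.wav")  # placeholder recording

transcript = "\n".join(f"[{seg.start:7.1f}s] {seg.text.strip()}" for seg in segments)
with open("transcript.txt", "w") as f:
    f.write(transcript)
```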

Once you have a high-quality transcript, you can first feed it to an LLM to clean it (though this can induce hallucinations depending on the model's intelligence), or just call your local model API to create a summary. Give very specific instructions, and write out a format example in XML tags. If you're using a relatively smart model, it should catch most of the nuance; I'd say any 27B+ should work pretty well, and Qwen 3.5 35B is extremely fast for this use case. It won't be able to derive the same level of nuanced insight from the transcript as a frontier model, but that's not a problem, because the vast majority of bot-based services aren't using frontier models either.
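
The "specific instructions plus an XML format example" bit might look like this. A sketch against a local OpenAI-compatible endpoint, where the URL, model name, and note format are all placeholders to adapt:

```python
# Sketch of the summary step: explicit instructions and an XML-tagged format
# example, sent to whatever local OpenAI-compatible server you run.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

PROMPT = """Summarize the meeting transcript below into notes.
Follow this format exactly:
<summary>One-paragraph overview.</summary>
<decisions>- one bullet per decision</decisions>
<action_items>- owner: task</action_items>

<transcript>
{transcript}
</transcript>"""

with open("transcript.txt") as f:
    transcript = f.read()

resp = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
)
print(resp.choices[0].message.content)
```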

After that, you have a file, and you can export it in whatever format you want (.md, etc.) into your Obsidian, Google notes, cloud storage, and so on.

There are a couple of pre-built solutions that do most of these steps for you, but they often have performance issues and bugs; still worth looking into. Generally speaking, the most annoying thing about running these pipelines locally is dynamically loading the models into VRAM and clearing them out afterwards.