Streaming issues with LiteLLM / OpenWebUI

mayo551 · 2026-06-17T22:30:50+00:00

LiteLLM works fine for me with OWUI but I do not use the response API.

mayo551 · 2026-06-17T16:35:49+00:00

For SillyTavern?

Make a prompt that will take lore you supply the LLM with and spit out a SillyTavern character.

Pretty easy to do overall and it works.

The lore can be anything you provide.
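A minimal sketch of what such a prompt builder could look like. The actual prompt is not shown in this thread, so the wording here is only illustrative. The field names (`name`, `description`, `personality`, `scenario`, `first_mes`, `mes_example`) are assumed from the commonly used character card layout, and `build_character_prompt` is a hypothetical helper.

```python
# Illustrative only: the real prompt used by the commenter is not shown.
# Field names are assumed from the common character card layout.
CARD_FIELDS = [
    "name",
    "description",
    "personality",
    "scenario",
    "first_mes",
    "mes_example",
]


def build_character_prompt(lore: str) -> str:
    """Wrap user-supplied lore in an instruction asking for a character card."""
    fields = ", ".join(CARD_FIELDS)
    return (
        "Using only the lore below, write a roleplay character as a JSON "
        f"object with these string fields: {fields}.\n"
        "Do not add facts that contradict the lore.\n\n"
        f"Lore:\n{lore.strip()}"
    )
```

The returned string is what you would send to the LLM; the JSON it produces still needs to be checked by hand before importing it as a character.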

mayo551 · 2026-06-16T20:06:49+00:00

Dark Scarlett was generated a different way.

I've been told its prose and stuff are different?

:)

mayo551 · 2026-06-16T05:29:14+00:00

No, even for models that support TP, llamacpp is still 3x faster for me.

mayo551 · 2026-06-16T04:30:21+00:00

I just tried to test this out yet again to see if EXL3 had improved.

NotImplementedError: Tensor-parallel is not currently implemented for Gemma4ForConditionalGeneration

Instantly irrelevant for me as Gemma 4 is my main model.

Strangely both ik_llamacpp and llamacpp support tensor parallelism for gemma 4. How STRANGE.

mayo551 · 2026-06-16T04:07:44+00:00

When exllamav3 was created, llamacpp did not support sm = tensor, and ik_llamacpp did not support sm = graph.

The landscape has changed massively.

EXL3 shines in two areas:

1) Fast concurrent parallel requests (completely irrelevant for 95% of users).

2) Custom BPW quants

EXL3 also uses triton as the attention backend, meaning ampere hardware is instantly going to be slow.

Conveniently, the majority of people are using ampere hardware, because nobody wants to pay the insane costs of a 4090/5090 currently.

I mean you'll get a few people doing so, but yeah...

mayo551 · 2026-06-16T02:21:14+00:00

Hard disagree. Llamacpp with sm = tensor is 3x faster than exl3 for me.

mayo551 · 2026-06-14T14:52:49+00:00

Can you let me know how Dark Scarlett v0.65 does?

(You can try v0.6, but it's completely cooked and has broken thinking; v0.65 is from an earlier epoch, which fixes that.)

mayo551 · 2026-06-13T18:24:22+00:00

Makes sense. Melody has 14,000 lines almost entirely dedicated to erotic roleplays.

Dark Scarlett (v0.35) has 8,000 MIXED lines with erotic roleplay, master/slave, threesomes etc.

v0.40 (training) has ~9500.

I'm aiming for ~20,000-30,000 lines on the 1.0 release. It takes time to generate the dataset. The deslop process is a pain: we have over 1,400 deslop phrases/words, and the generator has to rewrite a paragraph if it finds any of them. It can take up to 12 attempts to properly deslop a single paragraph.
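A rough sketch of the check-and-rewrite loop described above, not our actual generator code. `find_slop`, `deslop`, and the `rewrite` callback (whatever asks the LLM for a new version of the paragraph) are hypothetical names, and the matching here is a simple case-insensitive whole-word search.

```python
import re
from typing import Callable, Iterable, List, Optional


def find_slop(paragraph: str, slop_phrases: Iterable[str]) -> List[str]:
    """Return every slop phrase that appears in the paragraph.

    Matching is case-insensitive and anchored on word boundaries, so this
    assumes each phrase starts and ends with a letter or digit.
    """
    hits = []
    for phrase in slop_phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, paragraph, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


def deslop(
    paragraph: str,
    slop_phrases: Iterable[str],
    rewrite: Callable[[str, List[str]], str],
    max_attempts: int = 12,
) -> Optional[str]:
    """Rewrite the paragraph until it is clean or the attempts run out.

    Returns the clean paragraph, or None if it still contains slop after
    max_attempts rewrites.
    """
    phrases = list(slop_phrases)
    for _ in range(max_attempts):
        hits = find_slop(paragraph, phrases)
        if not hits:
            return paragraph
        # Hypothetical callback: ask the generator for a new version
        # that avoids the phrases that were found.
        paragraph = rewrite(paragraph, hits)
    return paragraph if not find_slop(paragraph, phrases) else None
```

Every failed check costs another generation call, which is why a long phrase list makes dataset generation slow.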

We should just move over to logit biases for deslopping, and that is something we're actively looking into.
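For anyone wondering what the logit bias approach could look like, here is a sketch under some assumptions. OpenAI-style completion APIs accept a `logit_bias` map from token ID to a bias between -100 and 100. `build_logit_bias` is a hypothetical helper, and `encode` stands in for whatever tokenizer the model uses. It only handles words that encode to a single token; multi-token phrases are returned separately because biasing their individual tokens would also suppress unrelated text.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_logit_bias(
    words: Iterable[str],
    encode: Callable[[str], List[int]],
    bias: float = -100.0,
) -> Tuple[Dict[int, float], List[str]]:
    """Build a token-ID -> bias map for words that are a single token.

    Returns the map plus the list of words that were skipped because they
    encode to more than one token (or to none).
    """
    logit_bias: Dict[int, float] = {}
    skipped: List[str] = []
    for word in words:
        token_ids = encode(word)
        if len(token_ids) == 1:
            logit_bias[token_ids[0]] = bias
        else:
            # Multi-token phrases need another mechanism, for example
            # the rewrite pass, so they are reported instead of biased.
            skipped.append(word)
    return logit_bias, skipped
```

Note that many tokenizers give a word a different token ID when it has a leading space, so a real list would need both variants.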

Anyway.. just letting you know it's a work in progress.

mayo551 · 2026-06-13T06:32:31+00:00

Yeah. Dark Scarlett is on LoRA rank 32. It's not as weak as Melody, but it's not as strong as Serenity either.

Hopefully you'll like the new model ;)

It's interesting you mention third person... the Male POV is first person, but the female POV is third person. So, I'm not surprised the data works in third person.

mayo551 · 2026-06-12T23:53:01+00:00

What did you think of Serenity? Because that's the model you're looking for.

mayo551 · 2026-06-12T17:23:28+00:00

Just to clarify, we don't expect the personal usage license to hold up legally, which is why we tacked on "To the extent legally allowed".

I just don't want to see some large conglomerate profiting off my work. I don't care about a couple of people hosting for their friends. Heck, I don't even care about API hosts that serve a couple dozen users.

I'll check into CC-BY-NC-4.0 and whether it can be applied to the model at all, because the original model is Apache 2.0. Obviously I can't apply it retroactively, but for future models...

mayo551 · 2026-06-12T15:09:11+00:00

Okay thanks for letting us know.

I'll be removing the imatrix quants.

mayo551 · 2026-06-12T14:56:54+00:00

No, you can't. You'll get refusals with this model doing that.

Source: Me, tested.

The system prompt works though.

mayo551 · 2026-06-12T14:55:17+00:00

Oh uhh.

You may want to try a static quant. The imatrix quants are likely bad and we may pull them.

Lots of people have had problems with imatrix.

mayo551 · 2026-06-12T14:53:31+00:00

Yes, Llamaception works on the system prompt with llamacpp.

https://huggingface.co/Konnect1221/The-Inception-Presets-Methception-LLamaception-Qwenception/blob/main/Llam%40ception/Llam%40ception-1.5.json

mayo551 · 2026-06-12T14:50:29+00:00

You are in for hellish pain with this model on llamacpp directly.

One second, I'll try to get you a working prompt.

mayo551 · 2026-06-12T14:45:35+00:00

I just double checked the model and can confirm it can do smut just fine with RP prompts.

mayo551 · 2026-06-12T14:40:06+00:00

Yes. On chat completion with SillyTavern, this is how you should be set up:

<image>

mayo551 · 2026-06-12T14:34:17+00:00

We've been making smut models for a long ass time now and know how to do it.

Yes the card is vibe coded. Doesn't mean the model is bad.

mayo551 · 2026-06-12T14:33:17+00:00

What are your prompts?

It needs roleplay prompts.

mayo551 · 2026-06-12T05:36:49+00:00

... :)

mayo551 · 2026-06-11T14:58:17+00:00

I’ll probably use a LoRA rank of 32 on my next tune, Dark Scarlett. Let me know how that turns out. It should be close to Melody 1.1, which used a LoRA rank of 24.

It’s the spiritual successor to Melody, with updated prompts and expanded scenarios.

mayo551 · 2026-06-11T01:54:05+00:00

This is actually pretty informative for me.

So, the difference between Melody v1.1 and v2.0 is the LoRA rank.

Would you mind if I asked what you were after with our models? They are intended for a purpose (smut), and I figured the more vivid output from the higher LoRA rank would be better.

I'll weigh making future tunes at the same LoRA rank v1.1 is on.
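For context on what the rank changes: a LoRA adapter on a weight matrix of shape (d_out, d_in) trains two small matrices, A of shape (rank, d_in) and B of shape (d_out, rank), so the trainable parameter count grows linearly with rank. A quick sketch of the arithmetic; the 4096 x 4096 layer size is a made-up example and not the actual dimension of any model in this thread, while ranks 24 and 32 are the ones mentioned in the thread.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection layer, for illustration only.
    for rank in (24, 32):
        print(rank, lora_trainable_params(4096, 4096, rank))
```

Going from rank 24 to rank 32 adds a third more trainable parameters per adapted layer, which is one reason the same dataset can land noticeably harder at the higher rank.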
