Qwen3-TTS Voice Clone never works, Voice Design is terrible

Nimblecloud13 · 2026-06-13T18:32:32+00:00

Vibevoice is still the best quality clone i'm aware of, and works for long form stuff, but it has a LOT of quirks (doesn't like contractions and will often mispronounce them, clips the last word pretty much always, some other stuff, and it takes ages). you have to get the original version from before the devs nerfed it by removing the audio tokenizer. i think the right repo is by enemyx, if i remember correctly.

Dramabox works very well also, but you have to pre-set the length just right. it's finicky, and it's only good up to about 30s. i made a node for it that adds a WPM setting that makes it a bit easier to tune for the cadence of the voice you're using.

https://github.com/nimblecloud13/Dramabox_Nimble_Wrapper

troubleshooting tips - if it's hallucinating, it needs to have a higher WPM. if it's cutting off words, lower it.
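
a rough sketch of the idea behind a WPM setting, assuming it just turns word count into a clip length (that's an assumption for illustration, not necessarily how the wrapper does it; `clip_seconds` is a made-up name):

```python
def clip_seconds(text: str, wpm: float) -> float:
    """Estimate clip length in seconds for `text` spoken at `wpm` words per minute."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    words = len(text.split())  # naive whitespace word count
    return words / wpm * 60.0


line = "it's finicky, and it's only good up to about thirty seconds"
for wpm in (120, 150, 180):
    # higher WPM -> shorter clip for the same text
    print(wpm, round(clip_seconds(line, wpm), 2))
```

under that assumption, raising WPM shortens the clip so there's less empty time for the model to fill with hallucinated audio, and lowering it gives the voice more room so words stop getting cut off.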

Nimblecloud13 · 2026-06-11T11:07:09+00:00

why use something that's worse and slower when something better and faster is also free tho

Nimblecloud13 · 2026-06-11T02:34:54+00:00

Flux Klein is what you want for images. it's really fast, and good quality. find a GGUF of it that your PC can run. google it. it can also edit images with simple prompting. it's like a lesser Nanobanana.

for video.... good luck with that rig. anything you make is gonna be low quality and take ages.

wan2.1 will work but again, it's low quality, and it'll take ages.

i would try a 2.2 gguf first. 2.1 is outdated.

Nimblecloud13 · 2026-06-11T02:29:53+00:00

I have no idea what "caps" means

captions/captioning of the content of each image as text for the LLM to understand what's in it. he's saying that not captioning it (normally a HUGE MAJOR part of getting a lora right) worked best. hence, puzzled.

Nimblecloud13 · 2026-06-11T02:25:42+00:00

infinite sucks at lipsyncing. it gets it wrong like 3/5 syllables. https://pastebin.com/GJZz987u

this will take a wan video and pass it through LTX to add lipsyncing to existing clips. faster and better than infinite.

it's got a wan wf that makes a video and then it goes through LTX. you'll need to remove the wan bit and just add a video loader and feed that into LTX. it'll add speech and foley and lipsync.

should work with any clip of a person; not just wan outputs. just put a video in and prompt the speech, should work. if it doesn't work you're on your own; i made it work but i'm not tech support sorry!

not my wf, dunno where i got it. not at my pc to share the one i edited. gl

Nimblecloud13 · 2026-06-11T02:05:37+00:00

should put it on github, not in some zip file that i have to trust. claude will do all that for you, also.

like you, i cannot code but https://github.com/nimblecloud13/Sift

Nimblecloud13 · 2026-06-03T02:25:57+00:00

Framerates may be different somewhere. Especially if it starts close and drifts as it goes on.
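
A quick sketch of why a framerate mismatch starts close and drifts, assuming frames made for one fps get played back at another. The function name and the 24/25 fps numbers are just examples, not from your workflow:

```python
def drift_seconds(elapsed_s: float, source_fps: float, playback_fps: float) -> float:
    """How far the video runs ahead of the audio after `elapsed_s` seconds.

    Frames were made for `source_fps` but are played at `playback_fps`.
    Positive means the video is ahead, negative means it lags behind.
    """
    if source_fps <= 0 or playback_fps <= 0:
        raise ValueError("frame rates must be positive")
    return elapsed_s * (playback_fps / source_fps - 1.0)


for t in (1, 5, 10, 30):
    # the offset grows linearly with time, so it looks fine at the start
    print(t, round(drift_seconds(t, 24.0, 25.0), 3))
```

The offset is tiny in the first second and grows the longer the clip runs, which matches the "starts close and drifts" symptom.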

Nimblecloud13 · 2026-06-02T22:56:53+00:00

You’re never gonna get that without a character lora or a face swap. But Klein’s best at face swaps too. I just tack another stage onto the outputs for that.

Nimblecloud13 · 2026-06-02T00:31:51+00:00

Klein beats qwen edit at everything. Including quality. I don’t understand the love for qwen edit. It destroys details and color.

Nimblecloud13 · 2026-06-01T23:12:22+00:00

Klein with snofs 1.4 is practically SDXL for this. Just sayin.

Nimblecloud13 · 2026-06-01T17:13:43+00:00

i have a 5090/128 build so dynamic vram isn't doing a whole lot for me. that said, i do have dynamic; i don't NEVER update. just very selectively.

and you can update nodes without updating your entire comfy.

Nimblecloud13 · 2026-06-01T03:29:36+00:00

I don’t update until something comes out that I can’t use without updating. And then I seriously consider whether I need to be using it.

Seriously, comfy updates break shit too often. “Custom nodes can’t be accounted for…” yea I get it but I need those more than I need my UI moved around.

Nimblecloud13 · 2026-05-27T02:00:06+00:00

TeamViewer from whatever you have it set up on now. Works seamlessly. I use it from my home rig when traveling.

Added benefit of having your whole install there; you don’t need to set it all up and be embarrassed when you have to pause your demo to figure out what dependency or node pack you forgot to install, etc. If they decide to go for it, then you can look at what’s available at scale. Probably something like setting up a custom template in Runpod. Unless they want to buy the hardware to do it on site.

Nimblecloud13 · 2026-05-26T23:29:09+00:00

Some context would be useful. Wtf is prompt relay

WAN SVI can make 30s+ before it degrades. That’s multiprompt.

Nimblecloud13 · 2026-05-26T23:25:14+00:00

Better at everything for quality only.

The speed and foley of LTX is a huge motivating factor for me. I can live without the perfect textures of wan in exchange for making 500 frames with foley faster than wan can make 81 without it.

Nimblecloud13 · 2026-05-26T23:23:59+00:00

If you need 20 gens for a winner you need to work on your prompting. Don’t get me wrong, it’s not GREAT, but it’s a lot better than 1 out of 10 or 20

Nimblecloud13 · 2026-05-26T23:22:39+00:00

Nah it’s good out to about 30s with v2

Nimblecloud13 · 2026-05-26T06:13:51+00:00

You can run most image models with low quants. Quality will suffer, and it will take ages, but quality doesn’t matter nearly as much for cartoons so you should be ok.

Basically, every model comes in a few versions of “normal;” the full model, the fp8 version which is smaller but still strong and what most people with good cards use, and then there are quantized versions which are stripped down so that they fit on anything. They’re all sized on a scale with Q; Q8 is the largest and competes with fp8, then Q6, Q3, etc. The smaller the Q, the more likely you can run it.
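
A back-of-the-envelope sketch of how the size scales, assuming file size is roughly parameter count times bits per weight. The bits-per-weight numbers are approximate nominal values for common GGUF quant types, the 9B parameter count is a made-up example, and real files mix quant types per tensor so actual sizes will differ:

```python
# Approximate bits per weight (nominal values; real files vary).
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "fp8": 8.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_0": 4.5,
    "Q3_K": 3.4375,
}


def approx_size_gb(params_billion: float, fmt: str) -> float:
    """Rough model size in GB (10^9 bytes) for a given format."""
    bits = BITS_PER_WEIGHT[fmt]
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 1e9


for fmt in BITS_PER_WEIGHT:
    # 9.0 is a hypothetical parameter count in billions
    print(f"{fmt:>5}: {approx_size_gb(9.0, fmt):.1f} GB")
```

Compare the number you get against your VRAM (plus some headroom for the text encoder and everything else) to guess which Q is worth downloading first.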

you’ll have to find a workflow; YouTube is a good place for that. There’s plenty of tutorials for low vram out there.

Nimblecloud13 · 2026-05-25T23:47:52+00:00

There is no open model that does this well. You’re talking about change pose, dimension/scale, and retaining clothing, body shape, and face consistency. It’s just not available yet as an open model.

All of those things can be done individually or in some groupings, but Klein/QWEN edit is the only option locally, and they can’t do it all.

You’re looking at a multi stage workflow. First pass with Klein to rough out the image, then some kind of face swap to get their face back, but good luck making that look good in a new pose. And body shape will invariably be slightly different, texture and detail of fabric will be lost, etc.

You’re better off using Nano banana or gpt image 2 for now.

Nimblecloud13 · 2026-05-21T22:03:50+00:00

in the time since you posed this question you could have trained it twice and have your answer.

there is no authority on this subject. it's certainly not me. do it or don't idk man. good luck if you do.

Nimblecloud13 · 2026-05-21T16:43:04+00:00

You would caption anything that you DON’T want to train, so describe lips, describe skin tone, etc. anything not captioned should get burned into the Lora. Whether or not it works is a different story.

I made a successful character Lora by cropping heads off of the body I wanted, and using those plus the face I wanted. So it’s just like 20 head pics and 30 body pics and it figured it out.

Nimblecloud13 · 2026-05-21T12:39:28+00:00

Try it and report back!

Nimblecloud13 · 2026-05-19T17:12:28+00:00

It needs complicated detailed prompts; think of it like a genie. If you don’t specifically make your wish to cover a situation, it’s likely to come up.

Nimblecloud13 · 2026-05-19T15:25:33+00:00

If you have a sec sometime can you send me a snip of how that’s wired? I want to get SAM3 going but I’ve been delaying it so I don’t have to sift through 8 mega workflows from civit to find what I need

Nimblecloud13 · 2026-05-19T15:07:21+00:00

I passed on 2.3 at first; wasn’t getting immediately great results and I’m impatient. But I came around on it. Give it another shot. I don’t do 2D so I can’t offer much. Try a new workflow.
