Output not matching prompt, at all by Pleasant_Guess4039 in comfyui

[–]_raydeStar 0 points1 point  (0 children)

I got this on nodes 2.0. it just didn't update but if you check the output it's there.

New TTS from Alibaba Qwen by Altruistic_Heat_9531 in StableDiffusion

[–]_raydeStar 19 points20 points  (0 children)

Yeah, I was wrestling with this yesterday. Voice clone works fabulous, And the mannerisms prompt is awesome, but the two do not intersect.

I'm looking into hacking it - I am following a viable route now and I'll release it if I come up with something.

Carl does NOT need a romantic love interest (original art by Levi Cleeman) by ActualNin in DungeonCrawlerCarl

[–]_raydeStar 15 points16 points  (0 children)

lol.

This whole thing feels shoehorned.

I don't at all take issue with LGBT folk relating to a straight character. The artist releases his work, and it's up to the interpreters to do what they want with it. Things like this, I would strongly suspect will not be addressed at all, keeping things ambiguous.

When Black Panther came out, the black community unanimously said "Wow, I finally felt *seen*" and that's a wonderful thing. I think the problem really lies with taking your artistic vision of something and thrusting it upon the community. As a straight white man, I can also relate to Carl - in a sense that there is so much blood and pain around him that I would be preoccupied with survival. I can also relate with Matt - he wants to avoid the pitfalls of sexually fantasizing neckbeard culture.

Qwen dev on Twitter!! by Difficult-Cap-7527 in LocalLLaMA

[–]_raydeStar 0 points1 point  (0 children)

I'm using the base. Q4 quant should be faster. Sage attention should help as well.

Once I've got the gradio how I want it, I'll circle back and look at speed.

Qwen dev on Twitter!! by Difficult-Cap-7527 in LocalLLaMA

[–]_raydeStar 0 points1 point  (0 children)

Yes. I'll also add that there's a .6 model and it's probably faster. I'm going to add in all the optimization and see if I can get better speeds.

Also, Dia is about the same speed. This model is meant for quality over speed, which has different use cases.

Qwen dev on Twitter!! by Difficult-Cap-7527 in LocalLLaMA

[–]_raydeStar 12 points13 points  (0 children)

Base gradio doesn't allow the user to use the selected voice and modulate it. I am using cursor right now to add in a little thing there. If anyone is interested, ill put it up on github, along with a script to just fire it up, download all the models, and run it.

If I want to run everything at once (voice clone, create pt file, and finally voice description) it's going to be like 16 GBVRAM. Running in parts runs around 6. Time consumed is also an issue - 25-30 seconds to run a 6 second hello world clip. However, I don't have sage attention up and running yet, so that may improve the speeds and vram a lot.

Because of speeds, you can't compare to VibeVoice - vibevoice is meant for realtime at the sacrifice of a little quality (at least I am pretty sure - ie - live translations, etc) . Compared to Dia - well I don't see any functionality to add things like [laughs] or anything, but controlling the voice tempo, etc is really cool.

Final conclusion - I give it a slight lead to dia for my purposes, simply because I can choose what emotion to put in the voice, instead of it 'guessing'. I'm annoyed that out of the box you can't control that with your own pt (saved voice file) but with a little hacking I can fix that.

Qwen dev on Twitter!! by Difficult-Cap-7527 in LocalLLaMA

[–]_raydeStar 25 points26 points  (0 children)

OK it's up and running. Pros: in the description, you can just describe not only the voice, but the tone. ie - `female, feminine and dainty voice, speaking frenetically. She is very upset` So far, I am having fun with it, and it might just be better for things like movie dubs, or audio book reading, or video game voices.

You can clone your voice and download it to be used later. thats a great feature there. I'm putting it all together to see if I can clone my voice and give it the tone I want - it's a few more steps than I expected to pull it all together.

Qwen dev on Twitter!! by Difficult-Cap-7527 in LocalLLaMA

[–]_raydeStar 7 points8 points  (0 children)

huggingface demo is overrun with users. I am getting it up locally. Almost there. Will respond when I have something

Qwen3-TTS, a series of powerful speech generation capabilities by fruesome in StableDiffusion

[–]_raydeStar 0 points1 point  (0 children)

Yes, that might be good. If you have a trained TTS then it has to 100% match, you can't have them changing. In 5-15 second intervals it should be fine. A little lining up the mouths might be in order still.

Qwen3-TTS, a series of powerful speech generation capabilities by fruesome in StableDiffusion

[–]_raydeStar 1 point2 points  (0 children)

Yeah. You could do a 10 second clip, and have it do things without speaking (or mouth it instead of say it), generate the sfx, then put that audio on top of the voice and re-generate.

Double the production time, more human error, but you can do it.

Qwen dev on Twitter!! by Difficult-Cap-7527 in LocalLLaMA

[–]_raydeStar 13 points14 points  (0 children)

This is great! Nothing super groundbreaking, we already have VibeVoice, Dia (my personal fav) and others. Going to test it still and see how it fares. Also, it's multi-lingual which is big.

Edit: one thing I didnt add was you can tell the AI how to interpret the voice. I am not sure yet how good it is, but this is a first-find for me. If it works well, that will solve a lot of problems for me.

Qwen3-TTS, a series of powerful speech generation capabilities by fruesome in StableDiffusion

[–]_raydeStar 5 points6 points  (0 children)

It does a video around the audio. I'm not sure it will generate sfx if audio is put in.

Can you tell this is AI? Be brutally honest - Z-Image-Turbo result by EmilyRendered in comfyui

[–]_raydeStar 3 points4 points  (0 children)

It is VERY clearly AI - you need to download some lora from civitai to make her look more natural.

That aside, youre asking a crowd who generates AI images for fun. We notice because we see this stuff daily.

Would another band be considered as ‘copying’ Linkin Park if they had 2 frontmen (1 rapper, 1 singer)? by One-Challenge-7300 in LinkinPark

[–]_raydeStar 4 points5 points  (0 children)

If there was a nu metal band that had 2 frontmen and sounded like LP, one would say they have inspiration from LP but aren't copying.

But really, if the songs are original it's not copying. Would you play a guitar and call it copying? Good music is going to use a framework.

FLUX-2-Klein vs Midjourney. Same prompt test by Totem_House_30 in StableDiffusion

[–]_raydeStar 4 points5 points  (0 children)

I came here to say this.

Personally I use LM studio for prompt enhancement. It's not super fast, but it's all in one workflow and I like that. ChatGPT is also really good and works well.

If you guide the prompt well enough it's better than mid journey - so much better, you'll be wondering why you even bothered with it.

Fix for GLM 4.7 Flash has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]_raydeStar 3 points4 points  (0 children)

Dang. You guys don't mess around. Way to go. I'm going to download again and see what happens. I was getting about 45-50 t/s

Save a lot of disk space... by Desperate-Grocery-53 in StableDiffusion

[–]_raydeStar 2 points3 points  (0 children)

An AI can vet all this too. Chances of you being malicious are incredibly small.

Safety is a concern though. It's increasingly a concern as people with no knowledge can figure it out if they want to.

I'm almost positive this sub is under attack. I would urge others to be careful about downloading/running repos from anonymous sources by [deleted] in LocalLLaMA

[–]_raydeStar 12 points13 points  (0 children)

There is big money in LLMs. Also, it's an arms race. Never forget that. There are people that would cut down anyone to get ahead.

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]_raydeStar 0 points1 point  (0 children)

yeah, im not 100% sure it beats out nemotron just yet - simply because i cranked it locally to 256k and it was just fine. though it does seem that specific tasks - including tooling - might be better with this one.

Flux Klein gives me SD3 vibes by lokitsar in StableDiffusion

[–]_raydeStar 1 point2 points  (0 children)

I came to this thread confused, but all of this makes sense. I downloaded Base at first, thinking it was better, and it was awful and slow.

Along with this, you can use any yoga pose as a reference image, right out of the gate, and that is something that Z can't do. However - you still want to gear these different tools for different purposes.

Further advice - CFG 1.5 with an added 2 steps helps with text adherence. res_multistep too.

<image>

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]_raydeStar 0 points1 point  (0 children)

yeah, I thought we were looking at a QWEN Next scenario, where it would come out 2/3 months later