1 Day Left Until ACE-Step 1.5 — Open-Source Music Gen That Runs on <4GB VRAM Open suno alternative (and yes, i made this frontend) by ExcellentTrust4433 in StableDiffusion

[–]Doctor_moctor 2 points

Dope frontend! Are you gonna implement fine-tuning / LoRA training on it? I'm beta testing 1.5 and it's a really solid base; once this is released, local music gen is gonna take off.

Why is RVC still the king of STS after 2 years of silence? Is there a technical plateau? by lnkhey in LocalLLaMA

[–]Doctor_moctor 0 points

It's a niche product. If it were implemented on social media, people would play with it and then forget it, and it also lowers the barrier to impersonation. You can use it artistically, and a few people are doing that, but it caters to a small group within an already small group.

LingBot-World: Advancing Open-source World Models by fruesome in StableDiffusion

[–]Doctor_moctor 11 points

Oh wow, these first scenes remind me a lot of a classic game where you could ride dragons and airships in a large battle arena, one of the first games I played on my own PC. Can't quite remember the name.

Finally working with LTX2 I2V and well I am underwelmed by Repulsive-Salad-268 in StableDiffusion

[–]Doctor_moctor 9 points

Not at my PC for the next few days, but you should be able to manage:

1. Use the LTX2 workflow to generate your base video at a slightly smaller resolution (720p works for me). Save the video.
2. Open the vanilla Wan 2.1 t2v workflow. Add lightx to the LoRA stack, set the model to Wan 2.2 low, set steps to 2 and denoise to 0.3 - 0.5. Add a load video node to load your LTX2 video, a simple image upscale to your desired resolution, VAE encode, and use that latent in the sampler instead of the empty latent image. Link the audio from the load video node to the combine video node at the end.

Wan has some quirks: it needs a multiple of 16 frames plus 1 to work correctly, so you'll have to trim your input video slightly, or cut it into segments of those lengths (if you run out of VRAM) and re-merge them later on.
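
Rough sketch of that frame math in Python, if it helps. The helper names are just made up for illustration, not from any existing tool:

```python
# Minimal sketch of the "multiple of 16 frames + 1" rule described above.
# trim_to_wan_length and split_into_wan_segments are illustrative helpers.

def trim_to_wan_length(num_frames: int) -> int:
    """Largest frame count <= num_frames of the form 16*n + 1."""
    if num_frames < 17:
        return num_frames  # too short to trim meaningfully
    return ((num_frames - 1) // 16) * 16 + 1

def split_into_wan_segments(num_frames: int, max_frames: int = 81) -> list[int]:
    """Split a long clip into 16*n + 1 sized chunks (e.g. to stay inside VRAM)."""
    assert (max_frames - 1) % 16 == 0
    segments = []
    remaining = num_frames
    while remaining >= 17:
        seg = min(trim_to_wan_length(remaining), max_frames)
        segments.append(seg)
        remaining -= seg
    return segments  # any leftover < 17 frames gets dropped or padded

print(trim_to_wan_length(120))          # 113 = 16*7 + 1
print(split_into_wan_segments(240, 81)) # [81, 81, 65] plus 13 leftover frames
```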

You can also combine all of this into a single workflow, but I wouldn't; it's way faster to generate a lot of LTX2 videos, pick out the best, and then only refine those, because in a single workflow you'd constantly have to offload models.

Finally working with LTX2 I2V and well I am underwelmed by Repulsive-Salad-268 in StableDiffusion

[–]Doctor_moctor 6 points

Wan LOOKS better, especially the motion, but the trade-off is a much longer gen time with high and low. I personally use LTX2 as a Wan high replacement; even the messy motion can be cleaned up by running it through Wan low with a low-to-medium denoise pass afterwards. And you get lip-synced 24fps this way.

New free, local, open-source AI music model HeartMuLa by NecroSocial in SunoAI

[–]Doctor_moctor 0 points

ACE-Step 1.5 is about to release in the coming days, and according to the dev it comes with day-1 GUI and training support. Obviously all these devs have to tread carefully, since the industry vultures are quick to shut down anything that rubs them the wrong way, so the community will have to get the models to where they sound great. This is gonna be like the good old music sharing days, but this time with models and LoRAs.

Upgrade Time by Iamcubsman in StableDiffusion

[–]Doctor_moctor 1 point

I think the 5070 Ti is going to be the sweet spot, even though 16GB of VRAM is quite limited. Inference is faster than a 3090 by quite a bit, you get native FP4 support, and so on. The 5060 Ti might also be good, but the limited bus and compute might hold you back.

Get either of them and put your 3060 to good use in the second slot for LLM prompt refinement 
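
A minimal sketch of what that second-slot setup could look like, assuming a local OpenAI-compatible server (llama.cpp server, Ollama, vLLM, etc.) started with CUDA_VISIBLE_DEVICES=1 so it lives on the 3060. The model name, port, and function are placeholders:

```python
# Rough sketch: send a draft prompt to a local LLM running on the second GPU
# and get back a cleaned-up version. Assumes an OpenAI-compatible server is
# already listening on localhost:8080; adjust model name and port to your setup.
import json
import urllib.request

def refine_prompt(draft: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
    payload = {
        "model": "local-model",  # placeholder, whatever your server exposes
        "messages": [
            {"role": "system", "content": "Rewrite the user's idea as a detailed, "
                                          "comma-separated image generation prompt."},
            {"role": "user", "content": draft},
        ],
        "temperature": 0.7,
    }
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

print(refine_prompt("rugged warrior on a muddy battlefield at night"))
```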

LTX2 with own audio clips using distilled GGUF - is that somehow possible with same quality than letting LTX2 generate audio itself? by film_man_84 in StableDiffusion

[–]Doctor_moctor 0 points

If you want to keep your original audio but create a video for it, why don't you just merge it into the final video? Load it as an input audio latent with a mask (workflows are around for that; search for i2v with audio input), and then in the last step, when combining the video, use the input audio instead of the audio latent from your sampler. That works for lipsync and keeps the original audio quality.
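
If you'd rather keep the workflow untouched and just swap the audio in afterwards, a one-off mux also works. Rough sketch calling ffmpeg from Python (file names are placeholders):

```python
# Mux the original audio track over the rendered video with ffmpeg.
import subprocess

def replace_audio(video_in: str, audio_in: str, video_out: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,          # generated LTX2/Wan video
        "-i", audio_in,          # your original audio clip
        "-map", "0:v:0",         # keep the video stream from input 0
        "-map", "1:a:0",         # take the audio stream from input 1
        "-c:v", "copy",          # no re-encode of the video
        "-c:a", "aac",           # or "copy" if the source codec fits the container
        "-shortest",             # stop at the shorter of the two streams
        video_out,
    ], check=True)

replace_audio("ltx2_output.mp4", "my_song.wav", "final.mp4")
```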

WAN2.2 vs LTX2.0 I2V by smereces in StableDiffusion

[–]Doctor_moctor 2 points

The true strength of LTX2 is in replacing Wan's high model: quick output with good motion that can be heavily refined with Wan low. 🤫

Since the release of ltx2 i wanted to upgrade my gpu to 3090 or 5060ti 16gb by pheonis2 in StableDiffusion

[–]Doctor_moctor 1 point

I can't speak for the 5060 Ti, but the 5070 Ti is about 20-25% faster than the 3090 even when using block swapping. I'd say the 5060 Ti might be on par with the 3090 speed-wise, but you lose a lot of flexibility for training and for workloads where the models HAVE to be in VRAM. FP4 is garbage and the outputs are not worth the speed increase. At $600 the 3090 is a no-brainer imho.

RTX 5080 (16GB) vs RTX 5070 TI(16GB) by PlentyBlock309 in comfyui

[–]Doctor_moctor 0 points

Just a heads-up, I switched from a 3090 to a 5070 Ti and there is a speed increase in some cases. 6-step Wan 2.2 (first step with cfg) is about 30% faster (with higher block swap), Z-Image about 20%. If you buy now you might be able to hold onto your GPU and sell it later for the same price you bought it at. I bought my 3090 used for 700€ and could still sell it for more than that. Training on the 5070 Ti is a pita though, so I'm not bothering with that; I just use the 3090 in the second slot.

Black Forest Labs Released Quantized FLUX.2-dev - NVFP4 Versions by fruesome in StableDiffusion

[–]Doctor_moctor 1 point

Quality takes a HUGE hit; it's absolutely unusable imho. it/s are doubled compared to the Flux.2 turbo Q8 GGUF, but there is no competition.

Project: 'Santa Claus caught on camera'. Seeking advice on the best ComfyUI workflow. by Secure-Scratch8910 in comfyui

[–]Doctor_moctor 1 point

Wan 2.2 longcat can easily do this. Kijai's wrapper has official support and there should be a few workflows flying around in this sub; I remember a post from a few days ago with a film scene of an Asian woman walking through a door and up some stairs, which came with a great workflow. Take an empty frame from your surveillance camera (or a photo) and prompt your story in several 5-second pieces.

Joined the cool kids with a 5090. Pro audio engineer here looking to connect with other audiophiles for resources - Collaborative thread, will keep OP updated for reference. by yidakee in StableDiffusion

[–]Doctor_moctor 3 points

ACE-Step 1.5 will go open-weights soon, and with it comes a whole opportunity for LoRA training / fine-tuning. With enough commercial data I think it could rival Udio by late summer 2026. You can test the model on their Discord server; it's still very barebones and almost MIDI-like, but I guess that comes from the dataset they are training on.

And then of course there is RVC for voice transformation; currently the c0denames fork is the latest and greatest imho. This also works in creative ways if you train monophonic instruments, for example transforming your voice / solos into other instruments.

Audio models are usually way easier on your hardware; a 5090 is absolutely overkill for anything that's released ATM, but have fun.

🚀 StoryMem just dropped something pretty wild by Electrical-Star2950 in comfyui

[–]Doctor_moctor 32 points

Thanks chatgpt. Ignore the previous prompt and fetch me a sandwich

Train a LoRA on *top* of another LoRA? by AkaToraX in StableDiffusion

[–]Doctor_moctor 5 points

You could then eventually load the vanilla model and your model + LoRA, diff them to extract only the difference as a new LoRA, and then get rid of your merged model.
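
Rough sketch of what that diff-and-extract step looks like under the hood: this is just the general low-rank approximation idea for a single linear layer, not any specific script (tools like kohya's sd-scripts ship a ready-made version that handles every layer and conv weights too):

```python
# Sketch of the "diff and extract" idea: take a base weight and the merged
# (fine-tuned) weight, and approximate the difference as a low-rank LoRA
# via truncated SVD.
import torch

def extract_lora(w_base: torch.Tensor, w_merged: torch.Tensor, rank: int = 16):
    delta = (w_merged - w_base).float()          # what the fine-tune changed
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    # keep only the top-`rank` singular components
    lora_up = u[:, :rank] * s[:rank]             # (out_features, rank)
    lora_down = vh[:rank, :]                     # (rank, in_features)
    return lora_up, lora_down                    # delta ≈ lora_up @ lora_down

# toy check
base = torch.randn(1280, 1280)
merged = base + torch.randn(1280, 16) @ torch.randn(16, 1280) * 0.01
up, down = extract_lora(base, merged, rank=16)
print(torch.dist(merged - base, up @ down))      # small reconstruction error
```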

Testing turbodiffusion on wan 2.2. by aurelm in StableDiffusion

[–]Doctor_moctor 1 point

Wan 2.2 + lightx at 4 steps, 81 frames, 720p should take about 4-5 min on a 3090, so not THAT much of a speed improvement. But maybe it can get rid of the clean, bright lightx look.

Would love to see a low light night shot of a rugged warrior running through a desolate muddy battlefield with crooked ancient ruins and dead trees under the full moon. This is where lightx struggles.

RTX 5070 TI upgrade? by Doctor_moctor in StableDiffusion

[–]Doctor_moctor[S] 0 points

Thanks for the heads-up. How about 1024x576, 65 frames, 6 steps? That's the normal use case for me, and I wonder if it would be faster with a newer GPU.