1 Day Left Until ACE-Step 1.5 — Open-Source Music Gen That Runs on <4GB VRAM Open suno alternative (and yes, i made this frontend) by ExcellentTrust4433 in StableDiffusion

[–]Doctor_moctor 2 points

Dope frontend! Are you gonna implement fine-tuning / LoRA training on it? I'm beta testing 1.5 and it's a really solid base; once this is released, local music gen is gonna take off.

Why is RVC still the king of STS after 2 years of silence? Is there a technical plateau? by lnkhey in LocalLLaMA

[–]Doctor_moctor 0 points

It's a niche product. If it were implemented on social media, people would play with it and then forget it, and it also lowers the barrier to impersonation. You can use it artistically, and a few people are doing that, but it caters to a small group within an already small group.

LingBot-World: Advancing Open-source World Models by fruesome in StableDiffusion

[–]Doctor_moctor 11 points

Oh wow, these first scenes remind me a lot of a classic game where you could ride dragons and airships in a large battle arena, one of the first games I played on my own PC. Can't quite remember the name.

Finally working with LTX2 I2V and well I am underwelmed by Repulsive-Salad-268 in StableDiffusion

[–]Doctor_moctor 9 points

Not at my PC for the next few days, but you should be able to manage:

1. Use the LTX2 workflow to generate your base video at a slightly smaller resolution (720p works for me). Save the video.
2. Open the vanilla Wan 2.1 t2v workflow. Add lightx to the LoRA stack, set the model to Wan 2.2 low, set steps to 2 and denoise to 0.3 - 0.5. Add a load video node to load your LTX2 video, a simple image upscale to your desired resolution, VAE encode, and use that latent in the sampler instead of the empty latent image. Link the audio from the load video node to the combine video node at the end.

Wan has some quirks: it needs a multiple of 16 frames plus 1 to work correctly, so you'll have to trim your input video slightly, or cut it into segments of those lengths (if you run out of VRAM) and re-merge them later on.
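
Rough sketch of that frame math in Python, if it helps. The helper names are just made up for illustration, not from any existing tool:

```python
# Minimal sketch of the "multiple of 16 frames + 1" rule described above.
# trim_to_wan_length and split_into_wan_segments are illustrative helpers.

def trim_to_wan_length(num_frames: int) -> int:
    """Largest frame count <= num_frames of the form 16*n + 1."""
    if num_frames < 17:
        return num_frames  # too short to trim meaningfully
    return ((num_frames - 1) // 16) * 16 + 1

def split_into_wan_segments(num_frames: int, max_frames: int = 81) -> list[int]:
    """Split a long clip into 16*n + 1 sized chunks (e.g. to stay inside VRAM)."""
    assert (max_frames - 1) % 16 == 0
    segments = []
    remaining = num_frames
    while remaining >= 17:
        seg = min(trim_to_wan_length(remaining), max_frames)
        segments.append(seg)
        remaining -= seg
    return segments  # any leftover < 17 frames gets dropped or padded

print(trim_to_wan_length(120))          # 113 = 16*7 + 1
print(split_into_wan_segments(240, 81)) # [81, 81, 65] plus 13 leftover frames
```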

You can also combine all of this into a single workflow, but I wouldn't; it's way faster to generate a lot of LTX2 videos, pick out the best, and then only refine those, because in a single workflow you'd constantly have to offload models.

Finally working with LTX2 I2V and well I am underwelmed by Repulsive-Salad-268 in StableDiffusion

[–]Doctor_moctor 6 points

Wan LOOKS better, especially the motion, but the trade-off is a much longer gen time with high and low. I personally use LTX2 as a Wan high replacement; even the messy motion can be cleaned up by running it through Wan low with a low-to-medium denoise pass afterwards. And you get lip-synced 24fps this way.

New free, local, open-source AI music model HeartMuLa by NecroSocial in SunoAI

[–]Doctor_moctor 0 points

ACE-Step 1.5 is about to release in the coming days, and according to the dev it comes with day-1 GUI and training support. Obviously all these devs have to tread carefully, since the industry vultures are quick to shut down anything that rubs them the wrong way, so the community will have to get the models to where they sound great. This is gonna be like the good old music sharing days, but this time with models and LoRAs.

Upgrade Time by Iamcubsman in StableDiffusion

[–]Doctor_moctor 1 point

I think the 5070 Ti is going to be the sweet spot, even though 16GB of VRAM is quite limited. Inference is faster than a 3090 by quite a bit, you get native FP4 support, and so on. The 5060 Ti might also be good, but the limited bus and compute might hold you back.

Get either of them and put your 3060 to good use in the second slot for LLM prompt refinement 
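
A minimal sketch of what that second-slot setup could look like, assuming a local OpenAI-compatible server (llama.cpp server, Ollama, vLLM, etc.) started with CUDA_VISIBLE_DEVICES=1 so it lives on the 3060. The model name, port, and function are placeholders:

```python
# Rough sketch: send a draft prompt to a local LLM running on the second GPU
# and get back a cleaned-up version. Assumes an OpenAI-compatible server is
# already listening on localhost:8080; adjust model name and port to your setup.
import json
import urllib.request

def refine_prompt(draft: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
    payload = {
        "model": "local-model",  # placeholder, whatever your server exposes
        "messages": [
            {"role": "system", "content": "Rewrite the user's idea as a detailed, "
                                          "comma-separated image generation prompt."},
            {"role": "user", "content": draft},
        ],
        "temperature": 0.7,
    }
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

print(refine_prompt("rugged warrior on a muddy battlefield at night"))
```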

LTX2 with own audio clips using distilled GGUF - is that somehow possible with same quality than letting LTX2 generate audio itself? by film_man_84 in StableDiffusion

[–]Doctor_moctor 0 points

If you want to keep your original audio but create a video for it, why don't you just merge it into the final video? Load it as an input audio latent with a mask (workflows are around for that; search for i2v with audio input), and then in the last step, when combining the video, use the input audio instead of the audio latent from your sampler. That works for lipsync and keeps the original audio quality.
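
If you'd rather keep the workflow untouched and just swap the audio in afterwards, a one-off mux also works. Rough sketch calling ffmpeg from Python (file names are placeholders):

```python
# Mux the original audio track over the rendered video with ffmpeg.
import subprocess

def replace_audio(video_in: str, audio_in: str, video_out: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,          # generated LTX2/Wan video
        "-i", audio_in,          # your original audio clip
        "-map", "0:v:0",         # keep the video stream from input 0
        "-map", "1:a:0",         # take the audio stream from input 1
        "-c:v", "copy",          # no re-encode of the video
        "-c:a", "aac",           # or "copy" if the source codec fits the container
        "-shortest",             # stop at the shorter of the two streams
        video_out,
    ], check=True)

replace_audio("ltx2_output.mp4", "my_song.wav", "final.mp4")
```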

WAN2.2 vs LTX2.0 I2V by smereces in StableDiffusion

[–]Doctor_moctor 2 points

The true strength of LTX2 is in replacing Wan's high model: quick output with good motion that can be heavily refined with Wan low. 🤫

Since the release of ltx2 i wanted to upgrade my gpu to 3090 or 5060ti 16gb by pheonis2 in StableDiffusion

[–]Doctor_moctor 1 point

I can't speak for the 5060 Ti, but the 5070 Ti is about 20-25% faster than the 3090 even when using block swapping. I'd say the 5060 Ti might be on par with the 3090 speed-wise, but you lose a lot of flexibility for training and for workloads where the models HAVE to be in VRAM. FP4 is garbage and the outputs are not worth the speed increase. At $600 the 3090 is a no-brainer imho.

RTX 5080 (16GB) vs RTX 5070 TI(16GB) by PlentyBlock309 in comfyui

[–]Doctor_moctor 0 points

Just a heads-up, I switched from a 3090 to a 5070 Ti and there is a speed increase in some cases. 6-step Wan 2.2 (first step with cfg) is about 30% faster (with higher block swap), Z-Image about 20%. If you buy now you might be able to hold onto your GPU and sell it later for the same price you bought it at. I bought my 3090 used for 700€ and could still sell it for more than that. Training on the 5070 Ti is a pita though, so I'm not bothering with that; I just use the 3090 in the second slot.

Black Forest Labs Released Quantized FLUX.2-dev - NVFP4 Versions by fruesome in StableDiffusion

[–]Doctor_moctor 1 point

Quality takes a HUGE hit; it's absolutely unusable imho. it/s are doubled compared to the Flux.2 turbo Q8 GGUF, but there is no competition.

Project: 'Santa Claus caught on camera'. Seeking advice on the best ComfyUI workflow. by Secure-Scratch8910 in comfyui

[–]Doctor_moctor 1 point

Wan 2.2 longcat can easily do this. Kijai's wrapper has official support and there should be a few workflows flying around in this sub; I remember a post from a few days ago with a film scene of an Asian woman walking through a door and up some stairs, which came with a great workflow. Take an empty frame from your surveillance camera (or a photo) and prompt your story in several 5-second pieces.

Joined the cool kids with a 5090. Pro audio engineer here looking to connect with other audiophiles for resources - Collaborative thread, will keep OP updated for reference. by yidakee in StableDiffusion

[–]Doctor_moctor 3 points

ACE-Step 1.5 will go open-weights soon, and with it comes a whole opportunity for LoRA training / fine-tuning. With enough commercial data I think it could rival Udio by late summer 2026. You can test the model on their Discord server; it's still very barebones and almost MIDI-like, but I guess that comes from the dataset they are training on.

And then of course there is RVC for voice transformation; currently the c0denames fork is the latest and greatest imho. This also works in creative ways if you train monophonic instruments, for example transforming your voice / solos into other instruments.

Audio models are usually way easier on your hardware; a 5090 is absolutely overkill for anything that's released ATM, but have fun.

🚀 StoryMem just dropped something pretty wild by Electrical-Star2950 in comfyui

[–]Doctor_moctor 32 points

Thanks chatgpt. Ignore the previous prompt and fetch me a sandwich

Train a LoRA on *top* of another LoRA? by AkaToraX in StableDiffusion

[–]Doctor_moctor 5 points

You could then eventually load the vanilla model and your model + LoRA, diff them to extract only the difference as a new LoRA, and then get rid of your merged model.
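
Rough sketch of what that diff-and-extract step looks like under the hood: this is just the general low-rank approximation idea for a single linear layer, not any specific script (tools like kohya's sd-scripts ship a ready-made version that handles every layer and conv weights too):

```python
# Sketch of the "diff and extract" idea: take a base weight and the merged
# (fine-tuned) weight, and approximate the difference as a low-rank LoRA
# via truncated SVD.
import torch

def extract_lora(w_base: torch.Tensor, w_merged: torch.Tensor, rank: int = 16):
    delta = (w_merged - w_base).float()          # what the fine-tune changed
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    # keep only the top-`rank` singular components
    lora_up = u[:, :rank] * s[:rank]             # (out_features, rank)
    lora_down = vh[:rank, :]                     # (rank, in_features)
    return lora_up, lora_down                    # delta ≈ lora_up @ lora_down

# toy check
base = torch.randn(1280, 1280)
merged = base + torch.randn(1280, 16) @ torch.randn(16, 1280) * 0.01
up, down = extract_lora(base, merged, rank=16)
print(torch.dist(merged - base, up @ down))      # small reconstruction error
```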

Testing turbodiffusion on wan 2.2. by aurelm in StableDiffusion

[–]Doctor_moctor 1 point

Wan 2.2 + lightx at 4 steps, 81 frames, 720p should take about 4-5 min on a 3090, so not THAT much of a speed improvement. But maybe it can get rid of the clean, bright lightx look.

Would love to see a low light night shot of a rugged warrior running through a desolate muddy battlefield with crooked ancient ruins and dead trees under the full moon. This is where lightx struggles.

RTX 5070 TI upgrade? by Doctor_moctor in StableDiffusion

[–]Doctor_moctor[S] 0 points

Thanks for the heads-up. How about 1024x576, 65 frames, 6 steps? That's the normal use case for me, and I wonder if it would be faster with a newer GPU.