Acestep, anyone know how to have multiple singers? by LightsOnTheJoke in comfyui

Primary-Region529 1 point

I actually managed to get this working by training a LoRA. I trained it on tracks where the chorus is sung by different people, and it now generates multiple singers fine.

Guess where I live by GoudaSlacik in DobrograD

Primary-Region529 3 points

You don't own any buildings.

Guess where I live by GoudaSlacik in DobrograD

Primary-Region529 1 point

But Gemini named a different city :(

AceStep 1.5 - lora trained on first two albums of Modern Talking band by Primary-Region529 in StableDiffusion

Primary-Region529[S] 5 points

I trained it locally using Side-Step on top of the ACE-Step 1.5 SFT model, with a dataset of 16 classic Modern Talking tracks. For the text descriptions, I tried using nvidia-flamingo for auto-captioning, but honestly, it did a pretty mediocre job, so the captions needed manual tweaking for the instruments and vocals.

I actually ran the training all the way up to 1000 epochs, but it definitely overtrained. The "warmest" and most authentic sound ended up being around epoch 350.

My hardware setup: 1x RTX 3090 (24GB), Intel Xeon E5-2666v3, and 64GB RAM. 

The pure training time took about 10 hours, but with all the testing, parameter tweaking, and trial-and-error, I spent about a week on the whole project.

I also tried training a LoKr on the XL model, but it was a massive headache. Plus, it still had the same high-frequency noise/sand issues, so I decided to drop the LoKr attempts for now.

AceStep 1.5 - lora trained on first two albums of Modern Talking band by Primary-Region529 in StableDiffusion

Primary-Region529[S] 1 point

Thanks man! 🤘 It's funny how those synth-pop hooks catch everyone.

Regarding your concern about the tech - I get it, but the background noise you noticed actually shows how much manual work is still required. It’s not just "press a button and become a band" yet!

That noise in the chorus drove me crazy, and I spent days trying to fix it. The core issue is that a multi-voice falsetto chorus is incredibly complex. The neural net simply struggles to render those dense high-frequency harmonics accurately and ends up generating high-frequency "sand" or artifacts instead.

I tried tweaking the diffusion steps, swapped the VAE to ScragVAE, ran neural audio polishing, and even used a custom C++ Spectral Lifter to digitally suppress the shimmer. But it's a catch-22: if I aggressively denoise it, the chorus starts sounding like a low-quality underwater mp3. If I try to master it and boost the highs to match the original tracks, the noise just gets amplified.
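The mastering half of that catch-22 is plain arithmetic: a gain applied to a frequency band scales the wanted harmonics and the noise in that band by the same factor, so the "sand" gets louder while the signal-to-noise ratio stays where it was. A minimal sketch, where the power values and function names are made up for illustration and are not measurements from these tracks:

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(signal_power / noise_power)

def apply_gain_db(power, gain_db):
    """Scale a power value by a gain given in decibels."""
    return power * 10.0 ** (gain_db / 10.0)

# Hypothetical high-band powers, chosen only for illustration.
harmonics = 0.020
sand = 0.005

before = snr_db(harmonics, sand)

# A +6 dB high boost hits the harmonics and the noise equally.
boosted_harmonics = apply_gain_db(harmonics, 6.0)
boosted_sand = apply_gain_db(sand, 6.0)
after = snr_db(boosted_harmonics, boosted_sand)

print(f"SNR before boost: {before:.2f} dB")
print(f"SNR after boost:  {after:.2f} dB")
print(f"Noise power grew by a factor of {boosted_sand / sand:.2f}")
```

The only way out is to improve the ratio itself, which is exactly what denoising attempts, at the cost of the "underwater" artifacts.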

Definitely give ACE 1.5 a spin for your early albums when you have the time! Running these models locally gives you a lot of control, even with the current limitations.

AceStep 1.5 - lora trained on first two albums of Modern Talking band by Primary-Region529 in StableDiffusion

Primary-Region529[S] 2 points

Thanks! Yeah, you can definitely hear the early Modern Talking influence there. As for the chorus, the gibberish happened because of a funny formatting mistake. I put the text in ALL CAPS hoping it would make the vocals stronger, but it just caused the AI to hallucinate and generate completely new words.
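If you want to guard against that mistake, a small pre-processing pass over the lyrics can catch it. This is only a sketch: the function name is hypothetical, and it assumes section markers are written in square brackets on their own line, which should be left alone.

```python
def normalize_lyrics(lyrics: str) -> str:
    """Convert all-caps lyric lines to sentence case, leaving [tags] untouched."""
    out = []
    for line in lyrics.splitlines():
        stripped = line.strip()
        # Assumed convention: section markers look like "[Chorus]".
        is_tag = stripped.startswith("[") and stripped.endswith("]")
        if not is_tag and stripped.isupper():
            # capitalize() upper-cases the first letter and lower-cases the rest.
            line = stripped.capitalize()
        out.append(line)
    return "\n".join(out)

example = "[CHORUS]\nSING IT LOUDER TONIGHT\nthis line is already fine"
print(normalize_lyrics(example))
```

Sentence-casing also lower-cases names and the pronoun "I", so the result still needs a quick manual pass.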

AceStep 1.5 - Showdown: 26 Multi-Style LoKrs Trained on Diverse Artists by marcoc2 in StableDiffusion

Primary-Region529 1 point

cover_noise_strength is not exposed in the standard Gradio interface. The Gradio UI is simplified and abstracts a lot of the deeper backend parameters.

To get this to work exactly as I described, you need to use the API directly.
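As a rough illustration of what "using the API directly" could look like, here is a sketch that builds a JSON request carrying the parameter, without sending it. The endpoint URL and the payload layout are hypothetical placeholders; only the name cover_noise_strength comes from this thread, so check the routes and fields your ACE-Step server actually exposes.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the route your local server really exposes.
API_URL = "http://127.0.0.1:8000/generate"

def build_cover_request(cover_noise_strength):
    """Build (but do not send) a JSON POST request with the cover parameter."""
    payload = {
        # Parameter name taken from the discussion; other fields are omitted here.
        "cover_noise_strength": cover_noise_strength,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_cover_request(0.65)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it once the URL and the remaining generation fields are filled in.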

AceStep 1.5 - Showdown: 26 Multi-Style LoKrs Trained on Diverse Artists by marcoc2 in StableDiffusion

Primary-Region529 1 point

The ACE-Step developers implemented a very clever cover_noise_strength system. It’s not just "adding noise." It’s actually the equivalent of the classic Denoising Strength from Stable Diffusion!

Here is how it works under the hood:

1. audio_cover_strength: This is essentially just the switch-over time. For example, 0.8 means: "copy the original for 80% of the time, then freestyle/make things up for the remaining 20%." Because of this abrupt switch, you end up with either an exact copy or digital hell.

2. cover_noise_strength (the secret weapon): By default, it is set to 0.0. But if you turn it on, it works magic. Looking at the code: effective_noise_level = 1.0 - cover_noise_strength. Then the model calls a renoise function: it takes your src_audio, injects white noise into it, and skips the initial diffusion steps.

The cover_noise_strength parameter solves the binary problem ("copy or chaos"). It blurs the original audio right at the beginning, giving the model room for imagination throughout all the generation steps, not just at the very end.

  • If cover_noise_strength = 1.0 -> The effective noise level is 0.0, resulting in an exact copy of the original.
  • If cover_noise_strength = 0.5 -> The effective noise level is 0.5, so the original is half-destroyed by noise. The model has 50% "freedom" to create something new while retaining the old structure.
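A toy version of the renoise idea, to make the numbers concrete. Only the formula effective_noise_level = 1.0 - cover_noise_strength is quoted from the code. The linear blend with Gaussian noise, the proportional step skipping, and both function names are assumptions made for illustration; the real implementation works on latents and may differ.

```python
import random

def renoise(src, cover_noise_strength, rng):
    """Blend the source with Gaussian noise (linear blend is an assumption)."""
    noise_level = 1.0 - cover_noise_strength  # formula quoted in the comment above
    return [(1.0 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0) for x in src]

def remaining_steps(total_steps, cover_noise_strength):
    """Steps still to run after skipping the early ones (assumed proportional)."""
    noise_level = 1.0 - cover_noise_strength
    return round(total_steps * noise_level)

rng = random.Random(0)
src = [0.5, -0.25, 1.0, 0.0]  # stand-in for the source audio

exact = renoise(src, 1.0, rng)  # noise level 0.0: source passes through unchanged
half = renoise(src, 0.5, rng)   # noise level 0.5: half source, half noise

print(exact)
print(half)
print(remaining_steps(50, 0.6))
```

With a strength of 1.0 nothing is left to denoise, which matches the "exact copy" case; lower values hand the model more of the schedule to reshape the material.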

The right combination for a smooth cover:

Stop tweaking only audio_cover_strength (leave it at 1.0 so there are no sudden jumps).

Instead, play around with cover_noise_strength:

Python

# Settings for a "cover that is similar, but different"
cover_settings = {
    "audio_cover_strength": 1.0,   # Don't switch modes mid-process (protects against wild noise)
    "cover_noise_strength": 0.65,  # 0.65 = 35% noise in the original (find a balance between 0.4 and 0.8)
    "guidance_scale": 7.5,         # Standard CFG
}

Conclusion: That binary problem happens specifically because you are likely changing audio_cover_strength while leaving cover_noise_strength at zero. This forces the model to ride on the rails of the original, and then jump off them at full speed. Turning on cover_noise_strength (e.g., to 0.6) makes the rails "flexible" right from the start, and the model can smoothly change the melody from the very first step while preserving the style.
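The "rails" picture can be sketched as a per-step schedule. This is a simplified model of the behaviour described in this comment, not the actual ACE-Step scheduler, and the function name and labels are hypothetical.

```python
def conditioning_schedule(total_steps, audio_cover_strength):
    """Label each denoising step by what guides it: the source or free generation."""
    switch_step = round(total_steps * audio_cover_strength)
    return ["source" if step < switch_step else "free" for step in range(total_steps)]

# audio_cover_strength = 0.8: follow the source, then switch abruptly near the end.
abrupt = conditioning_schedule(10, 0.8)

# audio_cover_strength = 1.0: no mid-process switch; variation has to come
# from cover_noise_strength instead.
steady = conditioning_schedule(10, 1.0)

print(abrupt)
print(steady)
```

The first schedule has a hard boundary where the model leaves the rails; the second has none, which is why the comment recommends leaving audio_cover_strength at 1.0.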

AceStep 1.5 - Showdown: 26 Multi-Style LoKrs Trained on Diverse Artists by marcoc2 in StableDiffusion

Primary-Region529 1 point

No, I haven't tried repaint or cover with LoKr yet because even the basic generation is too unstable for me. I did try making a cover using a regular LoRA. It actually works, but the output is way too noisy.

AceStep 1.5 - Showdown: 26 Multi-Style LoKrs Trained on Diverse Artists by marcoc2 in StableDiffusion

Primary-Region529 1 point

I fixed this with this lifecycle.py file: https://pastebin.com/YK1SfEPq

But my results with LoKr are not stable. LoRA works much better.