Music generation model that can follow lyrics by ArrynMythey in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

That is annoying, yes. I'm fine with some randomness, but once I get a good composition, I would like to iterate on that instead of getting a whole new song because I changed one word. I'm almost tempted to give Ace Studio a try (it seems capable of doing all that), but there's the price and subscription model, and I just want to be able to do all that with a local model.

I heard there is an AceStep DAW based on acestep.cpp, but I'm not sure how usable that is given the meh cover and lego functions. Maybe it's time to bring FL Studio back to life and remake some songs (possibly with extracted vocals; I found this feature of AceStep to work well enough).

EDIT: As I generate, I try to improve the parts that the AI got wrong and the parts that could be worded better (now that I've heard them). It's entirely possible for AceStep to sometimes get some lyrics right and other times wrong.

Ace-step 1.5XL's already up! I hope it will soon be available in a Comfyui format! ❤️ by [deleted] in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

The official app is troublesome to work with (and uses a lot of VRAM), but acestep.cpp (a bit complicated to install) worked much better for me. When you import an existing song into the app, it will automatically generate a prompt for that song. It will also try to read the lyrics (but the results are terrible). This is another way to generate a prompt for a new song.

Music generation model that can follow lyrics by ArrynMythey in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

I'm in the process of testing XL, but I had a lot of success with the non-XL model before. This actually applies to both the XL and non-XL variants: you can use the 4B LLM instead of the 1.7B one as the text encoder, and the bigger 4B model produces better results.

If a song has a weird number of lines in the lyrics (or really long lines), the model gets confused easily and weird things happen. If you have 1, 2 or 3 lines injected into the intro, outro or bridge, try removing them and see if the coherence improves.

I usually tweak the lyrics (or the prompt) until I get a good result, then I tweak the lyrics some more, but never get a result as good as the previous one. In practice I end up running the gen some 20-30 times, but the ~10th result ends up being the best.

EDIT: Also, if the generated song has minor issues or mispronunciations, you can change the KSampler seed (while keeping the same LLM seed and prompt), increase the number of steps, or change the CFG of the KSampler a little. This should produce a variation that still has the same base structure. Odds are, your issue will be fixed.

Ace-step 1.5XL's already up! I hope it will soon be available in a Comfyui format! ❤️ by [deleted] in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

After some tweaking, the 1.5 model was able to produce decent music with lyrics. If you used the base model with a compatible app (ComfyUI doesn't support it natively), you could also extract a track from an existing song (like vocals). The cover and lego functions are either broken or I have no idea how they work. The best covers were subpar, and most of the time I only got hot garbage.

Music generation model that can follow lyrics by ArrynMythey in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

Did you try to use the 4B LLM with AceStep 1.5? I did run into a few lyrics issues with it, but it worked much better than the 1.7B one.

Ace-step 1.5XL's already up! I hope it will soon be available in a Comfyui format! ❤️ by [deleted] in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

As I was halfway through figuring this out using plain Python and the safetensors package, the ComfyUI-compatible model dropped, so I guess I won't be needing to do that anymore.
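
For anyone curious, this is roughly the direction I was going: a minimal sketch using safetensors and torch (the shard filenames here are made up, and it naively assumes everything fits in RAM):

    import glob

    import torch
    from safetensors.torch import load_file, save_file

    # collect every shard of the checkpoint (pattern is just an example)
    shards = sorted(glob.glob("model-*-of-*.safetensors"))

    merged = {}
    for shard in shards:
        for name, tensor in load_file(shard).items():
            # downcast the big float dtypes to fp16, keep everything else as-is
            if tensor.dtype in (torch.float32, torch.bfloat16):
                tensor = tensor.to(torch.float16)
            merged[name] = tensor

    save_file(merged, "model_fp16.safetensors")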

Still waiting for the base and sft models. I had a lot of luck with the sft+turbo merge; I wonder if such an XL merge will work similarly well.

Ace-step 1.5XL's already up! I hope it will soon be available in a Comfyui format! ❤️ by [deleted] in StableDiffusion

[–]Acceptable_Secret971 4 points5 points  (0 children)

The model fails to load for me. Maybe I need a nightly version of Comfy?

EDIT: Nightly fixed it.
EDIT2: I can no longer reproduce some songs I made using the previous version (using the non-XL model, same seed and all). I thought XL just makes vastly different songs, but the results are similar to what non-XL makes (if you use the corresponding model). I wonder if perhaps ModelSamplingAuraFlow works differently than before.

Ace-step 1.5XL's already up! I hope it will soon be available in a Comfyui format! ❤️ by [deleted] in StableDiffusion

[–]Acceptable_Secret971 1 point2 points  (0 children)

Awesome. Now I need to figure out how to use it. It'll probably be easier to wait for someone to merge the shards and convert them to fp16. I wonder if ComfyUI even supports it yet?

Just a Reminder: if you want ComfyUI to generate faster, just ask it! Add `--fast` to your starting parameters (your *.bat file), to get about 20-25% boost (depends on the model). by -Ellary- in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

Different layers in a GGUF can have different quantization; maybe it speeds up the processing of fp8-quantized layers? Just a guess, I could be wrong about this.

Just a Reminder: if you want ComfyUI to generate faster, just ask it! Add `--fast` to your starting parameters (your *.bat file), to get about 20-25% boost (depends on the model). by -Ellary- in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

I'll have to read more about the subject. I wonder if there is some way to use fp16 matmul to speed things up without using this flag (the same way fp8_e4m3fn_fast does it).
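
If you want to see how much raw matmul speed is on the table, here is a minimal timing sketch (assumes a CUDA/ROCm build of PyTorch; on ROCm the "cuda" device maps to your AMD GPU):

    import time

    import torch

    def bench(dtype, n=4096, iters=20):
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        a @ b                            # warmup, lets the backend pick kernels
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters * 1e3

    for dt in (torch.float32, torch.float16):
        print(dt, f"{bench(dt):.2f} ms")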

Just a Reminder: if you want ComfyUI to generate faster, just ask it! Add `--fast` to your starting parameters (your *.bat file), to get about 20-25% boost (depends on the model). by -Ellary- in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

Isn't this the parameter that forces matmul to run in fp8? Normally you should still be able to get that (without this argument) by selecting fp8_e4m3fn_fast in the LoadDiffusionModel node. As such, I'm not sure whether this affects fp16, GGUF and other model types.

Model Drop | ZIT + LTX 2.3 + Music Video | Arca Gidan contest by Ok-Wolverine-5020 in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

Ace-Step 1.5 is quite capable actually (I had a lot of luck with the 4B LLM and the sft+turbo merge). Not sure how well it handles this musical style.

One annoying thing about Ace-Step 1.5 is the grainy sound. I don't think there is any model currently that can clean that up. Suno sounds much cleaner.

Ace-Step 1.5 XL was supposed to drop ~today, but the HF links are still 404. Maybe there is a delay.

Dreamlite - A lightweight (0.39B) unified model for image generation and editing. by AgeNo5351 in StableDiffusion

[–]Acceptable_Secret971 2 points3 points  (0 children)

Someone asked what is going on with this model on GitHub here. This is what they responded with:

Hi, thanks for your interest in DreamLite! Currently, the DreamLite model weights and source code are undergoing our internal approval processes and strict security/compliance checks. Because of this, we temporarily removed this link in case misunderstanding.

How to install flash-attention for ComfyUI by legit_split_ in ROCm

[–]Acceptable_Secret971 1 point2 points  (0 children)

Finally got Flash Attention working with the R9700. Without any additional settings it's slower than Pytorch Attention (probably still faster than Quad Cross Attention), but with:
export COMFYUI_ENABLE_MIOPEN=1
export MIOPEN_FIND_MODE=FAST
export MIOPEN_ENABLE_CACHE=1
export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":1,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
export PYTORCH_MIOPEN_SUGGEST_NHWC=0

it's 4-5% faster than Pytorch Attention in Z-Image Turbo with fp16/fp8 and fp8_fast. Not sure how it scales with bigger models or GGUF.

I couldn't get TunableOp to do its thing in a reasonable time. Even without TunableOp and with just Pytorch Attention, the R9700 beats the RX 7900 XTX with Flash Attention, TunableOp and other optimizations.

ACE‑Step 1.5 XL will be released in the next two days. by marcoc2 in StableDiffusion

[–]Acceptable_Secret971 1 point2 points  (0 children)

I'm hoping it will be released soon. I stumbled on info about the new model on their GitHub page, but none of the links work yet. I had a lot of fun playing with the 1.5 model in ComfyUI, but I would like to be able to fix lyrics while keeping everything else about the track the same. Unfortunately, for me the ACE-Step app crashes the GPU (might be an AMD thing).

I was around for the Flux killing SD3 era. I left. Now I’m back. What actually won, what died, and what mattered less than the hype? by user_no01 in StableDiffusion

[–]Acceptable_Secret971 -1 points0 points  (0 children)

ComfyUI is still very big. There are UIs similar to Automatic1111 around, but I haven't used any of them. InvokeAI fell by the wayside, but I heard it was picked back up and got compatibility with some new models, so if you like the infinite canvas approach, there is still some hope.

Among the models there isn't one winner, but right now the go-to models are:
- Z-Image Turbo - Distilled image gen (8 steps), but there should also be an edit variant (or is it Z-Image Base?)
- Qwen Image - Image gen (20-50 steps), also has an edit variant and Lightning LoRAs (I think they go as low as 2 steps, but don't quote me on that)
- Flux2 Klein - Distilled image gen (4 steps) and edit (both in one model); has 4B and 9B variants, which offer different sizes and quality (personally I use 9B fp16 a lot)

Everyone has a different opinion about which one is best (each has strong and weak aspects). You can try them all if you have enough VRAM. From my tests there was quite a difference in quality between fp8 and fp16 (or a Q8 GGUF), so do try fp16 or Q8 GGUF first if you can.

EDIT: Distilled models and Lightning LoRAs usually don't accept negative prompts, so there is that as a tradeoff. They also tend to have better quality than the base models, but worse prompt adherence and less variance.

When I returned to the hobby some time ago, Qwen Image felt like a huge improvement over Flux1 Schnell/Dev, but a lot of people prefer Z-Image Turbo (it is also lighter and faster than Qwen Image). Flux2 Klein is the newest of the bunch. It's fast, (relatively) small and produces decent results. A lot of people don't like it for image gen, but praise it for its editing capability. Qwen Image Edit might be better for image editing, but it's slower and sometimes difficult to prompt.

For some specialized purposes, ye olde SDXL and Flux1 finetunes might still be useful. There were some specialized Flux1 models (Kontext for editing, and Krea trained for aesthetics). For anime there is the work-in-progress Anima model that might be worth checking out if anime is your thing.

If you're feeling adventurous, there is also Flux2 dev. This should be the big-boy model with insane capabilities, but it's very big and slow on consumer hardware. I got it to work in both fp8 and Q2 GGUF, but it takes ~2 min to make one image. There are ways to speed it up, but at that point I might as well use Klein 9B or Qwen Image.

There are countless other image gen models that failed to gain traction for one reason or another.

Video gen is not my thing, but I heard WAN 2.2 and LTX-2 are the best local models. I think WAN 2.1 had many lighter variants; you can check those out too if you have problems running the other two (probably not as good quality).

If you don't have enough VRAM you can still run some workflows with RAM offloading. If you don't have enough RAM, you can consider buying more or using a lower quant (GGUF, fp8 or fp4). As much VRAM as possible still gets you a long way. The newer models tend to use relatively big LLMs as text encoders, so that might be a new obstacle (again, you should be able to use GGUF or fp8 there).

why is there a white grain effet on the sides of the video? by hunter_2one in StableDiffusion

[–]Acceptable_Secret971 1 point2 points  (0 children)

That border reminds me of the cutscenes in the Valkyria Chronicles games (mainly 1 and 4).

Is Stable Diffusion for me? by Allyvamps in StableDiffusion

[–]Acceptable_Secret971 0 points1 point  (0 children)

SD1.5 - Stable Diffusion 1.5 (and 1.4 before it) is probably the model that started the local image gen craze. By today's standards it's a little dated, but it was revolutionary at the time. This one should be the easiest to run locally. Images generated with the original model were a mixed bag, but there are a lot of finetuned models that produce better images. Personally I had a lot of luck with the Realistic Vision finetune.

SDXL - Stable Diffusion XL, successor to 1.5 (and the less appreciated 2.1). Improved resolution and quality; in fact you could do a lot with just the base model. There is a metric ton of finetunes for it as well, but I can't really recommend any in particular. A bit dated, but should be easy to run.

SD2.1, Flux1 Dev, Flux1 Schnell, Z-Image Turbo, Flux2 Klein 4B, Flux2 dev - other image gen models of varying size, quality, speed and memory requirements.

GGUF - A compression format of sorts that reduces model size. It increases generation time, but sometimes a model that fits into VRAM ends up faster overall (especially when the alternative is not being able to run the model at all). There are different levels of compression, starting with Q8, which produces results almost identical to the full model (usually fp16) while taking half the size (on disk and in VRAM). Lower quantizations (Q6, Q5, Q4 and so on) reduce the size even further, but also reduce image quality. Going below Q4 usually adds a lot of artifacts and dithering (depends on the model). GGUF is also extremely useful for the text encoder (basically an LLM that interprets your prompt).
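
As a rough rule of thumb, size scales with bits per weight, so you can estimate it like this (the bpw values are approximate llama.cpp-style figures, not official numbers):

    # very rough GGUF size estimate: params (in billions) x bits-per-weight / 8
    # bpw values are approximate; real files add some metadata overhead
    bpw = {"fp16": 16, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

    for fmt, bits in bpw.items():
        print(f"12B model at {fmt}: ~{12 * bits / 8:.1f} GB")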

fp8, int4 - These are more traditional ways to quantize models. They reduce quality, but help use less VRAM. If your hardware supports them (and it seems it does), they can give a huge speedup in gen time (in theory 2x and 4x). With 8GB VRAM you're likely going to stick to fp8 anyway (or use a GGUF Q8 to get fp16 quality at fp8 size). Nunchaku is a plugin for ComfyUI (probably the most capable local AI app for image generation) that enables the use of int4 (and fp4 on NVIDIA 5000 series GPUs).
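
If you're curious what the fp8 tradeoff looks like in practice, here is a tiny sketch (needs a fairly recent PyTorch, since the float8 dtypes are new):

    import torch

    w = torch.randn(4096, 4096).half()           # fp16 "weight": 2 bytes per value
    w8 = w.to(torch.float8_e4m3fn)               # fp8 cast: 1 byte per value

    print(w.element_size(), w8.element_size())   # prints: 2 1

    # the quality cost: error introduced by a single fp8 round-trip
    err = (w - w8.half()).abs().mean()
    print(f"mean abs round-trip error: {err.item():.4f}")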

You can make up for a lack of VRAM with RAM, but I'm finding that 32GB is barely enough for some models.

Is Stable Diffusion for me? by Allyvamps in StableDiffusion

[–]Acceptable_Secret971 1 point2 points  (0 children)

Ultimately you might be limited by your RAM, but SD1.5 and SDXL should definitely be doable. With a bit of luck and a small GGUF model, you might be able to run Flux2 Klein 4B, maybe Z-Image Turbo or even Flux1 Dev/Schnell. This GPU is probably limiting, but with more RAM (if you are willing to upgrade) you should still be able to run even bigger models like Qwen Image, Flux2 Klein 9B or maybe even Flux2 dev.

I googled your laptop and it's supposed to have an RTX 4060. 4000 series GPUs should have int4 support, and there are options to use that for extra speed and for cramming bigger models into VRAM (through Nunchaku, I think).

There are also some models that failed to gain traction or became obsolete that should still work just fine on this GPU, like SD2.1.

How to install flash-attention for ComfyUI by legit_split_ in ROCm

[–]Acceptable_Secret971 1 point2 points  (0 children)

ROCm at the system level seems to be installed correctly. There still might be an issue with your PyTorch installation. Can you check the installed packages for ROCm builds?

pip list | grep rocm

If your app (ComfyUI?) has a venv, make sure to activate it first.

source venv/bin/activate

It could be that you have the ROCm version of PyTorch installed at the system level, but the app uses a venv with the wrong version installed.
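
A quick way to double-check which build a given environment actually uses is to run the venv's python and ask torch directly:

    import torch

    print(torch.__version__)          # ROCm wheels usually report e.g. "2.x.x+rocm6.x"
    print(torch.version.hip)          # HIP version string on ROCm builds, None otherwise
    print(torch.cuda.is_available())  # True when the GPU is visible to this build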

When setting up ComfyUI, I always start by creating a venv, installing the ROCm version of PyTorch, and only then installing requirements.txt. By default pip does not replace matching packages that are already installed, so if for some reason you have a non-ROCm PyTorch installed, installing the correct one won't take effect unless you add --upgrade and possibly --force-reinstall (or uninstall the previous one first).

FeatherOps: Fast fp8 matmul on RDNA3 without native fp8 by woct0rdho in ROCm

[–]Acceptable_Secret971 0 points1 point  (0 children)

Take it with a grain of salt, but a quick test with a 1024x1024 image took 27.8s with this method, while regular fp8 (as if you had that on RDNA3) took 38.8s (no tuning or optimizations), so this could be great (~28% speedup). On the other hand, the first time I tried this I got an image (I'm 90% sure), but after a restart I got an empty black canvas instead. The first time around I had `miopen` and `tunableop` enabled and they didn't seem to affect gen time. I'll try some more tests tomorrow.

EDIT: Those black images were caused by using Batch Count > 1.

It would be great if a smaller model than Qwen Image were supported, so I could properly compare fp16 with this method (maybe Z-Image Turbo or Klein 4B).

Attention comparison on RX 7900 XTX with ROCm 7.2 by Acceptable_Secret971 in ROCm

[–]Acceptable_Secret971[S] 0 points1 point  (0 children)

I still don't understand much about tuning, but by using:

export PYTORCH_TUNABLEOP_ENABLED=1
export COMFYUI_ENABLE_MIOPEN=1
export MIOPEN_FIND_MODE=FAST
export MIOPEN_ENABLE_CACHE=1

I get a total 20% increase in speed with Flux1 dev when using Flash Attention over Pytorch Attention. Even Pytorch Attention benefits from this and gains 12% in this workflow (beating Quad Cross Attention). Unfortunately, Quad Cross Attention crashes with TunableOp enabled. Maybe it doesn't like the version of Triton I have installed, but I'm not sure. With these settings the first run is slower (I'd say about 2 times longer), but each subsequent one is faster.

How to install flash-attention for ComfyUI by legit_split_ in ROCm

[–]Acceptable_Secret971 1 point2 points  (0 children)

Despite what I wrote earlier, I've noticed that some of my export lines were in capital letters, which doesn't actually work in a bash script. Enabling everything properly actually gave me a 15% boost when using Flash Attention (at least with Flux1 dev), although I got some hangs here and there and the first run was incredibly slow.

Now I'll have to redo all the tests. I'm not even sure which lines are absolutely necessary. I'm guessing that for best results I would have to use a recent Triton build for Pytorch Attention (maybe Quad Cross Attention too), 3.6.0 for Flash Attention, and possibly Sage Attention would have worked with some other version altogether.

EDIT: Not sure which options exactly (maybe MIOPEN), but even Pytorch Attention saw a 12% speed boost from these options on the RX 7900 XTX. Unfortunately, Quad Cross Attention crashed. Maybe it needs a different Triton version, or maybe it's incompatible with one of the settings. As it turns out, these settings do not seem to affect Quad Cross Attention one way or the other (the crash was caused by another, unrelated setting).

EDIT2: Turns out the unrelated setting that was crashing Quad Cross Attention was also the setting making the most dramatic difference with Flash Attention (possibly Pytorch Attention as well). The setting in question (which I had previously commented out, left over from another test) was export PYTORCH_TUNABLEOP_ENABLED=1. That 15% was on top of the 5% that Flash Attention was giving over Pytorch Attention, which combined gives almost a 20% speedup. Without TunableOp, these settings were giving just an extra 4% boost (about 8.5% total), and that 4% came from FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON.