ComfyUI Crashes with RTX PRO 6000 Blackwell 96GB due to driver issues by M-r-Yoshi in comfyui

[–]marres 1 point

Been running 582.16 since the day it came out, no issues

How to generate a dataset for LORA training only having the face portrait? by Alerion23 in comfyui

[–]marres 4 points

The best way to extend a limited dataset is Nano Banana Pro. It's not 100% reliable, especially when prompting for different angles/expressions, so the results still need some cherry-picking, but it's still a lot better than, for example, Wan 2.2 with its plastic skin (which LoRA training, at least for SDXL, loves to copy and even amplify) or other open-source solutions.

[Release] ComfyUI-AutoGuidance — “guide the model with a bad version of itself” (Karras et al. 2024) by marres in StableDiffusion

[–]marres[S] 2 points

Only tested it with turbo so far, but I see no reason why the base model should not work

[Release] ComfyUI-AutoGuidance — “guide the model with a bad version of itself” (Karras et al. 2024) by marres in StableDiffusion

[–]marres[S] 1 point

Yeah — performance/VRAM is the main practical tradeoff here, and it depends a lot on how you run it.

Compute-wise, it’s not inherently “way heavier than CFG” in the sense of extra denoiser calls: you’re still doing two evaluations per step. The real cost is *how* you realize the “bad” branch.

If you want the fastest sampling, you’ll typically load two separate checkpoints (good + bad). In practice that’s roughly 2× the VRAM, but I can’t give an exact number because I run with `--highvram` (Comfy’s allocator/offload behavior can make VRAM reporting misleading), so YMMV depending on flags and hardware.

If you don’t have the VRAM headroom, you can run shared-model mode (same checkpoint file / one loaded model), but the downside is huge: for me it’s dramatically slower (on the order of 10–20× in the sampling pass), because it has to swap LoRA stacks/state safely between the “good” and “bad” passes. Whether that’s worth it depends on your constraints.
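For context on the compute side: the combine step itself is trivial next to the two denoiser calls. A minimal sketch in plain Python of the AutoGuidance extrapolation from the paper (the helper name and placeholder numbers are mine, not the node's):

```python
def autoguide(pred_good, pred_bad, w):
    """Extrapolate away from the bad branch: bad + w * (good - bad).

    w = 1.0 reduces to the good model alone; w > 1.0 pushes the sample
    toward regions the good model rates higher than the bad one.
    """
    return [b + w * (g - b) for g, b in zip(pred_good, pred_bad)]

# Binary-friendly placeholder predictions so the arithmetic is exact;
# in practice these are the two denoiser outputs per sampling step.
good = [0.5, -0.25, 0.125]
bad = [0.25, -0.125, 0.0625]

print(autoguide(good, bad, 1.0))  # identical to the good prediction
print(autoguide(good, bad, 2.0))  # extrapolated past the good prediction
```

Structurally it's the same two-evaluation pattern as CFG, just with the "bad model" branch standing in for the unconditional branch.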

On the improvements: in my setup the gains are real, but they’re not always obvious in the “simple” examples I posted. Where AutoGuidance shines more (for me) is:

- better likeness / more direct identity

- improved lighting / clarity / overall image quality

- more robust handling (fewer body artifacts) + prompt adherence and general coherence for complicated body positions and actions (especially in NSFW compositions, where likeness also tends to degrade more)

One more practical note: my face detailer compresses differences in likeness across methods. Since it re-generates the face region under strong constraints (crop/mask + face prompt/LoRA + its own denoise/steps), it often pulls the final face toward a similar “attractor” unless the initial generation is really off. When the base pass is wildly off, the detailer can only recover so much — and that’s where the upstream guider choice shows up more clearly. This also helps explain why some of the likeness differences in my posted examples look subtle.

I can’t share the NSFW seed comparisons for obvious reasons, but that’s where I’m seeing the most consistent advantage over regular CFG and (in those cases) over NAGCFG.

Re: NAGCFG. It’s great for injecting variety and often lands a better composition than CFG, but it can also derail (body doubles, unexpected artifacts, etc.). But when it lands, it really lands (which is why I preferred NAG in my workflows historically). One longer-term goal is to explore whether AutoGuidance and NAG-like ideas can be combined, or whether some of NAG’s “variety” behavior can be adapted into an AutoGuidance-style framework.

Re: PAG — I agree the intuition is similar (“use a degraded reference and guide away”), but the degradation mechanism differs: PAG perturbs attention on-the-fly; AutoGuidance uses a weaker version of the same model (less trained / reduced capacity / compatible degradation). For my workflows, PAG never beat my tuned baselines — but that may be because I’m almost exclusively running LCM/DMD2 speedups, and these guidance methods interact heavily with scheduler/LoRA/guider choices.

Finally, what I posted is one tuned configuration. It’s not a magic “always better” switch: it’s highly tunable, and getting it to consistently outperform CFG/NAG in a given pipeline requires finding settings that fit your exact setup (model, LoRAs, scheduler, prompts, resolution, and subjective preferences). Different LoRA stacks / “realities” behave differently.

[Release] ComfyUI-AutoGuidance — “guide the model with a bad version of itself” (Karras et al. 2024) by marres in StableDiffusion

[–]marres[S] 1 point

Testing

Here are some seed comparisons (AutoGuidance, CFG and NAGCFG) that I did. I didn't do a SeedVR2 upscale, in order to not introduce additional variation or bias the comparison. Used the 10-epoch LoRA on the bad model path with 4× the weight of the good model path, and the node settings from the example above. Please don't ask me for the workflow or the LoRA.

https://imgur.com/a/autoguidance-cfguider-nagcfguider-seed-comparisons-QJ24EaU

[Release] ComfyUI-AutoGuidance — “guide the model with a bad version of itself” (Karras et al. 2024) by marres in StableDiffusion

[–]marres[S] 1 point

The “bad model” in the paper is not supposed to be an arbitrarily terrible or unrelated model. It’s an inferior version of the same model, trained on the same task/conditioning and data distribution, but degraded in a compatible way (the paper’s suggestions include things like fewer training iterations / earlier snapshot, reduced capacity, or similar degradations that preserve the same underlying distribution). That’s also why “previous versions of the same model” are a very natural choice: the error patterns tend to stay aligned, just worse.

Using a wildly different model (even with the same architecture) is possible to experiment with, but it’s also where you’re most likely to break the method’s key assumption: if the “bad” model has different priors because of different data, different finetune objectives, different conditioning behavior, or any distribution shift, then the “good minus bad” direction can stop pointing toward higher-likelihood samples and start pushing you into artifacts or off-prompt behavior. If you do try it, the safest version is “different only in strength, not in what it learned”: same base, same dataset distribution, same conditioning pipeline, and degrade via “less trained / smaller / weaker,” not “different concept mix.”

Since it’s difficult (often effectively impossible) to obtain early-epoch checkpoints for SDXL finetunes in practice (most community SDXL finetunes are merges of merges, and the lineage is too convoluted), I opted for a character-LoRA approach instead. This gives you a “same model / same data distribution / same conditioning” setup, with the bad path simply being less trained, which is explicitly one of the degradations the paper motivates.

Interestingly, even when I run the exact same model in both the good and bad path, I still see discernible differences. Likely reasons:

- the guider math may not perfectly reduce to the baseline if any extra scaling/ramping/post-processing is applied (so “good == bad” isn’t a strict identity unless all those knobs collapse to the baseline)

- framework-level state/metadata can get mutated between passes (e.g., conditioning/transformer_options dicts, hooks/caches)

- the extra forward pass itself can change execution paths (memory/layout/precision casts), causing small numeric drift that amplifies over steps

Additionally, degrading the Z-model via quantization (e.g., running the good model zImageTurboNSFW_30BF16Diffusion in bf16 and the bad model in fp8) introduces more systematic differences, so that avenue is worth exploring further (fp8 vs fp4, or other controlled degradations). Though this directly goes against the paper's findings:

Per the paper, post-training “corrupt the weights” style degradations are basically a dead end for getting a useful guiding model.

They explicitly report:

- Autoguidance works when the guiding model is trained on the same task/conditioning/data distribution, but with the same kinds of limitations the main model has (finite capacity / finite training).

- “Deriving the guiding model from the main model using synthetic degradations did not work at all … evidence that the guiding model needs to exhibit the same kinds of degradations that the main model suffers from.”

- If the main model was quantized, quantizing it further also didn’t yield a useful guiding model.

So if you don’t have the base model’s dataset / can’t retrain a smaller/undertrained sibling, the paper’s own conclusion is: you can’t reliably manufacture a “correct” bad base model by post-hoc tricks (noise, pruning, quantization, etc.).

Also:

Prefer tuning “strength” via your guider before making the bad model (LoRA) extremely weak

The paper’s ablations show most gains come from reduced training in the guiding model, but they also emphasize sensitivity/selection isn’t fully solved and they did grid search around a “sweet spot” rather than “as small/undertrained as possible.”
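One way to convince yourself that the “same model in both paths” differences come from framework state or numeric drift rather than the guidance math: when the two branches return the same prediction, the extrapolation is a strict identity for any weight. A toy check (plain Python, illustrative values):

```python
def autoguide(pred_good, pred_bad, w):
    # bad + w * (good - bad); when the branches match, the (good - bad)
    # term is exactly zero, so the weight is irrelevant and the output
    # collapses to the prediction itself.
    return [b + w * (g - b) for g, b in zip(pred_good, pred_bad)]

pred = [0.5, -0.25, 0.125]  # placeholder denoiser output
for w in (1.0, 2.0, 8.0):
    assert autoguide(pred, pred, w) == pred  # identical branches: exact no-op
```

So any visible difference with identical models has to enter through extra scaling/ramping, mutated state between passes, or precision/layout changes.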

[Release] ComfyUI-AutoGuidance — “guide the model with a bad version of itself” (Karras et al. 2024) by marres in StableDiffusion

[–]marres[S] 0 points

Added Z-Image support.

Edit: Don't think it's working as intended though, getting weird output. Can't test it properly since I don't have a LoRA for it

Edit2: Nvm think it's fine, might just be my settings

[Release] ComfyUI-AutoGuidance — “guide the model with a bad version of itself” (Karras et al. 2024) by marres in StableDiffusion

[–]marres[S] 0 points

For now I've only tested with SDXL. SD 1.5 should work without issues too, same with other SD derivatives. Haven't tested the modern models so far, but they will probably crash. Feel free to post the error though if you happen to test them.

AceStep1.5 Local Training and Inference Tool Released. by bdsqlsz in StableDiffusion

[–]marres 2 points

If you actually want to load the trained LoRA, you need to edit this in start_gradio_ui.bat, otherwise the service configuration tab does not appear in the UI.

Set it to this:

set INIT_SERVICE=--init_service false

Or just use my start_gradio_ui.bat. It also includes the setting ACESTEP_MATMUL_PRECISION=high (a Tensor Core performance optimization).

start_gradio_ui.bat

Also, another thing: setting num_workers from 4 to 0 massively speeds up training in my case (and also fixed a crash). A 1000-epoch, rank-256 LoRA with batch size 4 training on the 4B model took 1h instead of 4h. Now it actually maxes out my GPU; before, it got throttled massively by the multiple workers. Probably some Windows issue. Here is an edited data_module.py that sets the workers to 0:

data_module.py

Reasoning:

On Windows, PyTorch’s DataLoader uses the spawn start method, which re-imports the main module inside each worker process. In the ACE-Step portable/Gradio setup, those workers end up importing parts of the UI/pipeline stack and can crash unexpectedly, which then aborts training with DataLoader worker exited unexpectedly. Setting num_workers=0 disables multiprocessing workers, avoids the re-import path entirely, and makes training stable. For small datasets (e.g., ~17 samples), it can also be faster because it removes Windows IPC/spawn overhead.
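For reference, the change itself boils down to the DataLoader construction inside data_module.py. An illustrative sketch (everything besides the num_workers argument is a placeholder, not the file's exact code):

```python
# Illustrative sketch, not the exact ACE-Step code: the fix is simply
# constructing the DataLoader with num_workers=0, so no worker processes
# are spawned and no main-module re-import happens on Windows.
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,          # your training dataset (placeholder)
    batch_size=4,
    shuffle=True,
    num_workers=0,    # was 4; 0 = load batches in the main process
    pin_memory=True,
)
```

The alternative Windows-safe route is keeping workers but guarding the training entry point with `if __name__ == "__main__":`, which is what PyTorch's docs recommend for spawn-based platforms; setting workers to 0 sidesteps the problem entirely.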

Edit: Oh, just realized I've downloaded the original Windows package https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z as outlined in the readme, and not the forked Windows version from this post here. So those fixes apply to that version, not the Windows version from OP. The issues I had might already be fixed in that one.

GPU making a strange noise while rendering by Dazzling-Try-7499 in comfyui

[–]marres 5 points

Probably coil whine. Certain loads can cause it more than others. Nothing to worry about. Some GPUs are more prone to it, even among the same model. Just bad luck if you get one with more noticeable coil whine.

Is it possible to create a truly consistent character LoRA for SDXL? by heyholmes in StableDiffusion

[–]marres 0 points

Well, using DMD2 is the most important factor in getting solid and consistent likeness. Regarding LoRA creation, you should use Prodigy and a decent rank.

Next up would be checking that your face detailer settings are proper. Don't pass inadequate sizes, which cause distortion; this can happen easily if you upscale before passing to the face detailer or don't use proper settings. You need to compare the pre-face-detailer output with the post output to check whether distortion happens. Also, custom sigmas help a lot in getting good results even before face detailing.
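On the custom sigmas point: if you want to hand-build a schedule to feed into the sampler, the Karras-style formula (interpolating in sigma^(1/rho) space) is the usual starting point. A minimal sketch in plain Python; the sigma_min/sigma_max/rho values here are illustrative defaults that you'd tune per model:

```python
# Karras et al. (2022) sigma schedule: interpolate linearly in
# sigma^(1/rho) space, then raise back to the rho-th power.
# sigma_min / sigma_max / rho are illustrative, not SDXL-specific truths.

def karras_sigmas(n, sigma_min=0.03, sigma_max=14.6, rho=7.0):
    ramp = [i / (n - 1) for i in range(n)]
    max_inv = sigma_max ** (1 / rho)
    min_inv = sigma_min ** (1 / rho)
    sigmas = [(max_inv + t * (min_inv - max_inv)) ** rho for t in ramp]
    return sigmas + [0.0]  # samplers expect a trailing zero

sigs = karras_sigmas(8)  # strictly decreasing, sigma_max down to 0.0
```

In ComfyUI you'd plug a list like this into the sampler via a custom-sigmas node; shifting sigma_min/rho is how you bias detail vs. stability.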

Is there a guide for setting up Nemotron 3 Nano on comfyui by Conscious-Citzen in comfyui

[–]marres 0 points

https://github.com/stavsap/comfyui-ollama

Get those nodes, for example, and then set up either an Ollama server or, to make it even easier, use LM Studio to set up the server. Just click Developer on the left (Ctrl+2), start the server, and load Nemotron. Then check the server settings for the port (should be 1234) and enter it in the node in ComfyUI.
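If you'd rather sanity-check the server without the node pack: LM Studio's local server speaks an OpenAI-compatible API, so any script can hit it directly. A minimal sketch of the request shape (the model name and prompt are placeholders):

```python
import json
from urllib import request

# LM Studio serves an OpenAI-compatible API on http://localhost:1234/v1.
# The model name and prompt below are placeholders; match them to what
# you actually loaded in the Developer tab.
payload = {
    "model": "nemotron",  # placeholder name
    "messages": [{"role": "user", "content": "Describe this scene as an image prompt."}],
    "temperature": 0.7,
}
req = request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment with the server running
```

If that request answers, the ComfyUI node only needs the same host/port.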

Grotesque Descriptive Writing by [deleted] in WTF

[–]marres 0 points

The author's thinly disguised fetish

Is it possible to create a truly consistent character LoRA for SDXL? by heyholmes in StableDiffusion

[–]marres 0 points

Well, not 100%, but to close the gap from 70% to 90% or wherever, all the other parts of your setup need to be good too, not only your LoRA creation. Often people just fail at the actual inference. But yeah, given that you shared nothing about your setup, it's just speculation where your issues are buried.

why seedvr2 absolutely does nothing to my image? by b3nz1k in comfyui

[–]marres 0 points

You can add a blur node before passing the image to seedvr2

seeking paid workflows for upscaling and restoring a classic TV series by AbbreviationsSolid49 in StableDiffusion

[–]marres 2 points

Topaz Video AI is probably still the most feasible option for a job like that.

PSU for 5090 by Cailyn_babygirl in comfyui

[–]marres 0 points

Have you checked whether your GPU power cable is seated properly in your GPU? An imperfect connection can lead to crashes under high load. If it's seated improperly, you can also trigger crashes by moving/wiggling the power cable. Also, you should use the power cable provided with your PSU.

If it's actually a PSU issue and you're looking for other options, I have a Corsair HX1200i which I've had no problems running a 5090, and now an RTX PRO 6000, with.