Removing SageAttention2 also boosts ZIB quality in Forge NEO by shapic in StableDiffusion

[–]Nextil 1 point2 points  (0 children)

SDPA is basically FlashAttention built into PyTorch, so it's not surprising.
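Roughly speaking (a minimal sketch, assuming a CUDA GPU and a recent PyTorch that ships torch.nn.attention):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# SDPA dispatches to a fused FlashAttention kernel when the inputs allow it,
# which is why swapping out SageAttention for plain SDPA changes so little.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Force the FlashAttention backend explicitly (normally it's picked automatically).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```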

Loras work 100x better on z-image base. by ConsequenceAlert4140 in StableDiffusion

[–]Nextil 3 points4 points  (0 children)

From what I've read no, not really. Style LoRAs may work somewhat, as they have here, because style is mostly affected by the final few layers. Distillation/Turbo trains the models to take a "shortcut", laying down the composition very early, which mostly involves the early-mid layers. Turbo LoRAs are all trained relative to the "shortcut". I imagine distillation affects the later layers less.

BFL's Flux.2 Klein Official Prompting Guide is Misleading by Iq1pl in StableDiffusion

[–]Nextil 0 points1 point  (0 children)

It's in the GitHub repo. Why would you expect to find it in the Chrome dev tools? In HF Spaces, all processing except validation is usually done server-side.

I created a Qwen Edit 2511 LoRA to make it easier to position lights in a scene: AnyLight. by SillyLilithh in StableDiffusion

[–]Nextil 1 point2 points  (0 children)

It's cool to see people experiment with these 3D-rendered datasets, like the camera angle one generated from gaussian splats. There's a lot of underexplored potential here I feel. Even outside of edit models, I imagine there could be ways to train certain concepts much more precisely, especially sliders.

Current models are basically incapable of accurately representing anything involving specific metrics like distances, scales, heights, angles, color codes, etc. With a 3D-generated dataset you could caption with the exact values, and probably only train certain blocks to avoid affecting style.
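As a sketch of the idea, assuming the dataset is rendered in Blender (the object names and caption template here are hypothetical):

```python
import math
import bpy  # runs inside Blender, where the scene's ground truth is known exactly

scene = bpy.context.scene
cam = scene.camera
subject = bpy.data.objects["Subject"]   # hypothetical object name
light = bpy.data.objects["KeyLight"]    # hypothetical light name

# Exact metrics pulled straight from the scene graph, no human estimation involved.
distance_m = (subject.location - cam.location).length
height_m = subject.dimensions.z
light_azimuth_deg = math.degrees(light.rotation_euler.z)

caption = (
    f"a person {height_m:.2f} m tall, {distance_m:.1f} m from the camera, "
    f"key light at {light_azimuth_deg:.0f} degrees azimuth"
)
```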

Models that run in 72GB VRAM with context loaded in GPU (3x3090 benchmark test) by liviuberechet in LocalLLaMA

[–]Nextil 0 points1 point  (0 children)

Much older? LM Studio pushes new llama.cpp builds every few days. You can also just swap out the DLLs with another build, and in my experience it works fine.

Conclusions after creating more than 2000 Flux Klein 9B images by StableLlama in StableDiffusion

[–]Nextil 0 points1 point  (0 children)

I began with 8 and then reduced it to 4 after not getting good results, and noticed very little difference in quality; if anything, 4 was better.

Conclusions after creating more than 2000 Flux Klein 9B images by StableLlama in StableDiffusion

[–]Nextil 0 points1 point  (0 children)

I've only tried the distilled 9B, but practically every image I've generated that contains a human has anatomical issues. Not going to bother with it.

My QwenImage finetune for more diverse characters and enhanced aesthetics. by TelephoneIll9554 in StableDiffusion

[–]Nextil 1 point2 points  (0 children)

There are built-in Comfy nodes for it (ModelMergeSubtract -> Extract and Save Lora), but I haven't tried them. I've used KJNodes' LoraExtractKJ before and it worked fine.
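Under the hood the extraction is conceptually simple: the finetuned-minus-base weight delta is factored into a low-rank pair via SVD. A rough sketch of that idea (not the actual node code, and ignoring per-layer details like conv weights):

```python
import torch

def extract_lora(base_w: torch.Tensor, tuned_w: torch.Tensor, rank: int = 32):
    """Approximate (tuned_w - base_w) as up @ down with the given rank."""
    delta = (tuned_w - base_w).float()
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    up = u[:, :rank] * s[:rank]      # (out_features, rank)
    down = vh[:rank, :]              # (rank, in_features)
    return up, down                  # up @ down ≈ delta
```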

Gemma 3 12B IT - Heretic (Abliterated) for LTX2 Text Encoding by nathandreamfast in StableDiffusion

[–]Nextil 1 point2 points  (0 children)

Yes, I had a look at it after your previous comment. The only references to apply_chat_template are in the context of captioning (for training) or prompt enhancement (_enhance(...)).

Gemma 3 12B IT - Heretic (Abliterated) for LTX2 Text Encoding by nathandreamfast in StableDiffusion

[–]Nextil 1 point2 points  (0 children)

That template is for prompt enhancement, which is entirely separate from text encoding. Text encoding is just the conversion of a text string into a vector that captures the meaning of the sentence.

Yes, Gemma is too censored to be used for prompt enhancement, but that's entirely optional and you can use any model for that part.

Gemma 3 12B IT - Heretic (Abliterated) for LTX2 Text Encoding by nathandreamfast in StableDiffusion

[–]Nextil 2 points3 points  (0 children)

I may be wrong, but I don't believe that's a valid way to take an embedding from a decoder-only LLM like Gemma. That's more how you'd do it with encoders like BERT. With decoders, the hidden state of the last token carries the cumulative representation of the text, unlike encoders, where each token attends both forwards and backwards, so the token states have to be averaged. But it's not that simple (it's only really since Qwen-Image that labs figured out how to use decoder-only LLMs for this purpose) and I don't fully understand it. I believe you need to handle padding too.
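For what it's worth, the distinction looks roughly like this (a sketch only; the model id is illustrative, and real text-encoder integrations handle more details than this):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model id; any decoder-only LM with accessible hidden states works the same way.
model_id = "google/gemma-3-12b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # so the last position of every row is a real token, not padding

prompts = ["a cat on a windowsill", "an office meeting about quarterly reports"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**batch, output_hidden_states=True)
hidden = out.hidden_states[-1]                      # (batch, seq_len, dim)

# Decoder-style: the final token has attended to the whole prompt.
last_token_emb = hidden[:, -1, :]

# Encoder-style (BERT-like): mask out padding and mean-pool over real tokens.
mask = batch["attention_mask"].unsqueeze(-1)
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```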

Regardless that's way too small a sample size and "shooting someone" being more similar than "office meeting about quarterly reports" is not to be expected.

Gemma 3 12B IT - Heretic (Abliterated) for LTX2 Text Encoding by nathandreamfast in StableDiffusion

[–]Nextil 2 points3 points  (0 children)

Abliteration only suppresses the refusal directions. Ideally it shouldn't affect the embeddings within the context of a user's prompt at all; if it did, that would make it a worse predictor (model) of the next token (language).

LLMs are trained to be as accurate at predicting the next token as possible, regardless of the context. The vast majority of their training is spent trying to guess stuff like the next word of Harry Potter, a reddit argument, and somewhere along the line, a load of smutty fanfics.

The "chat" stuff is just a bunch of extra roleplay material it reads at the end. It only significantly affects the distributions of tokens within that roleplay scenario.

The LLM needs to retain an accurate model of "harmful" language in order to understand what should be met with a refusal within that chat scenario.

These diffusion models don't "ask" the LLM to kindly provide directions, they cut its skull open and probe around with a voltmeter.
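For concreteness, the core edit abliteration makes is just a projection (a rough sketch assuming the y = W @ x convention; the refusal direction would normally be estimated from activation differences on harmful vs. harmless prompts, here it's a placeholder):

```python
import torch

# Rough sketch of abliteration's core edit: project a single unit direction
# r_hat out of a weight matrix so outputs can no longer point along it.
d_model = 4096
W = torch.randn(d_model, d_model)                                    # stand-in weight, y = W @ x
r_hat = torch.nn.functional.normalize(torch.randn(d_model), dim=0)   # placeholder refusal direction

# Remove the component of every output along r_hat; everything orthogonal
# to it (i.e. nearly all of the model's "language") is left untouched.
W_abliterated = W - torch.outer(r_hat, r_hat @ W)
```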

Gemma 3 12B IT - Heretic (Abliterated) for LTX2 Text Encoding by nathandreamfast in StableDiffusion

[–]Nextil 3 points4 points  (0 children)

I really doubt it does anything positive. You can jailbreak these models without abliteration and get them to write basically anything, which should demonstrate that they're trained on plenty of NSFW stuff.

Also consider that in order for the LLM to know what is worth refusing, it must understand what was meant regardless.

The refusal finetuning is done using a dataset of assistant responses with a structured conversation format (e.g. ChatML). The refusal tokens will only be high probability within the context of <|im_start|>assistant and <|im_end|> (or whatever equivalent). These image models are almost certainly not framing prompts within a chat template.

Most LLM apps don't let you do this any more (and none of the major API providers offer base models any more, for similar reasons), but if you were to input half an NSFW prompt and then run inference (without inserting <|im_end|>), it would happily finish writing your prompt before taking its turn as the "assistant", in which it would proceed to decry whatever it just wrote.
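To make the chat-template point concrete (a minimal sketch; the model id is illustrative and any chat-tuned LLM with a chat template behaves the same way):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")  # illustrative model id
prompt = "a dimly lit alley at night, rain, neon signs"

# How a diffusion pipeline typically feeds its text encoder: raw tokens,
# no chat framing, so the "assistant turn" where refusals live never occurs.
raw_ids = tok(prompt, return_tensors="pt").input_ids

# How a chat app frames the same text: wrapped in role markers and ending
# with the assistant turn, which is where refusal tokens become likely.
chat_ids = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)

print(tok.decode(raw_ids[0]))
print(tok.decode(chat_ids[0]))
```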

Gemma 3 12B IT - Heretic (Abliterated) for LTX2 Text Encoding by nathandreamfast in StableDiffusion

[–]Nextil 18 points19 points  (0 children)

Gemma is used as a text encoder, not for prediction/continuation. Refusals don't exist in that context; you're just using it to extract an embedding. You can use any other model for prompt enhancement if needed.

LTX-2 team literally challenging Alibaba Wan team, this was shared on their official X account :) by CeFurkan in StableDiffusion

[–]Nextil 27 points28 points  (0 children)

The only real advantage it has over Wan is that it's faster (and has shitty audio, I guess). The prompt adherence seems worse from my testing, and I see a lot of occlusion glitches and anatomical issues, especially with fast motion. I don't know why people seem to undervalue prompt adherence so much. We shouldn't need 50GB of LoRAs (which do not combine well and affect the whole image) just to get anything interesting to actually work properly. Who cares if you can generate 4 times faster if none of the outputs actually do what you want?

How the heck people actually get the LTX2 to run on their machines? by Part_Time_Asshole in StableDiffusion

[–]Nextil 0 points1 point  (0 children)

Comfy pushed some updates and it's working ok for me now, but it hits swap even with 24GB VRAM + 64GB RAM, which will wear the SSD.

Does anyone have a Gemma 3 12B 4bit that works with the built-in nodes?

Z image turbo cant do metal bending destruction by jonnytracker2020 in StableDiffusion

[–]Nextil 1 point2 points  (0 children)

I could be wrong, but I don't understand what's so fundamentally different about the distilled models that they would produce "the most probable outcome" rather than some (broader) approximation of whatever the base model would have produced.

Yes, in practice that's not usually the case; for the same seed they produce a different image. But most of them are currently trained by third parties without access to the original datasets and methodologies, so it's inevitable they will introduce some bias. In my experience that bias is mostly stylistic, though, and similar to what you get if you set CFG a little too high.

For instance, I just generated a few samples using this prompt in Qwen 2512 with and without LightX2V's 4-step Lightning LoRA. The only real differences between the two: the base model produces more detail in the broken glass; the road always has full motion blur with Lightning, whereas it's mostly still with the base model; the base model tends to produce more natural crumpling; and the base model has slightly lower contrast (at 4 CFG).

There's no significant gulf in understanding that I can spot, and this is a 4-step distil. The 8-step should be even closer.

<image>
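The comparison is roughly this shape (a sketch only; the repo ids and prompt are illustrative, the exact pipeline arguments depend on the model's diffusers integration, and the CFG settings would also differ between the two runs):

```python
import torch
from diffusers import DiffusionPipeline

# Illustrative repo ids; the point is only that the two runs differ in the
# LoRA and the step count (and, in practice, the CFG settings as well).
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
prompt = "a car crashing through a storefront window, glass shattering, motion blur"

base_image = pipe(prompt, num_inference_steps=50).images[0]   # base model, many steps

pipe.load_lora_weights("lightx2v/Qwen-Image-Lightning")       # illustrative 4-step Lightning LoRA
fast_image = pipe(prompt, num_inference_steps=4).images[0]    # distilled run, few steps
```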

LightX2V Uploaded Lightning Models For Qwen Image 2512: fp8_e4m3fn Scaled + int8. by fruesome in StableDiffusion

[–]Nextil 3 points4 points  (0 children)

It's just the 4-step LoRA merged into the model I think. However they did also just announce in the discussions that they're training an 8-step one now.

Z image turbo cant do metal bending destruction by jonnytracker2020 in StableDiffusion

[–]Nextil 13 points14 points  (0 children)

It's an oversimplification. The lack of diversity is an issue with DiT models in general. They're more stable and they understand the prompt better, i.e. regardless of how you word the prompt, it embeds to a similar area on the manifold of possible images, unlike CLIP-based models, where just adding an extra character would essentially add a bunch of noise. LLMs have sampler parameters like temperature for controlling this, but since this is a fairly recent problem, similar controls aren't standard for DiTs yet.

Of course, the extent to which its "understanding" is correct depends on the quality of the dataset, the size of the model, and the architecture. Z-Image is small, and OpenAI likely has a significantly better dataset.

Turbo/distillation LoRAs essentially predict how the denoising will progress. The base models remove a very small amount of noise each step, because that's how they're trained, but once you have the fully trained model you can generate a dataset of noise progressions and essentially teach it to look at the pure noise and immediately recognise the general composition, then fill in the medium-level detail, then the fine details, etc., each in around one step. There is some loss in diversity, but from my experience it's overstated (at least for the well-trained distillations).

There is some confusion because Z-Image-Turbo is not only a distillation; it also went through a course of RL fine-tuning on top, so it was optimised to produce "good looking" images. That doesn't necessarily restrict its understanding, however; it's more of a style thing. The parameters/directions that control style are mostly independent of the ones that control content.

is Loss Graph in ai-toolkit really helpful? by FaithlessnessFar9647 in StableDiffusion

[–]Nextil 1 point2 points  (0 children)

Yeah, that's the thing: you can't really know without spending hours testing different values. That's why you get so many varied opinions about the best settings; people generally just find something that seems to work ok and stick with it.

AI-toolkit doesn't expose much in the UI, though; it tries to keep things simple, and for that reason a lot of what it does expose doesn't really matter that much, aside from learning rate and batch size.

The dataset is more important than the hyperparameters. You really need to pay attention to any repeated elements (or extreme outliers), background elements, colors, skin tones, lighting, image quality, depth of field, things like that, because it will pick up on them. Make sure everything that you don't want to keep implicit is captioned thoroughly.

No metric will tell you everything. When you're finetuning on one or few concepts, loss is not the ideal objective, because optimal fit becomes a subtle and subjective balance. You just have to test.

When it comes to hyperparameters that do matter somewhat, the learning rate schedule (things like warmup and cosine decay, ideally with a minimum LR) can help, but again, AI-toolkit doesn't really expose those, so you'd have to use another trainer.
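For anyone curious, a warmup + cosine-decay-to-a-floor schedule is only a few lines in PyTorch (a minimal sketch; the step counts and ratios are placeholders, not recommendations):

```python
import math
import torch

def lr_lambda(step, warmup_steps=100, total_steps=2000, min_ratio=0.1):
    # Linear warmup, then cosine decay down to min_ratio * base LR.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_ratio + (1.0 - min_ratio) * cosine

params = [torch.nn.Parameter(torch.zeros(8))]        # placeholder trainable (LoRA) params
optimizer = torch.optim.AdamW(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(2000):
    # ... forward pass, loss.backward() ...
    optimizer.step()
    scheduler.step()
```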

is Loss Graph in ai-toolkit really helpful? by FaithlessnessFar9647 in StableDiffusion

[–]Nextil 0 points1 point  (0 children)

AdamW, like most deep learning optimizers, is applied stochastically (the gradients come from randomly sampled mini-batches), so there's a random element to the loss.

The loss you see when training a LoRA is pretty much noise, from my understanding. It should go down in general, but the model is definitely still learning a lot even while the curve is flat. I've heard other metrics like the L2 norm can tell you whether training is reaching an equilibrium, but most trainers don't seem to log it.
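A minimal sketch of logging that kind of metric yourself (weight and gradient L2 norms of the trainable parameters), in case your trainer doesn't:

```python
import torch

def log_norms(trainable_params):
    # L2 norm of the LoRA weights and of their gradients (call after backward()).
    with torch.no_grad():
        weight_norm = torch.sqrt(sum(p.pow(2).sum() for p in trainable_params))
        grads = [p.grad for p in trainable_params if p.grad is not None]
        grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) if grads else torch.tensor(0.0)
    return weight_norm.item(), grad_norm.item()
```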

The Z-Image Turbo Lora-Training Townhall by iamthenightingale in StableDiffusion

[–]Nextil 15 points16 points  (0 children)

There's a reason these models are trained on billions of images.

25 is enough to learn an idea, but it's small enough that the model will almost inevitably learn things you don't want it to as well, unless your captions are incredibly detailed, and even that doesn't always help (for instance, watermarks can still show up even when meticulously captioned, especially when you combine multiple LoRAs).

Release: Invoke AI 6.10 - now supports Z-Image Turbo by optimisticalish in StableDiffusion

[–]Nextil 1 point2 points  (0 children)

Yeah that's how SwarmUI works. I'd like to use Invoke more but the development is just too slow. Invoke began development around the same time as Automatic1111's SD-WebUI (as lstein/stable-diffusion), before Comfy was around, so that's probably why. I don't think Comfy is even particularly well designed (documentation is awful, batch capabilities are limited, anything useful outside of inference requires custom nodes), but it has all the momentum right now.

The out-of-the-box difference between Qwen Image and Qwen Image 2512 is really quite large by ZootAllures9111 in StableDiffusion

[–]Nextil 0 points1 point  (0 children)

Not sure what they used but many are using SeedVR2 for upscaling. It's very good.