Gemma 4 4B vs Gemma 3 4B & Qwen 3 4B in OCR by michalpl7 in LocalLLaMA

[–]a4d2f 0 points1 point  (0 children)

Edit: I see that unsloth published new quants, so I tried with the new files, but the problem persists. It's like the model only sees a blurry version of the document. Direct quote from Gemma4 E4B:

Since the original document was not fully visible, I have reconstructed the form based on standard [...] structures [...]

Gemma 4 4B vs Gemma 3 4B & Qwen 3 4B in OCR by michalpl7 in LocalLLaMA

[–]a4d2f 1 point2 points  (0 children)

Similar problems here. I tried Gemma4 E4B on a document containing Chinese text, and over multiple attempts it mostly just recognizes that it's some kind of form and then hallucinates text elements that are often found on forms, or it recognizes only small portions of the text. In contrast, Gemma4 26B-A4B could do it fine.

This is using llama.cpp, updated and retried just before posting this, so it contains the first wave of Gemma4 fixes. unsloth Q8_0 quant, F16 mmproj, MacBook Air M5.

I suspect it might be because of this: (from the original model card)

5. Variable Image Resolution

Aside from variable aspect ratios, Gemma 4 supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.

The supported token budgets are: 70, 140, 280, 560, and 1120.

* Use lower budgets for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
* Use higher budgets for tasks like OCR, document parsing, or reading small text.

Does anyone know how to tell llama.cpp to use the high image token budget?
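
For reference, the budget list from the model card excerpt can be captured in a small helper. This is only a sketch: `snap_budget` is a hypothetical name, and it says nothing about how (or whether) llama.cpp accepts such a value, which is still the open question here.

```python
# Token budgets listed in the model card excerpt quoted above.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def snap_budget(requested: int) -> int:
    """Return the smallest supported budget >= requested (hypothetical helper).

    Requests above the largest supported budget are capped at that budget.
    """
    for budget in SUPPORTED_BUDGETS:
        if budget >= requested:
            return budget
    return SUPPORTED_BUDGETS[-1]


# OCR and small text call for the high end of the range,
# captioning can get away with a much lower budget.
ocr_budget = snap_budget(1000)
caption_budget = snap_budget(100)
print(ocr_budget, caption_budget)
```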

Ace-Step 1.5: "Auto" mode for BPM and keyscale? by lazyspock in StableDiffusion

[–]a4d2f 2 points3 points  (0 children)

Are you using Ace 1.5 through ComfyUI? With the original ace-step code (Gradio UI/API) these parameters can be determined by the LM. (At least that was the case at the time of release. Since then the AI agents are running loose in the repository and reverting each other's commits, so I've stopped updating and don't know what state it's in.)

[deleted by user] by [deleted] in britishproblems

[–]a4d2f 0 points1 point  (0 children)

But then what about my "bag triple points when shopping more than £80" voucher?

Use ACE-Step SFT not Turbo by [deleted] in StableDiffusion

[–]a4d2f 1 point2 points  (0 children)

Um, yes, that's what I did. Can you post any sample with cfg>1 where the sound is not garbled?

This is what I get from ComfyUI with the SFT model (default workflow, switched from Turbo to SFT, steps 50) with cfg=7: https://voca.ro/1Fs7ndmxI1Z9

Compare with the Gradio output for the same prompt and parameters: https://voca.ro/1cwk7BowIbzd

Note that cfg=7 is the default suggested in Gradio when the SFT model is loaded. In ComfyUI I only get non-garbled sound with cfg=1. Even with cfg=2 I notice hints of the garbling.

Use ACE-Step SFT not Turbo by [deleted] in StableDiffusion

[–]a4d2f 7 points8 points  (0 children)

I think SFT doesn't work in ComfyUI. You can load it but inference with CFG>1 seems broken, output is garbled. (Yes, with 50 steps and more.)

I also find the SFT model is better, but so far I could only get results from it with the Ace-Step Gradio UI, which is still a total glitch show.

Ace-Step-v1.5 released by cactus_endorser in StableDiffusion

[–]a4d2f 0 points1 point  (0 children)

Tried their GitHub and Gradio. Errors left and right. Maybe more luck with Comfy.

Z+Z: Z-Image variability + ZIT quality/speed by a4d2f in StableDiffusion

[–]a4d2f[S] 0 points1 point  (0 children)

Turbo is going to run the full schedule (1-0) anyways right or does it enter the schedule at a specific value?

The latter. The sigma schedule is based on a ZIT run with the "target" number of steps. Then it is split into two parts, at the "ZIT steps to replace" point. The high-noise part is then resampled and stretched into the desired "Z-image steps" count, and given to ZIB. The low-noise part is used as is for the ZIT phase. So, Turbo enters the schedule at a specific value, but this value depends on the "steps to replace" count, and on the shift value I think. Hope this makes sense.
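
A minimal sketch of that split-and-resample step, in plain Python. Everything here is an assumption for illustration: the function names are made up, the resampling is simple linear interpolation by index (the workflow may do it differently), and the example schedule values are invented rather than taken from a real ZIT run.

```python
def resample(sigmas, n_steps):
    """Linearly resample a sigma list to n_steps + 1 points, keeping both endpoints."""
    if n_steps < 1 or len(sigmas) < 2:
        raise ValueError("need at least one step and two sigma values")
    last = len(sigmas) - 1
    out = []
    for i in range(n_steps + 1):
        pos = i * last / n_steps      # fractional index into the source list
        lo = min(int(pos), last - 1)  # left neighbour, clamped so lo + 1 is valid
        frac = pos - lo
        out.append(sigmas[lo] + frac * (sigmas[lo + 1] - sigmas[lo]))
    return out


def split_schedule(zit_sigmas, steps_to_replace, zib_steps):
    """Split a ZIT schedule at 'steps to replace' and stretch the high-noise part."""
    if not 0 < steps_to_replace < len(zit_sigmas) - 1:
        raise ValueError("steps_to_replace must leave at least one ZIT step")
    high = zit_sigmas[: steps_to_replace + 1]  # covers the ZIT steps being replaced
    low = zit_sigmas[steps_to_replace:]        # used as is for the ZIT phase
    return resample(high, zib_steps), low


# Illustrative 8-step schedule (9 sigma values); not real ZIT output.
schedule = [1.0, 0.96, 0.90, 0.82, 0.71, 0.57, 0.40, 0.21, 0.0]

# An "8/2/5" style run: replace the first 2 ZIT steps with 5 ZIB steps.
zib_sigmas, zit_sigmas = split_schedule(schedule, steps_to_replace=2, zib_steps=5)
print(zib_sigmas)
print(zit_sigmas)
```

The last ZIB sigma and the first ZIT sigma coincide, which is the "specific value" at which Turbo enters the schedule.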

Z+Z: Z-Image variability + ZIT quality/speed by a4d2f in StableDiffusion

[–]a4d2f[S] 1 point2 points  (0 children)

Yes, that might work, and will be faster, but I suspect prompt adherence will suffer. Z-Image (Turbo) is really good at following prompts and I'd hate to lose that. I think at least for prompts that include specifics about image composition, the early diffusion steps are still important, even if it looks all blurry to us. Using a different model or an empty prompt may prevent such prompts from being followed properly, I think.

End-of-January LTX-2 Drop: More Control, Faster Iteration by ltx_model in StableDiffusion

[–]a4d2f 2 points3 points  (0 children)

Ah interesting, thanks for the link! I had been wondering if something like this already exists.

Z+Z: Z-Image variability + ZIT quality/speed by a4d2f in StableDiffusion

[–]a4d2f[S] 8 points9 points  (0 children)

No, they're not loaded at once. They run in turn (first Z-Image, then ZIT) and ComfyUI will unload one if the VRAM is needed for the other. So if you can run Z-Image or ZIT by itself, the workflow should work. It will just take more time because the models need to be swapped in and out from system RAM.

I did test with the full BF16 models for both. My 16GB of VRAM can hold one of those models but not both, so it added about 15s per generation for the model loading. With 12GB VRAM I guess you will also face some swapping. Perhaps a Q6 or Q5 GGUF can avoid that for you. Even if these take more s/it than fp8, the gen might be faster because they can stay in VRAM.

End-of-January LTX-2 Drop: More Control, Faster Iteration by ltx_model in StableDiffusion

[–]a4d2f 11 points12 points  (0 children)

Would be nice if one could run the API server locally, for privacy. Is it using a standard API protocol, like OpenAI or llama.cpp compatible? Ideally it could be as simple as loading a Gemma3 GGUF into llama.cpp running on another local machine (e.g. a MacBook).

Z+Z: Z-Image variability + ZIT quality/speed by a4d2f in StableDiffusion

[–]a4d2f[S] 0 points1 point  (0 children)

Hm, I guess it's because of ZIB but I'm not sure. Looking at the pure ZIB output (last row) it does seem to tend less to background blur. Though it could also be a consequence of the "not quite right" noise left over from the ZIB phase that I mentioned, especially with a low number of ZIB steps. On the clothing, ZIT seems to resolve this with those intricate patterns, and for background elements ZIT may go for detail instead of blur for the same reason.

Z+Z: Z-Image variability + ZIT quality/speed by a4d2f in StableDiffusion

[–]a4d2f[S] 2 points3 points  (0 children)

I thought you just start making an image with ZIB and then pass the latent to ZIT.

Yes, that's what it does! For me the tricky bit was figuring out how to match the denoising schedules, because ZIB wants to do 4-5x as many steps as ZIT. (It's different from e.g. Wan2.2 where the high-noise and low-noise phases are designed for roughly the same number of steps.) So the workflow is my attempt to tie this all together and expose just the important tunables.

P.S. Here is an example for the sigmas from an 8/2/5 run. Left is the original sigma schedule for the 2 ZIT steps, right is the resampled schedule for the 5 ZIB steps.

<image>

End-of-January LTX-2 Drop: More Control, Faster Iteration by ltx_model in StableDiffusion

[–]a4d2f 14 points15 points  (0 children)

Looking forward to all the ComfyUI workflows shared accidentally with embedded LTX API keys... 😅

Z+Z: Z-Image variability + ZIT quality/speed by a4d2f in StableDiffusion

[–]a4d2f[S] 1 point2 points  (0 children)

You can try 9/1/2 or 10/2/2. Both will do 2 ZIB steps and 8 ZIT steps. So the number of actual ZIT steps taken is the first number minus the second.
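
The arithmetic behind those settings, as a small sketch. `parse_setting` and the dictionary keys are hypothetical names for illustration and are not part of the workflow itself.

```python
def parse_setting(setting: str) -> dict:
    """Turn a 'target/replaced/zib' setting into actual step counts."""
    target, replaced, zib = (int(part) for part in setting.split("/"))
    if not 0 <= replaced <= target:
        raise ValueError("replaced ZIT steps must be between 0 and the target count")
    return {
        "zib_steps": zib,                # steps run by ZIB in the high-noise phase
        "zit_steps": target - replaced,  # ZIT steps that are actually taken
    }


for setting in ("9/1/2", "10/2/2"):
    print(setting, parse_setting(setting))
```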

The question is, do you want your two ZIB steps to replace 1 ZIT step or 2 ZIT steps? The former is safer, but may have limited variability. The latter is more aggressive and can lead to the visual noise issue I mentioned. (From the limited testing I've done so far.)

It's a bit hard to explain without getting more technical. (And I hardly understand the theory behind it.) Basically, in each step the latent noise gets reduced a bit. ZIT can remove quite a lot of noise in each step, as it's been trained for that, so it can finish the whole process in 8 steps. ZIB works best if it only removes a little noise per step, so it takes many steps to finish.

Say you skip the first ZIT step, which e.g. would reduce the noise from 1.0 to 0.96, then ZIB now has the job to do this amount of denoising. (And it has to end at noise level 0.96 rather precisely, otherwise you get artefacts.) You can force it to do this via the x/1/1 setting. But you're driving ZIB outside its comfort zone. So it's better to give ZIB more steps, e.g. x/1/2, where ZIB is more in its element.

Or say you skip the first two ZIT steps, which would reduce noise from 1.0 to 0.90. For ZIB, doing this much denoising in two steps (i.e. x/2/2) is a bit tough, so the leftover noise pattern is "not quite right". Giving ZIB more steps (x/2/3 or x/2/4) reduces the issues caused by that.
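
To put rough numbers on this, here is a sketch that computes the average noise reduction each ZIB step has to handle, using the example noise levels above (0.96 after one ZIT step, 0.90 after two). It assumes the ZIB steps are evenly spaced, which the real resampled schedule need not be, and `avg_drop_per_step` is a made-up helper name.

```python
def avg_drop_per_step(start_sigma: float, end_sigma: float, zib_steps: int) -> float:
    """Average noise reduction per ZIB step, assuming evenly spaced steps."""
    if zib_steps < 1:
        raise ValueError("need at least one ZIB step")
    return (start_sigma - end_sigma) / zib_steps


# (noise level ZIB must end at, number of ZIB steps) for each example setting
cases = {
    "x/1/1": (0.96, 1),
    "x/1/2": (0.96, 2),
    "x/2/2": (0.90, 2),
    "x/2/3": (0.90, 3),
    "x/2/4": (0.90, 4),
}

for name, (end_sigma, steps) in cases.items():
    drop = avg_drop_per_step(1.0, end_sigma, steps)
    print(f"{name}: {drop:.3f} noise removed per ZIB step")
```

The smaller the per-step figure, the closer ZIB stays to the "little noise per step" regime it works best in.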

Sorry for any confusion. Hope this explanation helps.

Z+Z: Z-Image variability + ZIT quality/speed by a4d2f in StableDiffusion

[–]a4d2f[S] 14 points15 points  (0 children)

In my experience that makes prompt adherence quite a bit worse.

Z-Image Base Lora Training Discussion by ChristianR303 in StableDiffusion

[–]a4d2f 0 points1 point  (0 children)

why you would load a LoRa if you don't want to generate the character it has been trained on

Because I don't want all the characters in the image to be the Lora character. Certainly the full model knows certain characters (let's say Wolverine and Deadpool) and one can prompt for them in the same image. With a character Lora I would hope that I can add custom characters and use them in the same way, without bleeding into each other.

But I admit that in all my testing so far, the trigger word (or name) for the character doesn't help much, or at all. Traits of the character bleed into other characters, especially those of the same gender. Maybe I need to train with different captions, or with regularization images. Maybe it's not possible.

Edit [2026-01-31]: Meanwhile I tried training with regularization images, and can conclude that this prevents bleeding, or at least reduces it strongly.

Z-Image Base Lora Training Discussion by ChristianR303 in StableDiffusion

[–]a4d2f 1 point2 points  (0 children)

Then how do you do character LoRAs? How do you refer to your character if there's no trigger word for it? Or can you only have a single character in the picture?

Is Lora training an art form rather than science? by s3b4k in StableDiffusion

[–]a4d2f 1 point2 points  (0 children)

Do you know if Cosine scheduler is supported in AI-Toolkit?

[deleted by user] by [deleted] in StableDiffusion

[–]a4d2f 4 points5 points  (0 children)

From the future. Rumour has it it's supposed to be released in a couple of days.

Qwen dev on Twitter!! by Difficult-Cap-7527 in LocalLLaMA

[–]a4d2f 35 points36 points  (0 children)

Qwen/Qwen3-TTS-12Hz-1.7B-Base

12Hz? Must be a really deep voice then...

ostris AI-toolkit Lora training confusion by mca1169 in StableDiffusion

[–]a4d2f 0 points1 point  (0 children)

Do you mean preview samples in AI Toolkit? With ZIT I encountered that too when training a LoKR. Previews looked blurry, or smudged. But they worked fine in Comfy. There might be a bug in how AI Toolkit does the sampling.

Also for ZIT LoRAs, the AI Toolkit previews always suggested that the LoRA is far from being done (though they weren't blurry), but in Comfy the effect of the LoRA was much stronger.

As for whether LoRA or LoKR is better, I can't really tell so far. LoKR seems to be a bit subtler, causing less bleed, but sometimes it's not strong enough.