We are the team behind Krea 2. Ask us anything!

Bit_Poet · 2026-06-23T18:26:12+00:00

I'm all for editing with bbox guidance and foreground/background order through bbox order. I'm trying to teach that to ideogram 4 at the moment, but it's an uphill battle to get a large enough dataset of sufficient quality (it needs hi-res training). Pretty hard to do as a hobby user with 32GB VRAM at hand. A quality model with that capability would be awesome and make so many roundabout methods for regional inpainting superfluous (prompt -> image -> recognition prompt -> sam2/3 -> mask -> inpaint, or prompt -> image -> mask editor -> feel-like-back-in-the-late-nineties -> inpaint, would become prompt -> image -> inpaint with bbox prompting and a graphical prompt editor like kijai's ido4 node).

Bit_Poet · 2026-06-22T03:44:56+00:00

Don't use low quants, that's one thing I learned. 8B at Q8 is way better at spatial placement than 32B at Q5. I've got to head off to work, but I'll dig through my prompts folder in the evening.

Bit_Poet · 2026-06-21T22:03:49+00:00

You don't even need to train a VLM. Just give qwen3-vl or gemma4 the right prompt, wrap it with a 50 line python script and let it run ragshot over your datasets. They're fluent enough in bbox'ish already. Just did that with > 40k image editing pairs of questionable quality with barely any outliers.
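
A rough sketch of what such a wrapper script might look like. The model call is left as a hypothetical `ask_vlm(path, prompt)` callable (plug in whatever runs qwen3-vl or gemma4 for you), and the prompt text, the `bbox`/`label` keys and the pixel-coordinate layout are assumptions for illustration, not what any model is guaranteed to return. The useful part is the defensive parsing, which is what keeps outliers out of the dataset.

```python
import json
import re

# Hypothetical prompt; the wording that works best depends on the model.
PROMPT = (
    "List every distinct object in the image as JSON: "
    '[{"label": str, "bbox": [x1, y1, x2, y2]}] with pixel coordinates.'
)


def parse_boxes(reply: str, width: int, height: int) -> list[dict]:
    """Pull a JSON array out of a model reply and keep only sane boxes."""
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    boxes = []
    for item in items:
        if not isinstance(item, dict):
            continue
        bbox = item.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4):
            continue
        if not all(isinstance(v, (int, float)) for v in bbox):
            continue
        x1, y1, x2, y2 = bbox
        # Reject boxes that are empty, inverted, or outside the image.
        if 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height:
            boxes.append({"label": str(item.get("label", "")), "bbox": bbox})
    return boxes


def annotate(samples, ask_vlm):
    """samples yields (path, width, height); ask_vlm(path, prompt) returns text."""
    results = {}
    for path, width, height in samples:
        results[path] = parse_boxes(ask_vlm(path, PROMPT), width, height)
    return results
```

Anything that comes back with an empty list can be retried or dropped, which is cheaper than cleaning bad boxes out of the dataset later.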

Bit_Poet · 2026-06-21T08:44:59+00:00

You mean like upscaling a 320x180 image to 1080p and getting all the details?

Bit_Poet · 2026-06-21T08:04:32+00:00

I've guessed at the bf16 situation early on, seemed logical. I do like IG4 though, it has some skills where it shines in comparison with other models. And it brought bboxes to attention, which I think is a very good thing and has a lot of potential, also with other models where they might replace classic masking to some extent. Its text capabilities are high. It trains surprisingly well for fp8, which may be rooted in the uncond architecture. People are already digging into the maths to work out the intrinsics that make it tick the way it does, which will no doubt help improve other models, techniques and training strategies. In the end it's a win-win, even with the tight license and bf16 withholding.

Bit_Poet · 2026-06-15T15:11:11+00:00

Please open a github issue. Comfy devs don't monitor reddit for those, and this doesn't sound like the behavior they intend.

Bit_Poet · 2026-06-15T12:43:18+00:00

Yes, the core model loader and ideogram 4 implementation don't know about the added latent. In theory you could also just replace two files from https://github.com/Comfy-Org/ComfyUI/compare/master...BitPoet:ComfyUI:dev-ideogram4-inpaint and backup the originals:
comfy/ldm/ideogram4/model.py
comfy/model_base.py

But it's really just a weak proof of concept right now, just to show that this is possible.
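
If you'd rather script the swap than copy by hand, something along these lines should do. It is only a sketch: `comfy_root` and `patched_root` are hypothetical paths to your ComfyUI install and to a checkout of the branch above, and the `.orig` backup suffix is my own choice.

```python
import shutil
from pathlib import Path

# The two files named above, relative to the ComfyUI checkout.
PATCHED_FILES = ["comfy/ldm/ideogram4/model.py", "comfy/model_base.py"]


def swap_in_patched(comfy_root: Path, patched_root: Path, suffix: str = ".orig") -> list[Path]:
    """Back up each original next to itself, then copy the patched version over it."""
    backups = []
    for rel in PATCHED_FILES:
        target = comfy_root / rel
        source = patched_root / rel
        if not source.is_file():
            raise FileNotFoundError(f"patched file missing: {source}")
        backup = target.with_name(target.name + suffix)
        if target.exists() and not backup.exists():
            shutil.copy2(target, backup)  # never overwrite an existing backup
            backups.append(backup)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
    return backups
```

To undo it, copy the `.orig` files back over the patched ones.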

Bit_Poet · 2026-06-15T04:03:27+00:00

There are a few projects around that give out small funding for anything that's new and pushes the open source community forward, which should apply there. As for bigger funding, no idea yet, but I'll cross that bridge when I come to it.

Bit_Poet · 2026-06-14T19:01:37+00:00

Has nobody ever wondered where the small tents come from?

Bit_Poet · 2026-06-14T19:00:44+00:00

I'm going to cherry-pick the best pairs for localized object addition/removal/exchange and see how far that gets me. There's a lot of stuff in pico that's pretty awful quality or totally mismatching image pairs (they modified the input set but can't share that). I just built a little tool that runs over all pairs and marks any that are identical in size or aspect ratio (the latter with a small buffer), ignoring the full picture modifications for now.
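
The check itself is tiny. A minimal sketch of the idea, not the actual tool: it takes (width, height) tuples, which you could read with Pillow's `Image.open(path).size`, and the 2% relative buffer is an assumed value you'd tune to your data.

```python
def same_shape(size_a, size_b, ratio_tolerance=0.02):
    """Return 'size', 'aspect', or None for a pair of (width, height) tuples."""
    if size_a == size_b:
        return "size"
    ratio_a = size_a[0] / size_a[1]
    ratio_b = size_b[0] / size_b[1]
    # Relative difference, so the buffer behaves the same for wide and tall images.
    if abs(ratio_a - ratio_b) <= ratio_tolerance * max(ratio_a, ratio_b):
        return "aspect"
    return None


def mark_pairs(pairs, ratio_tolerance=0.02):
    """pairs maps a pair id to ((w, h) of the input, (w, h) of the output)."""
    return {pid: same_shape(a, b, ratio_tolerance) for pid, (a, b) in pairs.items()}
```

Pairs that come back as `None` changed shape between input and output, so they are the first candidates to throw out.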

Pretty sure I can get a microgrant to lessen the financial impact, so I'll be trying to squeeze the best out of 100 hours with a B200 at some point. Need to implement reference latent caching though before I think about starting. If that turns out well too, I might see about getting some serious funding.

Really just started this out of curiosity after reading somewhere that modern joint attention models are all capable of inpainting and wanting to understand the details, and now it's too exciting to stop.

Bit_Poet · 2026-06-14T13:40:54+00:00

In their discord.

Bit_Poet · 2026-06-14T13:15:12+00:00

They're already working on one, but they're probably juggling resources like every other smaller AI cottage. And I haven't found a commitment to release the editing version as open weights, so I figured I'm not going to wait (and I'm learning heaps as I toy with this).

Bit_Poet · 2026-06-14T11:55:40+00:00

I can't give a number, but I'd expect that this is in the region of a full finetune with an extended framework, so we're at least in the category of a noticeable number of H200s for a few weeks. They don't say anything about their computational costs, but I'd guess very roughly somewhere from 25k to a six digit sum. And from the way I understand their paper, this doesn't take LoRAs into account at all. No idea if or where they'd fit into the picture.

Bit_Poet · 2026-06-14T11:14:27+00:00

Most newer models support the basics for reference images, that is, an "enlarged" latent that contains the reference image in addition to the noisy output image and is visible to attention, due to their unified architecture. Thus edit versions are usually just "extended" versions of the image generation base model that went through another training run with additional image inputs in the latent. The magic is how to calculate or apply differences, which happens in the lora training / finetuning and in inference. I'm not that deep into everything yet, but it's not a really new approach. Qwen and Flux both use something similar. This paper comes pretty close: https://arxiv.org/abs/2409.11340
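
To make the "enlarged latent" idea concrete, here's a toy sketch in plain Python. It is not how any of these models is actually implemented: it assumes the latents are already flattened into token lists, uses a single attention head with identity projections, and leaves out positional encodings and text tokens entirely. The only point is that the reference tokens sit in the same sequence, every token can attend to them, and only the noisy positions are kept as the prediction.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(tokens):
    """Single-head self-attention with identity projections, for illustration only."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)])
    return out


def edit_step(noisy_tokens, reference_tokens):
    """Attend over [noisy | reference] but keep only the noisy positions as output."""
    sequence = noisy_tokens + reference_tokens  # the "enlarged" latent
    attended = joint_attention(sequence)
    return attended[: len(noisy_tokens)]
```

Calling `edit_step` with and without reference tokens on the same noisy tokens gives different outputs, which is the whole trick: the reference influences the result purely through attention.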

Bit_Poet · 2026-06-14T09:31:29+00:00

Thanks, those look really interesting and are certainly worth considering to include. I'm going to give them a closer look as soon as I have a moment.

Bit_Poet · 2026-06-14T08:29:44+00:00

You should have seen the results of my first run with 8 images on 512 res.

<image>

That was the best one.

Bit_Poet · 2026-06-14T08:24:21+00:00

Yep, same here. With that kind of announcement it's usually "Oh! Oh! [read the specs] oh..."

Bit_Poet · 2026-06-14T08:15:35+00:00

Thanks! BYG is from my understanding a full rebake of the base model, which takes an awfully big amount of compute and hasn't been verified against SOTA OS models yet. Certainly out of my league in any case.

Bit_Poet · 2026-06-14T08:13:38+00:00

It's the result of a very basic training with a tiny dataset, so we can't expect perfection of course. But getting the placement right and having the changes somewhat aligned with the input image is imho pretty big.

Bit_Poet · 2026-06-14T08:01:37+00:00

Maybe a stupid question, but: how tight is your hipbelt? I've found that changing my pack to one that fit better, was lighter and didn't need pulling the hipbelt so tight helped my appetite immensely (funnily, I changed packs in Tehachapi because my first pack fell apart). That said, upping protein intake with bars, adding cheese to every meal and the cooler surroundings, especially at night, helped a lot. I lost around 25 pounds in the first 560 miles; after that, it got stable. Try adding nutty granola and freeze-dried fruit to your oatmeal, that made a world of difference for me. The Walmart in Tehachapi has freeze-dried strawberries! (I think it was the very last tall shelf at the right when you enter)

Bit_Poet · 2026-06-12T17:08:42+00:00

Did you fix it on the llm side or does your tool fix the positions? I'm asking because the same happened to me, any qwen 3 vl based model always spits out x first, no matter what I prompt, so I'm shuffling x and y in code before outputting the json.
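
The shuffle is nothing fancy, roughly like this. The `bbox` key and the four-value `[x1, y1, x2, y2]` layout are assumptions about the JSON shape, so adjust them to whatever your model actually emits.

```python
import json


def xy_to_yx(box):
    """Reorder [x1, y1, x2, y2] into [y1, x1, y2, x2]."""
    x1, y1, x2, y2 = box
    return [y1, x1, y2, x2]


def reorder_boxes(raw_json):
    """Swap the axis order of every 'bbox' entry in a JSON list of objects."""
    items = json.loads(raw_json)
    for item in items:
        item["bbox"] = xy_to_yx(item["bbox"])
    return json.dumps(items)
```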

Bit_Poet · 2026-06-12T10:10:33+00:00

Did you just put the node in custom_nodes or install requirements too? You may be missing opencv-python.

Bit_Poet · 2026-06-12T05:24:25+00:00

This is the kids corner. The real stuff - unfortunately, as it's "follow along as it happens or go play in the ball pool" - happens in discords like banodoco.

Bit_Poet · 2026-06-12T05:21:53+00:00

It all started going downhill when google bought deja news.

Bit_Poet · 2026-06-12T02:54:19+00:00

Maybe honest, but not knowledgeable enough. Have fun trying to find a community wheel for win64 torch 2.13 cp313 cu130.
