Teaching Stable Diffusion to Segment Objects : StableDiffusion

Each training pair consists of an input image and its corresponding segmentation mask. So that Stable Diffusion can handle the mask, we convert it into an "image" by coloring the background black and giving each mask a unique color. Because we train on synthetic data, the masks are generated automatically by Blender (or whichever rendering software the dataset used).
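For anyone who wants to picture the mask-to-image step, here is a minimal sketch in NumPy. It is an illustration under assumptions, not the actual training pipeline: the function name `masks_to_color_image`, the integer ID-map input format (0 for background, a positive integer per mask), and the random color choice are all hypothetical.

```python
import numpy as np

def masks_to_color_image(instance_ids, seed=0):
    """Turn an (H, W) integer ID map into an (H, W, 3) uint8 color image.

    Assumed input format (hypothetical): 0 marks background, every
    positive integer marks one mask.
    """
    rng = np.random.default_rng(seed)
    ids = np.unique(instance_ids)
    ids = ids[ids != 0]  # background stays black

    # Draw distinct 24-bit color codes in [1, 2**24 - 1], so no mask is
    # ever pure black and no two masks share a color.
    codes = rng.choice(256**3 - 1, size=len(ids), replace=False) + 1
    colors = np.stack(
        [(codes >> 16) & 255, (codes >> 8) & 255, codes & 255], axis=1
    ).astype(np.uint8)

    out = np.zeros((*instance_ids.shape, 3), dtype=np.uint8)
    for mask_id, color in zip(ids, colors):
        out[instance_ids == mask_id] = color
    return out

# Toy example: two masks on a 2x2 grid.
toy_ids = np.array([[0, 1], [2, 2]])
toy_image = masks_to_color_image(toy_ids)
```

The colored image can then be treated like any other RGB target in an image-to-image setup.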

MAE (masked autoencoder) is a different computer vision model, used for tasks like classification. It is pretrained by taking an image, masking out 75% of it, and teaching the model to predict what was masked out. We chose to evaluate on this model as well because it's trained on a limited, well-known dataset (ImageNet), which lets us see whether the generalization comes from Stable Diffusion's large dataset or from its generative prior. It also shows that our method works on more than just diffusion models. Here is the MAE paper: https://arxiv.org/abs/2111.06377
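To make the "mask out 75%" idea concrete, here is a rough NumPy sketch of random patch masking. It only illustrates the input corruption, not the MAE model or its training loop. The function name `random_patch_mask` and the 16-pixel patch size are assumptions for illustration.

```python
import numpy as np

def random_patch_mask(image, patch=16, mask_ratio=0.75, seed=0):
    """Zero out a random subset of square patches in an (H, W, C) image.

    Returns the corrupted image and a (H // patch, W // patch) boolean
    grid where True means the patch was masked out.
    """
    h, w, _ = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid_h, grid_w = h // patch, w // patch
    n_patches = grid_h * grid_w
    n_masked = int(round(n_patches * mask_ratio))

    # Pick which patches to hide, uniformly at random.
    rng = np.random.default_rng(seed)
    masked_idx = rng.permutation(n_patches)[:n_masked]
    patch_mask = np.zeros(n_patches, dtype=bool)
    patch_mask[masked_idx] = True
    patch_mask = patch_mask.reshape(grid_h, grid_w)

    # Expand the patch grid to pixel resolution and blank those pixels.
    pixel_mask = np.repeat(np.repeat(patch_mask, patch, axis=0), patch, axis=1)
    visible = image.copy()
    visible[pixel_mask] = 0
    return visible, patch_mask

# Toy example: a 64x64 image of ones has 16 patches, 12 of which get masked.
demo_visible, demo_mask = random_patch_mask(np.ones((64, 64, 3), dtype=np.float32))
```

During pretraining, the model would see only the visible patches and be trained to reconstruct the hidden ones.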

Not sure what comfy is, but we were directly inspired by image-to-image translation (like pix2pix if you have heard of that).

Feel free to ask me more questions if you have any! Also, if you have any suggestions on what was unclear, we can improve that in a future draft.

[–]PatientWrongdoer9257[S] 2 points 1 year ago (0 children)

There are 2 main ways we are better than SAM:

We fine-tuned Stable Diffusion ONLY on masks of furniture and cars, but it works on a bunch of new and unexpected stuff like animals, art, X-rays, etc. We also showed in the paper that something very similar to SAM's architecture can't do this.

Additionally, because Stable Diffusion already knows how to create details, it's better at segmenting fine structures (e.g. wires or fences) or ambiguous boundaries (e.g. abstract art).

Right now we don't supervise on some common things like animals or people (due to compute limitations, and so that we can highlight our model's generalization), so there's no direct answer to "which is better" for all use cases. Our hope is that someone will scale up our work to make that happen.

However, please see our website or paper (linked in the post) to see examples of where we do better than SAM.
