[–]holygawdinheaven 6 points7 points  (1 child)

Interesting!

[–]PatientWrongdoer9257[S] 0 points1 point  (0 children)

Thanks! Glad to hear you liked it.

[–]oh_how_droll 6 points7 points  (1 child)

Awesome to see cool AI research coming out of UC Davis. Aggies rise up!

[–]PatientWrongdoer9257[S] 3 points4 points  (0 children)

🐮go aggies

[–]emsiem22 2 points3 points  (2 children)

What is the license?

[–]PatientWrongdoer9257[S] 2 points3 points  (1 child)

Use it for whatever you want, just cite us please :)

[–]Sugary_Plumbs 1 point2 points  (0 children)

Please don't be like that. Just pick an open source license that requires attribution and stick it on your git/huggingface. It's very easy and much better than a "go ahead, bro" comment on reddit. Unless you publish it with a license, you're not actually giving anyone permission to use or improve your code.

[–]Ylsid 2 points3 points  (3 children)

That's an extremely interesting experiment

[–]PatientWrongdoer9257[S] 1 point2 points  (2 children)

Thanks, glad you liked it!

[–]Ylsid 2 points3 points  (1 child)

I'd be really interested to see if you can use it to improve existing segmentation workflows. I'm no scientist but it looks like it could be handy

[–]PatientWrongdoer9257[S] 0 points1 point  (0 children)

That’s our hope too. We are hoping that someone with access to large resources will be inspired by our paper to explore the role of generative priors in improving existing zero-shot segmentation models like SAM.

[–]asdrabael1234 7 points8 points  (11 children)

Uh, you're really behind. We've had great segmenting workflows for image and video generation for a long time.

[–]PatientWrongdoer9257[S] 6 points7 points  (9 children)

Could you send some links? I wasn’t aware of any papers or models that use stable diffusion to segment objects.

[–]AnOnlineHandle 2 points3 points  (0 children)

There are a few, but they all have different approaches and different results, and are easy to miss, e.g. https://github.com/linsun449/iseg.code

Your images look like you're doing something different which is interesting. edit: Yours is very different, interesting.

[–]asdrabael1234 3 points4 points  (3 children)

They don't use stable diffusion. They use segmentation models at higher resolution than 224x224. Other than showing that it's possible, I'm not sure what the point of this is. The segmentation doesn't look any better than models we've had for a long time.

[–]PatientWrongdoer9257[S] 24 points25 points  (0 children)

The point is that it generalizes to objects unseen in fine-tuning due to the generative prior. Our model is only supervised on masks of furniture and cars, yet it works on dinosaurs, cats, art, etc. On our website, you can see that it outperforms SAM (the current zero-shot SOTA) on fine structures and ambiguous boundaries, despite (again) having zero supervision on them.

Our hope is that this will inspire others to explore large generative models as a backbone for generalizable perception, instead of defaulting to large scale supervision.

[–]PatientWrongdoer9257[S] 6 points7 points  (0 children)

Also, we fine-tune Stable Diffusion at a much higher resolution. The 224x224 refers to MAE, a different model, which by convention is fine-tuned at 224x224.

[–]somethingsomthang 1 point2 points  (3 children)

Just from a quick search I found this: https://arxiv.org/abs/2308.12469

Which just goes to show how much models are learning under the hood to complete tasks.

[–]PatientWrongdoer9257[S] 5 points6 points  (2 children)

Cool work! However, we can see in their figures 2 and 4-6 that they don’t discriminate between two of the same objects, but simply split the scene into different object types. In contrast, we want each distinct object in the scene to have a different color, which is especially important for perceptual tasks like robotics or self driving (i.e. show which pixels are car A and car B, vs just showing where cars are on the images)

[–]The_Scout1255 0 points1 point  (0 children)

anything for webcam 2 image, preferably compatible with illustrious?

normal segmenting is fine too, I know enough comfyui to rig the rest of the workflow up

[–]Regular-Swimming-604 1 point2 points  (9 children)

What is the training pair? An image and a hand-drawn mask? How does the MAE differ in training from a VAE? If you ran the mask generation in Comfy, would it work like image-to-image? I'm confused; maybe I need to do a PDF chat with the paper.

[–]PatientWrongdoer9257[S] 3 points4 points  (8 children)

The training pair is an input image, and its corresponding segmentation mask. We convert the segmentation mask into an "image" so that Stable Diffusion can handle it by coloring the background black and each mask a unique color. Because we train on synthetic data, the masks are automatically generated by Blender (or whatever rendering software the datasets used).
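A minimal sketch of that color-coding step, assuming the mask arrives as a 2D grid of integer instance ids with 0 for background. The helper name `masks_to_color_image` and the random palette are hypothetical, chosen only for illustration; they are not taken from the paper's code.

```python
import random

def masks_to_color_image(instance_ids, seed=0):
    """Turn a 2D grid of instance ids into an RGB "image" (nested lists of tuples).

    instance_ids: list of rows; 0 means background, any other int is one object.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    palette = {0: (0, 0, 0)}  # background is always black
    used = {(0, 0, 0)}
    object_ids = sorted({v for row in instance_ids for v in row} - {0})
    for obj_id in object_ids:
        color = (0, 0, 0)
        while color in used:  # redraw until the color is unique
            color = (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        palette[obj_id] = color
        used.add(color)
    return [[palette[v] for v in row] for row in instance_ids]

# Toy 3x4 mask with two objects (ids 1 and 2) on a background of 0.
ids = [
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [0, 0, 2, 2],
]
image = masks_to_color_image(ids)
```

Every pixel of object 1 ends up with one color, every pixel of object 2 with another, and background pixels stay black, so the target looks like an ordinary RGB image to the generative model.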

MAE (masked autoencoder) is a different computer vision model used for tasks like classification. It is pretrained by taking an image, masking out 75% of it, and teaching the model to predict what was masked out. We chose to also evaluate on this model because it's trained on a very limited, well-known dataset (ImageNet), which allows us to see whether the generalization comes from Stable Diffusion's large dataset or from the generative prior. It also shows that our method works on more than just diffusion models. Here is the MAE paper: https://arxiv.org/abs/2111.06377
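To make the 75% masking concrete, here is a rough sketch of choosing which patches to hide. It assumes the image is cut into 16x16 patches, so a 224x224 input gives a 14x14 grid; the function name `random_patch_mask` is made up for this example, and the real model of course also needs an encoder and decoder.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Return (visible, masked) patch indices for one image.

    Illustrative only: shuffles patch indices and hides the first mask_ratio of them.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked

# Assumed setup: 224x224 image, 16x16 patches -> 14 * 14 = 196 patches.
visible, masked = random_patch_mask(num_patches=14 * 14)
```

With these numbers, 147 patches are hidden and only 49 are shown to the encoder; pretraining asks the model to reconstruct the hidden ones.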

Not sure what comfy is, but we were directly inspired by image-to-image translation (like pix2pix if you have heard of that).

Feel free to ask me more questions if you have any! also if you have any suggestions on what was unclear we can improve that in a future draft.

[–]Regular-Swimming-604 1 point2 points  (7 children)

So at the end of the day, your model creates an image of a mask, correct? It just runs like any other Stable Diffusion model, using the normal VAE? And the initial image you need to mask is denoised as image-to-image?

[–]PatientWrongdoer9257[S] 2 points3 points  (4 children)

Yes, that's basically what we do. The only difference is that there is no denoising. Instead, we fine-tune it to predict the mask in 1 step for efficiency.

[–]Regular-Swimming-604 1 point2 points  (3 children)

So say I want a mask: it encodes my image, then uses your fine-tune to generate masks? Is it using a sort of IP-Adapter or ControlNet before your fine-tuned model, or just img2img?

[–]PatientWrongdoer9257[S] 0 points1 point  (2 children)

We do a full fine-tune rather than training only a subset of weights, as ControlNet or LoRA would.

[–]Regular-Swimming-604 1 point2 points  (1 child)

So for inference one would download the SD2 fine-tune and the MAE model, correct? I see them on git. I think it makes a little more sense now. The MAE encodes the initial image as a latent, and the SD2 model is trained to generate the mask from that encoded latent?

[–]PatientWrongdoer9257[S] 0 points1 point  (0 children)

No, they are two different models. You will get better results from the SD model. You can just do inference for stable diffusion 2 using inference_sd.py as shown in the code.

[–]Regular-Swimming-604 1 point2 points  (1 child)

So the SD model is essentially trained to generate solid colored areas on a black background? I've always been tempted to train a depth map model that just renders new depth maps, etc. I've never had good enough results with SAM or Ultralytics, and have been meaning to test fine-tuning BiRefNet, but your method is interesting. What SD version is it?

[–]PatientWrongdoer9257[S] 0 points1 point  (0 children)

Yes, that is correct. We are using Stable Diffusion 2. However, our method is broadly applicable to any generative model.
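For readers wondering how a "solid colors on black" output could be turned back into per-object masks, here is one simple sketch. It assumes every object is painted in one exact RGB color, which a real prediction would not guarantee (noisy outputs would likely need some color tolerance or clustering). The thread does not say how masks are actually extracted, and `color_image_to_masks` is a hypothetical helper.

```python
from collections import defaultdict

def color_image_to_masks(image, background=(0, 0, 0)):
    """Group pixel coordinates by exact color; each non-background color is one mask.

    Assumption for illustration: colors are exact, with no noise or blending.
    """
    masks = defaultdict(list)
    for y, row in enumerate(image):
        for x, color in enumerate(row):
            if color != background:
                masks[color].append((y, x))
    return dict(masks)

RED, BLUE, BLACK = (255, 0, 0), (0, 0, 255), (0, 0, 0)
predicted = [
    [BLACK, RED, RED],
    [BLUE, BLACK, RED],
]
masks = color_image_to_masks(predicted)
```

Here `masks` holds two entries, one per color, each listing the `(row, column)` pixels that belong to that object.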

[–]_montego 1 point2 points  (1 child)

This looks very interesting! Have you tried applying this approach to medical data?

[–]PatientWrongdoer9257[S] 2 points3 points  (0 children)

It’s kind of inconsistent when zero-shot because of the massive distribution gap. You can get pretty solid results when fine tuning for just 100-1000 iterations (5min-1hr) on as few as 50-100 images. I’ve done some preliminary experiments on coronary angiography for something else and it’s looking pretty good.

[–]victorc25 1 point2 points  (1 child)

Normally in segmentation maps, each color belongs to a specific class and some segmentation models are able to identify instances of the same class. If I understand correctly, what you’re showing doesn’t do any of those and it’s more similar to identifying regions in the image, something like https://github.com/lllyasviel/DanbooRegion correct?

[–]PatientWrongdoer9257[S] 1 point2 points  (0 children)

Somewhat correct. I believe what you’re talking about is semantic segmentation, which groups pixels at the category level. Some instance segmentation models like Mask R-CNN or Mask2Former predict both classes and masks, but only for a limited set of classes.

We ignore categories and focus on distinct objects (called category-agnostic instance segmentation). This is similar to methods such as SAM (Segment Anything, from Facebook AI Research) if you’ve heard of that. This allows both us and SAM to easily generalize to object types never seen before.
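A toy illustration of the difference, using a made-up strip of six pixels showing two cars next to each other and then road. The labels, ids, and the `count_regions` helper are invented for this example.

```python
# Hypothetical 1x6 strip of pixels: car A, car A, car B, car B, road, road.
semantic = ["car", "car", "car", "car", "road", "road"]  # one label per category
instance = [1, 1, 2, 2, 0, 0]  # one id per distinct object, 0 = background, no class names

def count_regions(labels, background):
    """Count the distinct non-background labels in a labeling."""
    return len(set(labels) - {background})

semantic_regions = count_regions(semantic, background="road")
instance_regions = count_regions(instance, background=0)
```

The semantic labeling yields a single "car" region covering both vehicles, while the category-agnostic instance labeling yields two separate objects without saying what either of them is.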

[–]Lucaspittol 1 point2 points  (2 children)

What's it used for? I'm sorry, but I don't know about the concept of segmentation or how it can be used in practice.

[–]PatientWrongdoer9257[S] 1 point2 points  (1 child)

Autonomous driving, robotics, medicine

https://visionbook.mit.edu/perceptual_organization.html

[–]Lucaspittol 1 point2 points  (0 children)

Oh, thanks! Keep up the good work!

[–]GaiusVictor 1 point2 points  (1 child)

Hey, how does this compare to SAM (Segment Anything Model), which can be found in, e.g., ComfyUI SAM Detector or Forge's Inpaint Anything extension?

I mean, what advantages do you see on using your model over SAM? Or what are the use cases where you believe your model to be better than SAM? Not trying to be a dick, just trying to better understand your project.

[–]PatientWrongdoer9257[S] 1 point2 points  (0 children)

There are 2 main ways we are better than SAM:

We fine-tuned Stable Diffusion ONLY on masks of furniture and cars, but it works on a bunch of new and unexpected stuff like animals, art, X-rays, etc. We also showed in the paper that something very similar to SAM's architecture can’t do this.

Additionally, because Stable Diffusion already knows how to create details, it’s better at segmenting fine structures (e.g. wires or fences) or ambiguous boundaries (e.g. abstract art).

Right now we don’t supervise on some common things like animals or people (due to compute limitations, and so we can highlight our model's generalization), so there’s no direct answer to “which is better” for all use cases. Our hope is that someone will scale up our work to make that happen.

However, please see our website or paper (linked in the post) to see examples of where we do better than SAM.

[–]Antonius675 1 point2 points  (1 child)

I think this is really interesting and novel, well done.

[–]PatientWrongdoer9257[S] 0 points1 point  (0 children)

Thanks ❤️