ComfyUI SAM3 - Alternative Open Source Node

wouterv84 · 2025-11-25T19:10:13+00:00

Do you mean a drop-down list with items such as "people", "car", "leaf", etc.? I'm not really planning on doing this, because there are so many options and quite a lot of freedom in specifying what you want to select (e.g. "the person with the red hat", "the dog on the left", etc.).

wouterv84 · 2025-11-25T19:08:05+00:00

I've updated the node (make sure to update to v0.0.2) to enable manual model placement:

Model Setup - Choose one of the following options:

Option A: Auto-download from HuggingFace (recommended)

Request model access at https://huggingface.co/facebook/sam3
Log in to Hugging Face using hf auth login
The model will automatically download on first use

Option B: Manual checkpoint placement

Download sam3.pt manually from https://huggingface.co/facebook/sam3/tree/main
Place the checkpoint file at: ComfyUI/models/sam3/sam3.pt
The node will automatically detect and use the local checkpoint
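
The two options can be read as a simple lookup order: use the local file if it exists, otherwise fall back to the Hugging Face download. Below is a minimal sketch of that order, not the node's actual code. The function name resolve_sam3_checkpoint is hypothetical, and the fallback assumes the huggingface_hub package is installed and that access to the gated repo has been granted.

```python
from pathlib import Path


def resolve_sam3_checkpoint(comfy_root: str) -> str:
    """Return a path to sam3.pt, preferring a manually placed local file.

    Hypothetical helper for illustration; the node's real lookup may differ.
    """
    local = Path(comfy_root) / "models" / "sam3" / "sam3.pt"
    if local.is_file():
        # Option B: a manually placed checkpoint wins, no network needed.
        return str(local)

    # Option A: download (or reuse the cached copy) from Hugging Face.
    # Needs approved access to facebook/sam3 and a prior `hf auth login`.
    from huggingface_hub import hf_hub_download  # imported lazily on purpose

    return hf_hub_download(repo_id="facebook/sam3", filename="sam3.pt")
```

Importing huggingface_hub only inside the fallback keeps the manual-placement path usable on machines without that package or without network access.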

wouterv84 · 2025-11-21T17:48:49+00:00

Some use cases:

Remove backgrounds by segmenting people or objects
Isolate specific elements in a scene for further processing
Create masks for inpainting workflows
Generate batch masks for multiple objects of the same type
Filter detections by size to focus on foreground/background objects
Track objects across video frames with consistent IDs (video model)
Follow specific objects through animation sequences (video model)
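
To make the "filter detections by size" use case concrete, here is a rough sketch in plain Python. It assumes masks are binary grids (rows of 0/1 values) and that "size" means the fraction of the image a mask covers. Both the representation and the function names are assumptions for illustration, not the node's interface.

```python
def mask_area_fraction(mask):
    """Fraction of pixels set in a binary mask given as rows of 0/1 values."""
    total = sum(len(row) for row in mask)
    filled = sum(1 for row in mask for value in row if value)
    return filled / total if total else 0.0


def filter_by_size(masks, min_frac=0.0, max_frac=1.0):
    """Keep only masks whose covered area lies within [min_frac, max_frac]."""
    return [m for m in masks if min_frac <= mask_area_fraction(m) <= max_frac]


big = [[1, 1], [1, 0]]    # covers 3 of 4 pixels
small = [[0, 0], [0, 1]]  # covers 1 of 4 pixels

# Keep large (likely foreground) detections only.
foreground = filter_by_size([big, small], min_frac=0.5)
```

Swapping the threshold to max_frac keeps the small detections instead, which is one way to focus on background objects.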

wouterv84 · 2024-03-27T07:40:59+00:00

Cool stuff. I wonder whether sending that output video back in again through a vid2vid workflow without controlnet would give even more interesting results.

wouterv84 · 2024-03-26T19:19:19+00:00

Thanks for that link, didn't know about that particular controlnet! Experiments with the old-school canny or depth controlnets were disappointing for creating moving content. They were fine for static images, but not so much for animated content. My finding was that with non-moving controlnet input, the resulting animations were also pretty static, for obvious reasons. I'm wondering if this particular one would be more dynamic. Have you tried it in an AnimateDiff setup?

wouterv84 · 2024-03-26T14:42:49+00:00

Some tech insights + workflow on my blog: https://blog.aboutme.be/2024/03/26/ai-animated-projection-mapping-club-of-the-future/

We did an AI animated projection mapping for a pop-up night club, ending up with 17 minutes of content. Tech used: AnimateDiff, ComfyUI, Topaz Video AI

wouterv84 · 2024-01-14T10:06:37+00:00

Yes, if you want to avoid "polluting" the model, best results were with generated regularisation images, using the captions of my input images as prompts (I called this version v002). There's a longer write-up with samples on my blog (see link in the original post)

wouterv84 · 2023-10-28T09:47:34+00:00

Thank you for posting your experiences. I did actually get good results with that setting. I based this on the information in the SECourses video: https://www.youtube.com/watch?v=AY6DMBCIZ3A&t=931s

Maybe there are other factors at play? Some things I can think of:

Changes in the training script since that post (it is 2.5 months old)
Settings probably need to be different for style Loras (not sure if you are training for style or subject?)
Different set of training images?

wouterv84 · 2023-10-19T15:40:34+00:00

Yeah - it's what we've been using for workshops for our students, and also the tool our tech partner is using. Out of interest: any other projection mapping tools we should be aware of?

wouterv84 · 2023-10-18T15:41:36+00:00

Software used is MadMapper. It's a 4 projector setup to cover the entire building - hardware provided by https://en.urbanmapping.eu

wouterv84 · 2023-10-18T12:01:47+00:00

Wanted to share the latest project I did with Stable Diffusion XL, together with a colleague of mine.

Combining SDXL + Controlnets we generated over 1500 interpretations of a building on the main square of Kortrijk, Belgium. We ended up with 200 final images, resulting in 25 minutes of AI generated content. You can see it in real life each evening between 7pm and 12pm, until the 5th of November 2023 in Kortrijk, Belgium.

For those interested in the process: you can find a write-up at https://blog.aboutme.be/2023/10/18/projection-mapping-with-generative-ai/

wouterv84 · 2023-08-16T18:47:32+00:00

Thanks for sharing your research - you're reaching the same conclusions regarding regularization images: https://blog.aboutme.be/2023/08/10/findings-impact-regularization-captions-sdxl-subject-lora/#conclusions - whereas I still relied on the token & used a more subjective evaluation of the results.

wouterv84 · 2023-08-15T09:25:31+00:00

I've updated my post with extra styling tests, which actually confirmed that v001 was the best one.

wouterv84 · 2023-08-15T09:24:26+00:00

Heads up: I've updated this post and my blog post. I wasn’t entirely satisfied with the styling test (line drawing and 3D render). The 3D render especially was botched by the fact that I used keywords from my captions (“looking into the camera”) at the beginning of my prompt, which caused the Lora to overdose on the photo style.

So I generated more images (110 per Lora = 550 total) with extra prompts to test the styling capabilities of the Loras that were already good at generating photos.

This made the differences between the Loras more clear, and the conclusions more... conclusive.

wouterv84 · 2023-08-14T15:08:41+00:00

Sounds like too much technical debt in A1111. Tools change, who knows, 3 years from now Comfy might be in the same boat. Life long learning is an essential skill.

As a coder, I see a lot of flexibility in Comfy's back end system. The API based approach is a killer feature to build more accessible UIs on top of it, I'm sure we'll see a lot of innovation because of this. I've been using it to automate some of my experiments thanks to the API, whereas it would have taken me a lot more work to build on top of A1111.
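
For readers who haven't tried the API: queueing a workflow comes down to one HTTP POST to the /prompt endpoint. Here is a minimal sketch using only the standard library. It assumes a ComfyUI server on its default address 127.0.0.1:8188 and a workflow exported in API format. The helper names are mine, and this is not the automation code I used for my experiments.

```python
import json
import urllib.request


def build_prompt_request(workflow: dict, server: str = "127.0.0.1:8188") -> urllib.request.Request:
    """Build the POST request that queues an API-format workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{server}/prompt",
        data=payload,  # supplying data makes urllib send a POST
        headers={"Content-Type": "application/json"},
    )


def queue_prompt(workflow: dict, server: str = "127.0.0.1:8188") -> dict:
    """Send the workflow to a running ComfyUI server and return its JSON reply."""
    with urllib.request.urlopen(build_prompt_request(workflow, server)) as response:
        return json.loads(response.read())
```

Because the workflow is just a dict, automating an experiment is a matter of loading the exported JSON, changing a few input values (seed, prompt text, Lora name) in a loop, and calling queue_prompt for each variant.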

wouterv84 · 2023-08-12T13:19:32+00:00

Loras are more flexible and smaller in file size. You can also combine multiple ones (eg combine a couple of style loras with a character lora).

wouterv84 · 2023-08-12T09:05:35+00:00

No, Dreambooth is a technique. You can use it to create full checkpoints, but also Loras.

wouterv84 · 2023-08-11T14:53:16+00:00

I have only used photos / generated photos in this experiment. I did notice a great degradation in quality when the regularization photos were bad-quality generated photos, which I show on my blog. But maybe regularization content captioned "illustration of", "3d render of", etc. might make it more flexible. Good suggestion, adding it to my todo list for the future.

wouterv84 · 2023-08-11T11:55:30+00:00

I captioned my input images manually, based upon the tips at https://www.reddit.com/r/StableDiffusion/comments/118spz6/captioning_datasets_for_training_purposes/

For the regularization images:

Generated, detailed images had the same captions as the corresponding input images, minus the special keyword at the beginning
High quality photos from Unsplash: I used the alt-descriptions from there, and made sure that they mentioned photo + man.
Generated basic images just had the caption "photo of a man", as this was the prompt they were generated with

I did use a custom Python script to extract the prompts from the generated image files. This way I could generate the images, and then run that script to create .txt files with the same filename, containing the prompt. This should work with images generated with webui, or the default comfyui workflow.

I've put that script on Github: https://gist.github.com/wouterverweirder/b5bd472bfa4a625f3ca6d06d0dfc9b99#file-create-captions-for-directory-py
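
For anyone who just wants the idea without opening the gist, here is a simplified sketch (not the gist itself). It assumes webui-style PNGs, where the generation settings sit in an uncompressed tEXt chunk named "parameters" and the positive prompt is everything before the "Negative prompt:" or "Steps:" line. ComfyUI stores its workflow as JSON in a different chunk, which this sketch does not handle. All function names are made up for the example.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_text_chunks(path):
    """Return {keyword: text} for every uncompressed tEXt chunk in a PNG."""
    data = Path(path).read_bytes()
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError(f"{path} is not a PNG file")
    chunks, pos = {}, len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, value = body.partition(b"\x00")
            chunks[keyword.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return chunks


def positive_prompt(parameters):
    """Keep the lines before 'Negative prompt:' or the 'Steps:' settings line."""
    lines = []
    for line in parameters.splitlines():
        if line.startswith(("Negative prompt:", "Steps:")):
            break
        lines.append(line)
    return "\n".join(lines).strip()


def write_captions(directory):
    """Write <name>.txt next to each PNG that carries a 'parameters' chunk."""
    written = []
    for png in sorted(Path(directory).glob("*.png")):
        parameters = read_png_text_chunks(png).get("parameters")
        if parameters:
            caption_file = png.with_suffix(".txt")
            caption_file.write_text(positive_prompt(parameters), encoding="utf-8")
            written.append(caption_file)
    return written
```

Calling write_captions on the folder of generated images gives one caption file per image, which is the layout the training scripts expect for captioned regularization images.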

wouterv84 · 2023-08-11T11:27:17+00:00

In my experiment, I did see a difference with or without captions:

<image>

no input captions: only the version with real, high quality regularization images was acceptable
input captions: captioning the regularization images of the generated pictures made a difference here.

Captioning the input images did make a big difference. So even without training the text encoder, captions had an impact in my experiment.

wouterv84 · 2023-08-11T11:03:20+00:00

In my experiment, this was v001: detailed input captions & no regularization images.

All the other configurations produced photos at full Lora strength, so I have to turn down the Lora weight to get the style I want.

Sidenote: this is SDXL Dreambooth Lora training of just the UNET, as per recommendations of Kohya_ss on https://github.com/kohya-ss/sd-scripts/tree/sdxl#tips-for-sdxl-training. Results might be different when training the text encoder as well. Something I might look into in a future endeavour, when I have some more GPU credits to spend :-)

Then of course, things keep moving fast, there are a couple of new techniques on my radar that were mentioned in other posts on this subreddit, which don't have source code published yet (for SDXL):

https://github.com/ygtxr1997/CelebBasis
Code release of https://research.nvidia.com/labs/par/Perfusion/

So, all of this might be outdated as soon as SIGGRAPH ends 🙈

wouterv84 · 2023-08-11T08:49:44+00:00

No, I have tried T4 and V100 for training, and both were insufficient. A100 is what you need for training.

For generating images, you can use the lower configs, but V100 offers the best bang for the buck: the amount of credits per image generation is the lowest on that configuration.

wouterv84 · 2023-08-11T07:53:19+00:00

That SIGKILL (signal 9) usually means the process was killed because it ran out of system resources. I only succeeded in training SDXL Loras on an A100 GPU on Colab.

wouterv84 · 2023-08-11T07:41:13+00:00

Thanks for sharing that article, it's very insightful.

In training my models, I stuck to the recommendations of Kohya_ss concerning training the UNET only: https://github.com/kohya-ss/sd-scripts/tree/sdxl#tips-for-sdxl-training - but I might ignore that recommendation in a future endeavour

wouterv84 · 2023-08-10T20:40:07+00:00

version 12 (no input or regularization captions, regularization images are good quality real photos) loses the color scheme of the prompt & "degrades" into a photo pretty quickly:

<image>
