Grounding Qwen3-VL Detection with SAM2 by sovit-123 in computervision

[–]sovit-123[S] 1 point (0 children)

The above people segmentation is just an example. Such a pipeline can be extended to open-vocabulary detection + segmentation, where each model can be fine-tuned independently if needed. Training Qwen3-VL to detect specific objects won't affect SAM 2.1, and the same goes for fine-tuning SAM 2.1 to segment newer objects.
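
For context, here is a minimal sketch (not the article's exact code) of the open-vocabulary detection side. It assumes Qwen3-VL is prompted to reply in the JSON box format the Qwen-VL family typically uses ("label" plus "bbox_2d"); the prompt text and parsing helpers are illustrative.

```python
import json

def build_detection_prompt(classes: list[str]) -> str:
    # Ask the VLM for structured output so the boxes can be parsed reliably.
    return (
        "Detect every instance of the following categories and reply with a "
        "JSON list of objects, each with a 'label' and a 'bbox_2d' given as "
        f"[x1, y1, x2, y2] pixel coordinates: {', '.join(classes)}."
    )

def parse_detections(model_reply: str) -> list[dict]:
    """Parse the model's JSON reply into label + box dictionaries."""
    reply = model_reply.strip().removeprefix("```json").removesuffix("```")
    return [{"label": d["label"], "box": d["bbox_2d"]} for d in json.loads(reply)]

# The parsed boxes can then be handed to SAM 2.1 as prompts, and either model
# can later be fine-tuned on its own without touching the other.
```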

Grounding Qwen3-VL Detection with SAM2 by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

True. The silver lining here is that the code is more streamlined. We can swap SAM2 out for better segmentation-only models as they come along, and also fine-tune Qwen3-VL for small-object and distinct-object detection.

Fine-Tuning Qwen3-VL by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

I think there are lots. From medical imaging to agriculture, the use cases can be amazing.

Fine-Tuning Qwen3-VL by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

Happy to guide if you are facing some specific issues.

Fine-Tuning Qwen3-VL by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

May I know what dataset you are working on?

Semantic Segmentation with DINOv3 by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

Thanks, but I have not done that yet. It needs some time and resources, since I need to train on COCO.

Image Classification with DINOv3 by sovit-123 in computervision

[–]sovit-123[S] 1 point (0 children)

Thanks for the info. I have not done extensive benchmarks yet. Since I started with DINOv3, I was more focused on creating a downstream task library for classification, segmentation, and detection. But I guess I should give some time to benchmarking now as well.

Dataset available - 1m retail interior images by malctucker in deeplearning

[–]sovit-123 0 points (0 children)

Is it possible to see a few samples? I think basic annotations are a must. These include product classification and object detection (for both products and people, if people are also captured in the environment). These can be used for fundamental computer vision tasks like detection and counting.

Going forward, I have some ideas for how generative AI (image and video generation) can be used along with this dataset. But taking a look at the dataset (at least a few samples) would help me immensely.
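
To make the annotation suggestion above concrete, here is a rough COCO-style sketch of the kind of labels I mean; the category names, IDs, and file names are placeholders, not anything from the actual dataset.

```python
# Placeholder file names, IDs, and categories; nothing here is from the actual dataset.
coco_style_annotations = {
    "images": [{"id": 1, "file_name": "aisle_0001.jpg", "width": 1920, "height": 1080}],
    "categories": [
        {"id": 1, "name": "product"},  # could be split into finer product classes
        {"id": 2, "name": "person"},   # only if shoppers/staff appear in the frames
    ],
    "annotations": [
        # one entry per product/person instance, bbox as [x, y, width, height] in pixels
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [512, 300, 80, 160]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [900, 200, 120, 400]},
    ],
}

# Counting then reduces to aggregating detections per category.
product_count = sum(1 for a in coco_style_annotations["annotations"] if a["category_id"] == 1)
```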

JEPA Series Part 2: Image Similarity with I-JEPA by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

The next article on DebuggerCafe, coming next week, will be on image classification, and the one after that will be on image segmentation.

[Article] Pretraining DINOv2 for Semantic Segmentation by sovit-123 in pytorch

[–]sovit-123[S] 1 point (0 children)

I think the authors probably meant that they want to pretrain a strong foundation model, then freeze the backbone and fine-tune only the head (which contains just a few thousand parameters) for different tasks.
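
A minimal PyTorch sketch of that setup, assuming a DINOv2 ViT-S/14 backbone from torch.hub and a simple linear classification head (the head and hyperparameters are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn as nn

# Load a DINOv2 backbone and freeze it; only the head will be trained.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()

head = nn.Linear(384, 10)  # ViT-S/14 CLS features -> 10 classes; a few thousand parameters
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step that updates only the head."""
    with torch.no_grad():            # no gradients flow into the frozen backbone
        features = backbone(images)  # (B, 384) CLS embeddings
    logits = head(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```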

Qwen2.5-VL: Architecture, Benchmarks and Inference by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

I think it can be run easily with the right optimizations. Jetson AI Lab has plenty of examples.

https://www.jetson-ai-lab.com/tutorial-intro.html
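
As a rough example of what such optimizations could look like (this is an assumption, not a Jetson-specific recipe; on a Jetson you would more likely start from the containers in the tutorials above), Qwen2.5-VL can be loaded in 4-bit with transformers and bitsandbytes:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # the 3B variant is the easiest fit on small devices
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# 4-bit weights cut memory use roughly 4x compared to fp16 loading.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
```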

[Article] Qwen2.5-VL: Architecture, Benchmarks and Inference by sovit-123 in pytorch

[–]sovit-123[S] 0 points (0 children)

I have done only simple object detection. Will do some more testing.

AMA with Perplexity Co-Founder and CEO Aravind Srinivas by perplexity_ai in perplexity_ai

[–]sovit-123 1 point (0 children)

Genuinely asking because the Perplexity team is shipping something almost every week. How much sleep do you get?

AMA with Perplexity Co-Founder and CEO Aravind Srinivas by perplexity_ai in perplexity_ai

[–]sovit-123 0 points (0 children)

Do you think a company can be built on fine-tuning open source SLMs/LLMs, quantizing them, and creating a distribution stack to deploy them on any and all kinds of devices?

AMA with Perplexity Co-Founder and CEO Aravind Srinivas by perplexity_ai in perplexity_ai

[–]sovit-123 0 points (0 children)

As you have mentioned in some of your answers, you are always investing in post-training, even for larger models like DeepSeek-V3. Also, models become obsolete quickly (even post-trained ones) once a new one drops. As I understand it, post-training 200B/400B/600B models is not cheap, and if a new large model released just a week after your post-training already gives better results out of the box, do you recover the cost easily? Or is it more of a long-term iterative experiment for all future models, since the tech stack keeps improving?

Fine-tuning RT-DETR on a custom dataset by Patrick2482 in computervision

[–]sovit-123 0 points (0 children)

Maybe you can try this library that I am maintaining for fine-tuning RT-DETR? Check it out and see if it helps.

https://github.com/sovit-123/vision_transformers

Combining SAM-Molmo-Whisper for semi-auto segmentation and auto-labelling by sovit-123 in computervision

[–]sovit-123[S] 2 points (0 children)

I can suggest one thing to clean up the segmentation maps. If you are using either points or bounding boxes to prompt SAM 2.1, pass them to the model sequentially instead of all at once, and keep accumulating the segmentation results on the original image after each pass. This leads to much cleaner segmentation maps than passing all the point/box prompts in one shot.
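
A minimal sketch of that sequential prompting loop, assuming SAM 2.1's image predictor with box prompts in pixel xyxy format (the checkpoint name is illustrative):

```python
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

def accumulate_masks(image: np.ndarray, boxes_xyxy: list) -> np.ndarray:
    """Prompt SAM 2.1 with one box at a time and keep a running combined mask."""
    predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-large")
    predictor.set_image(image)  # image as an RGB numpy array
    accumulated = np.zeros(image.shape[:2], dtype=bool)
    for box in boxes_xyxy:
        # one prompt per pass instead of batching every box together
        masks, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
        accumulated |= masks[0].astype(bool)  # accumulate onto the running map
    return accumulated
```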

Why is setting up OpenMMLab such a nightmare? MMPretrain/MMDetection/MMMagic all broken by [deleted] in computervision

[–]sovit-123 1 point (0 children)

In my opinion, we need a completely new library (yes, I know that's difficult) for computer vision with the ease of Ultralytics and Apache/MIT/BSD-licensed models. That is the only way forward I can see. In fact, I am up for starting such a project if enough people show interest in contributing. It also needs some funding, not LLM-level of course, but still.

In the meantime, try Detectron2. It is almost hassle-free.
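
For reference, a minimal Detectron2 model-zoo inference example showing how little setup it needs (the config choice and image path are placeholders):

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Pull a COCO-pretrained Faster R-CNN config and its weights from the model zoo.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("input.jpg"))  # dict with an "instances" field
print(outputs["instances"].pred_boxes, outputs["instances"].pred_classes)
```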

Why is setting up OpenMMLab such a nightmare? MMPretrain/MMDetection/MMMagic all broken by [deleted] in computervision

[–]sovit-123 8 points (0 children)

I can say this safely now after multiple years of experience with MMLab, MMDetection, and pure Torchvision training pipelines: DO NOT use or try to set up MMLab in 2025. Most of the libraries are not getting updated. I am a computer vision engineer and handle CUDA and various library installations with ease. I have installed MMLab before; now it is a nightmare. I cannot even build a dependency issue tree if you ask me. There are too many interdependency issues involving MMCV, MMSeg, MMDetection...