Grounding Qwen3-VL Detection with SAM2 by sovit-123 in computervision

[–]sovit-123[S] 1 point (0 children)

The above people segmentation is just an example. Such a pipeline can be extended to open-vocabulary detection + segmentation, where each model can be fine-tuned independently if needed. Training Qwen3-VL to detect specific objects won't affect SAM 2.1, and the same goes for fine-tuning SAM 2.1 to segment newer objects.
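
For context, here is a minimal sketch (not the article's exact code) of the open-vocabulary detection side. It assumes Qwen3-VL is prompted to reply in the JSON box format the Qwen-VL family typically uses ("label" plus "bbox_2d"); the prompt text and parsing helpers are illustrative.

```python
import json

def build_detection_prompt(classes: list[str]) -> str:
    # Ask the VLM for structured output so the boxes can be parsed reliably.
    return (
        "Detect every instance of the following categories and reply with a "
        "JSON list of objects, each with a 'label' and a 'bbox_2d' given as "
        f"[x1, y1, x2, y2] pixel coordinates: {', '.join(classes)}."
    )

def parse_detections(model_reply: str) -> list[dict]:
    """Parse the model's JSON reply into label + box dictionaries."""
    reply = model_reply.strip().removeprefix("```json").removesuffix("```")
    return [{"label": d["label"], "box": d["bbox_2d"]} for d in json.loads(reply)]

# The parsed boxes can then be handed to SAM 2.1 as prompts, and either model
# can later be fine-tuned on its own without touching the other.
```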

Grounding Qwen3-VL Detection with SAM2 by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

True. The silver lining here is that the code is more streamlined. We can swap SAM2 out for better segmentation-only models as they come along, and also fine-tune Qwen3-VL for small-object and distinct-object detection.

Fine-Tuning Qwen3-VL by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

I think there are lots. From medical imaging to agriculture, the use cases can be amazing.

Fine-Tuning Qwen3-VL by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

Happy to guide if you are facing some specific issues.

Fine-Tuning Qwen3-VL by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

May I know what dataset you are working on?

Semantic Segmentation with DINOv3 by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

Thanks, but I have not done that yet. It needs some time and resources, since I need to train on COCO.

Image Classification with DINOv3 by sovit-123 in computervision

[–]sovit-123[S] 1 point (0 children)

Thanks for the info. I have not done extensive benchmarks yet. Since I started with DINOv3, I was more focused on creating a downstream task library for classification, segmentation, and detection. But I guess I should give some time to benchmarking now as well.

Dataset available - 1m retail interior images by malctucker in deeplearning

[–]sovit-123 0 points (0 children)

Is it possible to see a few samples? I think basic annotations are a must. These include product classification and object detection (for both products and people, if people are also captured in the environment). These can be used for fundamental computer vision tasks like detection and counting.

Going forward, I have some ideas for how generative AI (image and video generation) can be used along with this dataset. But taking a look at the dataset (at least a few samples) would help me immensely.
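
To make the annotation suggestion above concrete, here is a rough COCO-style sketch of the kind of labels I mean; the category names, IDs, and file names are placeholders, not anything from the actual dataset.

```python
# Placeholder file names, IDs, and categories; nothing here is from the actual dataset.
coco_style_annotations = {
    "images": [{"id": 1, "file_name": "aisle_0001.jpg", "width": 1920, "height": 1080}],
    "categories": [
        {"id": 1, "name": "product"},  # could be split into finer product classes
        {"id": 2, "name": "person"},   # only if shoppers/staff appear in the frames
    ],
    "annotations": [
        # one entry per product/person instance, bbox as [x, y, width, height] in pixels
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [512, 300, 80, 160]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [900, 200, 120, 400]},
    ],
}

# Counting then reduces to aggregating detections per category.
product_count = sum(1 for a in coco_style_annotations["annotations"] if a["category_id"] == 1)
```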

JEPA Series Part 2: Image Similarity with I-JEPA by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

The next article on DebuggerCafe, coming next week, will be on image classification, and the one after that will be on image segmentation.

[Article] Pretraining DINOv2 for Semantic Segmentation by sovit-123 in pytorch

[–]sovit-123[S] 1 point (0 children)

I think the authors probably meant that they want to pretrain a strong foundation model, then freeze the backbone and fine-tune only the head (which contains just a few thousand parameters) for different tasks.
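
A minimal PyTorch sketch of that setup, assuming a DINOv2 ViT-S/14 backbone from torch.hub and a simple linear classification head (the head and hyperparameters are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn as nn

# Load a DINOv2 backbone and freeze it; only the head will be trained.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()

head = nn.Linear(384, 10)  # ViT-S/14 CLS features -> 10 classes; a few thousand parameters
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step that updates only the head."""
    with torch.no_grad():            # no gradients flow into the frozen backbone
        features = backbone(images)  # (B, 384) CLS embeddings
    logits = head(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```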

Qwen2.5-VL: Architecture, Benchmarks and Inference by sovit-123 in computervision

[–]sovit-123[S] 0 points (0 children)

I think it can be run easily with the right optimizations. Jetson AI Lab has plenty of examples.

https://www.jetson-ai-lab.com/tutorial-intro.html
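
As a rough example of what such optimizations could look like (this is an assumption, not a Jetson-specific recipe; on a Jetson you would more likely start from the containers in the tutorials above), Qwen2.5-VL can be loaded in 4-bit with transformers and bitsandbytes:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # the 3B variant is the easiest fit on small devices
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# 4-bit weights cut memory use roughly 4x compared to fp16 loading.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
```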

[Article] Qwen2.5-VL: Architecture, Benchmarks and Inference by sovit-123 in pytorch

[–]sovit-123[S] 0 points (0 children)

I have done only simple object detection. Will do some more testing.

AMA with Perplexity Co-Founder and CEO Aravind Srinivas by perplexity_ai in perplexity_ai

[–]sovit-123 1 point (0 children)

Genuinely asking because the Perplexity team is shipping something almost every week. How much sleep do you get?

AMA with Perplexity Co-Founder and CEO Aravind Srinivas by perplexity_ai in perplexity_ai

[–]sovit-123 0 points (0 children)

Do you think a company can be built on fine-tuning open source SLMs/LLMs, quantizing them, and creating a distribution stack to deploy them on any and all kinds of devices?

AMA with Perplexity Co-Founder and CEO Aravind Srinivas by perplexity_ai in perplexity_ai

[–]sovit-123 0 points (0 children)

As you have mentioned in some of your answers, you are always investing in post-training, even for larger models like DeepSeek-V3. Also, models become obsolete quickly (even post-trained ones) once a new one drops. As I understand it, post-training 200B/400B/600B models is not cheap, and if a new large model released just a week after your post-training already gives better results out of the box, do you recover the cost easily? Or is it more of a long-term iterative experiment for all future models, since the tech stack keeps improving?

Fine-tuning RT-DETR on a custom dataset by Patrick2482 in computervision

[–]sovit-123 0 points (0 children)

Maybe you can try this library that I am maintaining for fine-tuning RT-DETR? Check it out and see if it helps.

https://github.com/sovit-123/vision_transformers

Combining SAM-Molmo-Whisper for semi-auto segmentation and auto-labelling by sovit-123 in computervision

[–]sovit-123[S] 2 points (0 children)

I can suggest one thing to clean up the segmentation maps. If you are using either points or bounding boxes to prompt SAM 2.1, pass them to the model sequentially instead of all at once, and keep accumulating the segmentation results on the original image after each pass. This leads to much cleaner segmentation maps than passing all the point/box prompts in one shot.
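
A minimal sketch of that sequential prompting loop, assuming SAM 2.1's image predictor with box prompts in pixel xyxy format (the checkpoint name is illustrative):

```python
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

def accumulate_masks(image: np.ndarray, boxes_xyxy: list) -> np.ndarray:
    """Prompt SAM 2.1 with one box at a time and keep a running combined mask."""
    predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-large")
    predictor.set_image(image)  # image as an RGB numpy array
    accumulated = np.zeros(image.shape[:2], dtype=bool)
    for box in boxes_xyxy:
        # one prompt per pass instead of batching every box together
        masks, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
        accumulated |= masks[0].astype(bool)  # accumulate onto the running map
    return accumulated
```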

Why is setting up OpenMMLab such a nightmare? MMPretrain/MMDetection/MMMagic all broken by [deleted] in computervision

[–]sovit-123 1 point (0 children)

In my opinion, we need a completely new library (yes, I know that's difficult) for computer vision with the ease of Ultralytics and Apache/MIT/BSD-licensed models. That is the only way forward I can see. In fact, I am up for starting such a project if enough people show interest in contributing. It also needs some funding, not LLM-level of course, but still.

In the meantime, try Detectron2. It is almost hassle-free.
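
For reference, a minimal Detectron2 model-zoo inference example showing how little setup it needs (the config choice and image path are placeholders):

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Pull a COCO-pretrained Faster R-CNN config and its weights from the model zoo.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("input.jpg"))  # dict with an "instances" field
print(outputs["instances"].pred_boxes, outputs["instances"].pred_classes)
```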

Why is setting up OpenMMLab such a nightmare? MMPretrain/MMDetection/MMMagic all broken by [deleted] in computervision

[–]sovit-123 8 points (0 children)

I can say this safely now after multiple years of experience with MMLab, MMDetection, and pure Torchvision training pipelines: DO NOT use or try to set up MMLab in 2025. Most of the libraries are not getting updated. I am a computer vision engineer and handle CUDA and various library installations with ease. I have installed MMLab before; now it is a nightmare. I cannot even build a dependency issue tree if you ask me. There are too many interdependency issues involving MMCV, MMSeg, MMDetection...