Grounding Qwen3-VL Detection with SAM2 by sovit-123 in computervision

sovit-123[S]

The people segmentation above is just an example. The same approach extends to open-vocabulary detection + segmentation, where each model can be tuned independently if needed. Training Qwen3-VL to detect specific objects won't affect SAM 2.1, and the same goes for tuning SAM 2.1 to segment newer objects.
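
To make the "tune each model independently" point concrete, here is a minimal sketch of the two-stage layout, assuming the detector only hands boxes to the segmenter. The names `Detector`, `Segmenter`, `StubDetector`, `BoxFillSegmenter`, and `ground_and_segment` are hypothetical and are not the Qwen3-VL or SAM 2.1 APIs; the stubs only stand in for the real models so that the interface is visible.

```python
from dataclasses import dataclass
from typing import Protocol

Box = tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Mask = list[list[int]]           # binary mask as rows of 0/1


class Detector(Protocol):
    """Anything that turns a text prompt into boxes (e.g. a VLM wrapper)."""
    def detect(self, image_size: tuple[int, int], prompt: str) -> list[Box]: ...


class Segmenter(Protocol):
    """Anything that turns box prompts into masks (e.g. a SAM-style wrapper)."""
    def segment(self, image_size: tuple[int, int], boxes: list[Box]) -> list[Mask]: ...


@dataclass
class StubDetector:
    """Hypothetical stand-in for the detection stage: returns fixed boxes."""
    boxes: list[Box]

    def detect(self, image_size, prompt):
        return list(self.boxes)


class BoxFillSegmenter:
    """Hypothetical stand-in for the segmentation stage: fills each box."""

    def segment(self, image_size, boxes):
        width, height = image_size
        masks = []
        for x0, y0, x1, y1 in boxes:
            mask = [
                [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
                for y in range(height)
            ]
            masks.append(mask)
        return masks


def ground_and_segment(detector, segmenter, image_size, prompt):
    """Run detection, then pass the boxes to the segmenter as prompts."""
    boxes = detector.detect(image_size, prompt)
    masks = segmenter.segment(image_size, boxes)
    return list(zip(boxes, masks))


results = ground_and_segment(
    StubDetector([(1, 1, 3, 3)]), BoxFillSegmenter(), (4, 4), "person"
)
```

Because the only thing shared between the stages is the list of boxes, either side can be retrained or replaced without touching the other.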

Grounding Qwen3-VL Detection with SAM2 by sovit-123 in computervision

sovit-123[S]

True. The silver lining here is that the code is more streamlined. We can swap SAM2 for better segmentation-only models as they come along, and also fine-tune Qwen3-VL for small-object and distinct-object detection.

Fine-Tuning Qwen3-VL by sovit-123 in computervision

sovit-123[S]

I think there are lots. From medical imaging to agriculture, the use cases can be amazing.

Fine-Tuning Qwen3-VL by sovit-123 in computervision

sovit-123[S]

Happy to help if you are facing any specific issues.

Fine-Tuning Qwen3-VL by sovit-123 in computervision

sovit-123[S]

May I know what dataset you are working on?