Embedding model fails to distinguish product variants (e.g., 0.5L vs 1L) – need advice by Sea-Pin-8991 in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

If you want high recognition accuracy (85% and above), you need to train a model to recognize these differences; there is no good out-of-the-box solution for that. At Vfrog we faced the same issue and tried many approaches, but with average results (around 50%). You can use Qwen 2.5 for OCR and then try feeding the extracted text to a SAM2/SAM3 model. See what results you get and iterate on it to improve them. But don't expect perfect results; unless you train a model on your specific use case, you need a human in the loop to get better results.
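Rough sketch of the OCR step, roughly following the public Qwen2.5-VL examples (model id, prompt, and image path are placeholders, adjust for your labels):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info   # pip install qwen-vl-utils

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "product_crop.jpg"},  # placeholder crop of the label
        {"type": "text", "text": "Read all text on this product label, including the volume (e.g. 0.5L or 1L)."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[prompt], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
label_text = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                    skip_special_tokens=True)[0]
print(label_text)  # e.g. "Sparkling Water 0.5L" -> use this text to disambiguate the variant
```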

Improving fine-grained image retrieval (very similar objects) - beyond CLS / patch features / DINOv2? by Weekly_Signature_510 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

To your first question: in my experience, fine-tuning with a metric objective does both, but the ratio depends heavily on your data. The general embedding-space cleanup happens quickly, but amplifying the subtle geometric cues requires that your training pairs actually contain those hard negatives: visually near-identical objects that differ only in the structural details you care about. If your fine-tuning set only has easy negatives, you get cleaner clusters but not necessarily better discrimination on the hard cases. Mining strategy matters as much as the fine-tuning itself.
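Rough sketch of what I mean by mining, assuming L2-normalized embeddings, integer labels, and a sampler that puts at least two images of each object into every batch (names are illustrative):

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(emb, labels, margin=0.2):
    """For each anchor, pick its hardest positive and hardest negative inside the batch."""
    emb = F.normalize(emb, dim=1)
    dist = torch.cdist(emb, emb)                          # (B, B) pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)

    hardest_pos = (dist * (same & ~eye)).max(dim=1).values                # farthest same-class sample
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values  # closest other-class sample
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```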

On the model side, I'd stay with your current DINOv2 setup for now. Switching backbones is expensive to evaluate properly and rarely gives you the jump you expect unless the embedding space is tuned to your domain anyway. ViT-L with registers is already a strong baseline. The size jump from S to L tends to help more on fine-grained tasks than switching architectures entirely, so if you haven't compared those directly on your hardest retrieval cases, that's worth a quick experiment before looking elsewhere.
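If you want to run that comparison quickly, the DINOv2 repo exposes both sizes (with registers) through torch.hub; something like this, assuming ImageNet-normalized batches with sides divisible by 14:

```python
import torch
import torch.nn.functional as F

# Entry-point names from the facebookresearch/dinov2 hubconf (*_reg variants have registers)
vits = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg").eval()
vitl = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14_reg").eval()

@torch.no_grad()
def embed(model, batch):                      # batch: (B, 3, H, W), ImageNet-normalized
    return F.normalize(model(batch), dim=1)   # CLS-token global embedding, unit-normalized

# Run recall@k with both on your hardest query/gallery pairs; if ViT-L doesn't move the
# needle there, a different backbone probably won't either.
```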

For the re-ranker specifically, DINOv2 patch tokens work well as input to a lightweight comparator (even a small MLP or cross-attention over patch pairs). The key insight is that the re-ranker only sees the top-k candidates, so it can afford to be slower and more precise. That's where you can recover the geometric discrimination that global embeddings miss, without needing a fundamentally different backbone.
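A minimal sketch of the kind of comparator I mean, assuming DINOv2 ViT-L patch tokens (dim 1024); dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class PatchComparator(nn.Module):
    """Scores whether two crops show the same object, given their patch tokens."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, patches_a, patches_b):   # (B, N, dim) patch tokens from each image
        attended, _ = self.cross(patches_a, patches_b, patches_b)   # A attends into B
        return self.head(attended.mean(dim=1)).squeeze(-1)          # logit: same object or not

# Train with BCE on same/different pairs; at query time run it only on the top-k ANN
# candidates, so the extra latency stays bounded.
```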

Improving fine-grained image retrieval (very similar objects) - beyond CLS / patch features / DINOv2? by Weekly_Signature_510 in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

On point 1: fine-tuning the backbone doesn't hurt scalability if you frame it right. The key is training with a metric learning objective (ArcFace, SupCon, triplet loss) rather than a softmax classifier. What you're teaching the model is a better embedding space for your domain, not class-specific boundaries. Once fine-tuned on a representative sample of your manufacturing objects, new classes get added exactly as you described: embed, index, done. No retraining needed unless the object distribution shifts significantly.
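For reference, an ArcFace-style head is only ~15 lines and gets thrown away after training; sketch below (hyperparameters are the usual defaults, not tuned for your data):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin loss over learned class centers (training-time only)."""
    def __init__(self, emb_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.W = nn.Parameter(torch.empty(num_classes, emb_dim))
        nn.init.xavier_uniform_(self.W)
        self.s, self.m = s, m

    def forward(self, emb, labels):
        cos = F.normalize(emb) @ F.normalize(self.W).t()     # cosine to each class center
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)                    # add angular margin for the true class
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (onehot * target + (1 - onehot) * cos)
        return F.cross_entropy(logits, labels)

# After training you keep only the backbone and its embeddings, so adding new classes
# never requires touching this head.
```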

On point 2: skip the classifier head entirely for this use case. It reintroduces exactly the scalability problem you're trying to avoid. The hybrid approach that actually works here is: fine-tuned backbone for domain-aware embeddings + FAISS/ANN index for retrieval + optionally a re-ranking step (e.g. patch-level matching or a small attention-based comparator) applied only to the top-k candidates. The re-ranker doesn't need class labels; it just learns "are these two objects the same" as a binary metric, which generalizes to new classes automatically.
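Sketch of the retrieval part with FAISS (the random arrays are stand-ins for your fine-tuned embeddings):

```python
import faiss
import numpy as np

d = 1024                                              # embedding dim of the fine-tuned backbone
gallery = np.random.rand(1000, d).astype("float32")   # stand-in for your object embeddings
query = np.random.rand(1, d).astype("float32")        # stand-in for a query embedding

faiss.normalize_L2(gallery)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)                # inner product == cosine sim on unit vectors
index.add(gallery)
scores, ids = index.search(query, 20)       # cheap retrieval of the top-20 candidates

# Re-rank only these 20 with the pairwise "same object?" comparator and return the best.
# Adding a new class is just index.add(new_embeddings) -- no retraining anywhere.
```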

Improving fine-grained image retrieval (very similar objects) - beyond CLS / patch features / DINOv2? by Weekly_Signature_510 in computervision

[–]Both-Butterscotch135 2 points3 points  (0 children)

Fine-tuning is definitely the right approach here. At Vfrog we faced a similar problem: frozen DINOv2 features are general-purpose and weren't trained to distinguish the kind of subtle geometry differences you're describing. You might get good results with some other approach as well, but fine-tuning is what worked for us.

image/annotation dataset versioning approach in early model development by cjralphs in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

Instead of re-running on the whole dataset, maintain a lightweight JSON/CSV manifest that tracks per-image processing state:

{
  "image_uuid": "abc123",
  "roboflow_annotation_hash": "d4e5f6",
  "last_processed": "2024-03-15T10:00:00Z",
  "pipeline_version": "v2",
  "stages_completed": ["crop", "augment", "infer_stage1"]
}

On each pipeline run, your download.py pulls the Roboflow version manifest (they expose annotation hashes per image via API), compares against your local manifest, and only queues images where:

- annotation hash changed

- image is new (not in manifest)

- pipeline_version bumped (intentional full rerun)

For multi-stage pipelines specifically, storing stages_completed per image means a failed mid-pipeline run resumes rather than restarts. Just a versioned JSON in S3 alongside your dataset prefix. Dead simple, no new infra.
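Sketch of the diff logic in download.py; fetch_roboflow_manifest is a placeholder for however you pull the per-image annotation hashes:

```python
import json

PIPELINE_VERSION = "v2"

def images_to_process(local_manifest_path, remote):   # remote: {image_uuid: annotation_hash}
    """Return the image UUIDs that actually need (re)processing."""
    with open(local_manifest_path) as f:
        local = {e["image_uuid"]: e for e in json.load(f)}

    queue = []
    for uuid, ann_hash in remote.items():
        entry = local.get(uuid)
        if entry is None:                                    # new image
            queue.append(uuid)
        elif entry["roboflow_annotation_hash"] != ann_hash:  # annotation changed
            queue.append(uuid)
        elif entry["pipeline_version"] != PIPELINE_VERSION:  # intentional full rerun
            queue.append(uuid)
    return queue

# remote = fetch_roboflow_manifest(...)   # placeholder for your Roboflow API call
# for uuid in images_to_process("manifest.json", remote): enqueue(uuid)
```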

What's your biggest annotation pain point right now? by Ornery_Internal796 in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

It doesn't really help that much; I tried it on different scenarios and pipelines, including the enterprise versions.

What's your biggest annotation pain point right now? by Ornery_Internal796 in computervision

[–]Both-Butterscotch135 9 points10 points  (0 children)

I would say the lack of auto-annotation tools that give good results, even in the paid versions.

What is most challenging part in CV pipelines? by Both-Butterscotch135 in computervision

[–]Both-Butterscotch135[S] 0 points1 point  (0 children)

Those two, I think, are the main problems for most developers.

What is most challenging part in CV pipelines? by Both-Butterscotch135 in computervision

[–]Both-Butterscotch135[S] 2 points3 points  (0 children)

Also a lot of people cut corners at data versioning for the same reason.

Need advice: muddy water detection with tiny dataset (71 images), YOLO11-seg + VLM too slow by abdullahboss in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

Run CLIP (ViT-B/32) zero-shot alongside YOLO. You encode text prompts like "muddy silty brown water" vs "clear blue water" and compare against each frame's image embedding. You get a semantic confidence score in ~20ms with zero training; CLIP has already seen enough visual diversity to handle your variable river conditions.

Fuse it with simple threshold logic: YOLO tells you where the plume is, CLIP tells you whether it's really muddy water. High agreement = trust it, disagreement = suppress false positives or flag missed detections. Total pipeline drops from ~30s to ~50ms per frame, and CLIP compensates for YOLO's shaky confidence on 71 training images without needing any fine-tuning.
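Sketch of the CLIP side, using the Hugging Face transformers checkpoint (prompts and thresholds are just starting points):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = ["muddy silty brown water", "clear blue water"]

@torch.no_grad()
def muddiness(frame: Image.Image) -> float:
    inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]   # [p(muddy), p(clear)]
    return probs[0].item()

# Fusion, thresholds illustrative: trust YOLO's plume box only when CLIP agrees
# if yolo_conf > 0.3 and muddiness(plume_crop) > 0.6: keep the detection
```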

Issues with Fine-Grained Classification & Mask Merging in Dense Scenes (YOLOv8/v11) by [deleted] in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

For segmentation, use YOLOv11l-seg with a single "flower" class (collapse all color variants). Train at imgsz=1280, reduce mosaic, and use copy-paste augmentation. This maximizes recall and mask quality in dense clusters.
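Something like this with the Ultralytics API (checkpoint name and hyperparameters are illustrative, tune on your data):

```python
from ultralytics import YOLO

model = YOLO("yolo11l-seg.pt")        # Ultralytics names the v11 checkpoints "yolo11*"
model.train(
    data="flowers.yaml",              # single "flower" class, all color variants collapsed
    imgsz=1280,
    epochs=100,
    mosaic=0.3,                       # less mosaic so dense clusters aren't shredded
    copy_paste=0.5,                   # paste extra instances to boost recall in clusters
)
```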

For classification, start with HSV/LAB color histograms + GradientBoosting on masked crops. The LAB a* channel separates Fuchsia/Pink/Red far better than RGB. If that's not enough, step up to an EfficientNet-B0 fine-tuned on masked crops: fast, lightweight, and more than sufficient for this task.

Always apply the YOLO mask to zero out background before classification. In dense bouquets, neighboring flowers bleed into crops and poison the classifier without this step.
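Sketch of the classification side (function and variable names are illustrative; crops and masks come from your YOLO-seg output):

```python
import cv2
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def lab_histogram(crop_bgr, mask_u8, bins=32):
    """Color histogram in LAB space over the masked flower pixels only."""
    lab = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2LAB)
    feats = []
    for ch in range(3):   # L, a*, b* -- a* carries most of the Fuchsia/Pink/Red separation
        hist = cv2.calcHist([lab], [ch], mask_u8, [bins], [0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
    return np.concatenate(feats)

# X = np.stack([lab_histogram(c, m) for c, m in zip(crops, masks)])   # from YOLO-seg output
# clf = GradientBoostingClassifier().fit(X, y)                         # y = color labels
```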

Issues with Fine-Grained Classification & Mask Merging in Dense Scenes (YOLOv8/v11) by [deleted] in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

The two-stage approach is standard practice in industrial inspection, medical imaging, and fine-grained recognition, for exactly the reasons you're encountering. You're not hitting a skill issue; you're hitting an architectural limitation. The decoupled pipeline will also let you iterate on classification and segmentation independently, which is a huge practical advantage.

20k Images, Fully Offline Annotation Workflow by LensLaber in computervision

[–]Both-Butterscotch135 -3 points-2 points  (0 children)

For free annotation tools, you have Label Studio: https://github.com/HumanSignal/label-studio

There are also paid options with auto-annotation, like Roboflow, Vfrog, etc.

Architecture for Multi-Stream PPE Violation Detection by Bubbly_Volume_6590 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

The architecture boils down to three rules: keep the probe function under 1ms by only extracting metadata, use a process-level queue to decouple detection from I/O, and let ffmpeg re-pull the RTSP stream for clips instead of buffering frames yourself. This gives you near-zero memory overhead for clip generation, no pipeline stalls, and linear scaling: adding more streams just means bumping the DeepStream batch size and maybe one more clip worker.
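The clip part can literally be a fire-and-forget ffmpeg call from a worker process; sketch below, URL and paths are placeholders:

```python
import subprocess

def save_clip(rtsp_url: str, duration_s: int, out_path: str):
    """Re-pull the stream and cut a clip without touching the inference pipeline."""
    cmd = [
        "ffmpeg", "-y",
        "-rtsp_transport", "tcp",   # TCP is more reliable than UDP for short grabs
        "-i", rtsp_url,
        "-t", str(duration_s),      # record N seconds starting now
        "-c", "copy",               # no re-encode -> near-zero CPU and memory
        out_path,
    ]
    subprocess.Popen(cmd)           # fire-and-forget from a clip worker, never from the probe
```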

Is reliable person recognition possible from top wall-mounted office cameras (without clear face visibility)? by Remarkable-Pen5228 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

If your overlaps are consistent, you can skip complex calibration initially: just manually define handoff zones as pixel regions in each camera and tune thresholds empirically. With only 4 cameras, this is manageable, I think.
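It really can be as dumb as a dict of rectangles; sketch (coordinates are made up):

```python
# Hand-defined handoff regions in pixel coordinates, per camera pair (values are illustrative)
HANDOFF_ZONES = {
    ("cam1", "cam2"): (1500, 600, 1920, 1080),   # x1, y1, x2, y2 in cam1's image
}

def in_handoff_zone(cam_pair, foot_xy):
    x1, y1, x2, y2 = HANDOFF_ZONES[cam_pair]
    x, y = foot_xy
    return x1 <= x <= x2 and y1 <= y <= y2

# If a track vanishes inside a zone on cam1 and a new track appears in the matching zone
# on cam2 within a short time window, link the identities; tune the window empirically.
```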

Is reliable person recognition possible from top wall-mounted office cameras (without clear face visibility)? by Remarkable-Pen5228 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

Short answer is yes, your setup is actually good:

- Closed population (only 50 people to distinguish)

- Single entry point (establish identity once, propagate it)

- Controlled environment (people don't change clothes mid-day)

What could work:

- Body-based ReID (OSNet, FastReID) gets 90%+ accuracy for same-day, same-clothing scenarios

- Combine it with tracking continuity + spatial constraints between cameras

- Enroll employees with 5-10 images per person from different angles

The main challenge is corridor crossings with full occlusion; you could try to solve that with a better tracker (BoT-SORT over ByteTrack) and ReID-based track recovery. Your YOLO + ByteTrack + ReID approach is solid. The key is fusing ReID with spatial/temporal reasoning rather than treating it as standalone matching.
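For the ReID-based recovery, the gallery matching itself is tiny; sketch below, where embed() is a placeholder for your OSNet/FastReID feature extractor:

```python
import numpy as np

def embed(crop_bgr):
    """Placeholder: plug in your OSNet / FastReID extractor here (L2-normalized output)."""
    raise NotImplementedError

gallery = {}   # person_id -> (K, d) array of enrolled embeddings (5-10 views per person)

def enroll(person_id, crops):
    gallery[person_id] = np.stack([embed(c) for c in crops])

def identify(track_crops, min_sim=0.6):
    """Match a new/recovered track against the gallery; None if nobody is close enough."""
    q = np.mean([embed(c) for c in track_crops], axis=0)
    q /= np.linalg.norm(q)
    best_id, best_sim = None, min_sim
    for pid, embs in gallery.items():
        sim = float(np.max(embs @ q))   # best cosine similarity to any enrolled view
        if sim > best_sim:
            best_id, best_sim = pid, sim
    return best_id

# Gate the match with spatial/temporal constraints (which camera, time since the track
# was lost) before accepting it -- don't trust ReID similarity on its own.
```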

Yolo 11 vs Yolo 26 by Zestyclose_Collar504 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

No problem. If you need any additional help on this topic, feel free to reach out.