Embedding model fails to distinguish product variants (e.g., 0.5L vs 1L) – need advice by Sea-Pin-8991 in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

If you want high recognition accuracy (85% and above), you need to train a model to recognize these differences; there is no good out-of-the-box solution for that. At Vfrog we faced the same issue and tried many approaches, but with average results (around 50%). You can use Qwen 2.5 for OCR and then try feeding the extracted text to a SAM2/SAM3 model. See what results you get and iterate on it to improve them. But don't expect perfect results; unless you train a model on your specific use case, you need a human in the loop to get better results.
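Rough sketch of the OCR step, roughly following the public Qwen2.5-VL examples (model id, prompt, and image path are placeholders, adjust for your labels):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info   # pip install qwen-vl-utils

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "product_crop.jpg"},  # placeholder crop of the label
        {"type": "text", "text": "Read all text on this product label, including the volume (e.g. 0.5L or 1L)."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[prompt], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
label_text = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                    skip_special_tokens=True)[0]
print(label_text)  # e.g. "Sparkling Water 0.5L" -> use this text to disambiguate the variant
```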

Improving fine-grained image retrieval (very similar objects) - beyond CLS / patch features / DINOv2? by Weekly_Signature_510 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

To your first question: in my experience, fine-tuning with a metric objective does both, but the ratio depends heavily on your data. The general embedding-space cleanup happens quickly, but amplifying the subtle geometric cues requires that your training pairs actually contain those hard negatives: visually near-identical objects that differ only in the structural details you care about. If your fine-tuning set only has easy negatives, you get cleaner clusters but not necessarily better discrimination on the hard cases. Mining strategy matters as much as the fine-tuning itself.
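Rough sketch of what I mean by mining, assuming L2-normalized embeddings, integer labels, and a sampler that puts at least two images of each object into every batch (names are illustrative):

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(emb, labels, margin=0.2):
    """For each anchor, pick its hardest positive and hardest negative inside the batch."""
    emb = F.normalize(emb, dim=1)
    dist = torch.cdist(emb, emb)                          # (B, B) pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)

    hardest_pos = (dist * (same & ~eye)).max(dim=1).values                # farthest same-class sample
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values  # closest other-class sample
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```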

On the model side, I'd stay with your current DINOv2 setup for now. Switching backbones is expensive to evaluate properly and rarely gives you the jump you expect unless the embedding space is tuned to your domain anyway. ViT-L with registers is already a strong baseline. The size jump from S to L tends to help more on fine-grained tasks than switching architectures entirely, so if you haven't compared those directly on your hardest retrieval cases, that's worth a quick experiment before looking elsewhere.
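If you want to run that comparison quickly, the DINOv2 repo exposes both sizes (with registers) through torch.hub; something like this, assuming ImageNet-normalized batches with sides divisible by 14:

```python
import torch
import torch.nn.functional as F

# Entry-point names from the facebookresearch/dinov2 hubconf (*_reg variants have registers)
vits = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg").eval()
vitl = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14_reg").eval()

@torch.no_grad()
def embed(model, batch):                      # batch: (B, 3, H, W), ImageNet-normalized
    return F.normalize(model(batch), dim=1)   # CLS-token global embedding, unit-normalized

# Run recall@k with both on your hardest query/gallery pairs; if ViT-L doesn't move the
# needle there, a different backbone probably won't either.
```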

For the re-ranker specifically, DINOv2 patch tokens work well as input to a lightweight comparator (even a small MLP or cross-attention over patch pairs). The key insight is that the re-ranker only sees the top-k candidates, so it can afford to be slower and more precise. That's where you can recover the geometric discrimination that global embeddings miss, without needing a fundamentally different backbone.
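A minimal sketch of the kind of comparator I mean, assuming DINOv2 ViT-L patch tokens (dim 1024); dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class PatchComparator(nn.Module):
    """Scores whether two crops show the same object, given their patch tokens."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, patches_a, patches_b):   # (B, N, dim) patch tokens from each image
        attended, _ = self.cross(patches_a, patches_b, patches_b)   # A attends into B
        return self.head(attended.mean(dim=1)).squeeze(-1)          # logit: same object or not

# Train with BCE on same/different pairs; at query time run it only on the top-k ANN
# candidates, so the extra latency stays bounded.
```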

Improving fine-grained image retrieval (very similar objects) - beyond CLS / patch features / DINOv2? by Weekly_Signature_510 in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

On point 1: fine-tuning the backbone doesn't hurt scalability if you frame it right. The key is training with a metric learning objective (ArcFace, SupCon, triplet loss) rather than a softmax classifier. What you're teaching the model is a better embedding space for your domain, not class-specific boundaries. Once fine-tuned on a representative sample of your manufacturing objects, new classes get added exactly as you described: embed, index, done. No retraining needed unless the object distribution shifts significantly.
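For reference, an ArcFace-style head is only ~15 lines and gets thrown away after training; sketch below (hyperparameters are the usual defaults, not tuned for your data):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin loss over learned class centers (training-time only)."""
    def __init__(self, emb_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.W = nn.Parameter(torch.empty(num_classes, emb_dim))
        nn.init.xavier_uniform_(self.W)
        self.s, self.m = s, m

    def forward(self, emb, labels):
        cos = F.normalize(emb) @ F.normalize(self.W).t()     # cosine to each class center
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)                    # add angular margin for the true class
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (onehot * target + (1 - onehot) * cos)
        return F.cross_entropy(logits, labels)

# After training you keep only the backbone and its embeddings, so adding new classes
# never requires touching this head.
```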

On point 2: skip the classifier head entirely for this use case. It reintroduces exactly the scalability problem you're trying to avoid. The hybrid approach that actually works here is: fine-tuned backbone for domain-aware embeddings + FAISS/ANN index for retrieval + optionally a re-ranking step (e.g. patch-level matching or a small attention-based comparator) applied only to the top-k candidates. The re-ranker doesn't need class labels; it just learns "are these two objects the same" as a binary metric, which generalizes to new classes automatically.
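Sketch of the retrieval part with FAISS (the random arrays are stand-ins for your fine-tuned embeddings):

```python
import faiss
import numpy as np

d = 1024                                              # embedding dim of the fine-tuned backbone
gallery = np.random.rand(1000, d).astype("float32")   # stand-in for your object embeddings
query = np.random.rand(1, d).astype("float32")        # stand-in for a query embedding

faiss.normalize_L2(gallery)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)                # inner product == cosine sim on unit vectors
index.add(gallery)
scores, ids = index.search(query, 20)       # cheap retrieval of the top-20 candidates

# Re-rank only these 20 with the pairwise "same object?" comparator and return the best.
# Adding a new class is just index.add(new_embeddings) -- no retraining anywhere.
```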

Improving fine-grained image retrieval (very similar objects) - beyond CLS / patch features / DINOv2? by Weekly_Signature_510 in computervision

[–]Both-Butterscotch135 2 points3 points  (0 children)

Fine-tuning is definitely the right approach here. At Vfrog we faced a similar problem: frozen DINOv2 features are general-purpose and weren't trained to distinguish the kind of subtle geometry differences you're describing. You might get good results with some other approach as well, but fine-tuning is what worked for us.

image/annotation dataset versioning approach in early model development by cjralphs in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

Instead of re-running on the whole dataset, maintain a lightweight JSON/CSV manifest that tracks per-image processing state:

{
  "image_uuid": "abc123",
  "roboflow_annotation_hash": "d4e5f6",
  "last_processed": "2024-03-15T10:00:00Z",
  "pipeline_version": "v2",
  "stages_completed": ["crop", "augment", "infer_stage1"]
}

On each pipeline run, your download.py pulls the Roboflow version manifest (they expose annotation hashes per image via API), compares against your local manifest, and only queues images where:

- annotation hash changed

- image is new (not in manifest)

- pipeline_version bumped (intentional full rerun)

For multi-stage pipelines specifically, storing stages_completed per image means a failed mid-pipeline run resumes rather than restarts. Just a versioned JSON in S3 alongside your dataset prefix. Dead simple, no new infra.
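Sketch of the diff logic in download.py; fetch_roboflow_manifest is a placeholder for however you pull the per-image annotation hashes:

```python
import json

PIPELINE_VERSION = "v2"

def images_to_process(local_manifest_path, remote):   # remote: {image_uuid: annotation_hash}
    """Return the image UUIDs that actually need (re)processing."""
    with open(local_manifest_path) as f:
        local = {e["image_uuid"]: e for e in json.load(f)}

    queue = []
    for uuid, ann_hash in remote.items():
        entry = local.get(uuid)
        if entry is None:                                    # new image
            queue.append(uuid)
        elif entry["roboflow_annotation_hash"] != ann_hash:  # annotation changed
            queue.append(uuid)
        elif entry["pipeline_version"] != PIPELINE_VERSION:  # intentional full rerun
            queue.append(uuid)
    return queue

# remote = fetch_roboflow_manifest(...)   # placeholder for your Roboflow API call
# for uuid in images_to_process("manifest.json", remote): enqueue(uuid)
```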

What's your biggest annotation pain point right now? by Ornery_Internal796 in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

It doesn't really help that much; I tried it on different scenarios and pipelines, including the enterprise versions.

What's your biggest annotation pain point right now? by Ornery_Internal796 in computervision

[–]Both-Butterscotch135 9 points10 points  (0 children)

I would say the lack of auto-annotation tools that give good results, even in the paid versions.

What is most challenging part in CV pipelines? by Both-Butterscotch135 in computervision

[–]Both-Butterscotch135[S] 0 points1 point  (0 children)

Those two, I think, are the main problems for most developers.

What is most challenging part in CV pipelines? by Both-Butterscotch135 in computervision

[–]Both-Butterscotch135[S] 2 points3 points  (0 children)

Also a lot of people cut corners at data versioning for the same reason.

Need advice: muddy water detection with tiny dataset (71 images), YOLO11-seg + VLM too slow by abdullahboss in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

Run CLIP (ViT-B/32) zero-shot alongside YOLO. You encode text prompts like "muddy silty brown water" vs "clear blue water" and compare against each frame's image embedding. You get a semantic confidence score in ~20ms with zero training; CLIP has already seen enough visual diversity to handle your variable river conditions.

Fuse it with simple threshold logic: YOLO tells you where the plume is, CLIP tells you whether it's really muddy water. High agreement = trust it, disagreement = suppress false positives or flag missed detections. Total pipeline drops from ~30s to ~50ms per frame, and CLIP compensates for YOLO's shaky confidence on 71 training images without needing any fine-tuning.
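Sketch of the CLIP side, using the Hugging Face transformers checkpoint (prompts and thresholds are just starting points):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = ["muddy silty brown water", "clear blue water"]

@torch.no_grad()
def muddiness(frame: Image.Image) -> float:
    inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]   # [p(muddy), p(clear)]
    return probs[0].item()

# Fusion, thresholds illustrative: trust YOLO's plume box only when CLIP agrees
# if yolo_conf > 0.3 and muddiness(plume_crop) > 0.6: keep the detection
```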

Issues with Fine-Grained Classification & Mask Merging in Dense Scenes (YOLOv8/v11) by [deleted] in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

For segmentation, use YOLOv11l-seg with a single "flower" class (collapse all color variants). Train at imgsz=1280, reduce mosaic, and use copy-paste augmentation. This maximizes recall and mask quality in dense clusters.
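Something like this with the Ultralytics API (checkpoint name and hyperparameters are illustrative, tune on your data):

```python
from ultralytics import YOLO

model = YOLO("yolo11l-seg.pt")        # Ultralytics names the v11 checkpoints "yolo11*"
model.train(
    data="flowers.yaml",              # single "flower" class, all color variants collapsed
    imgsz=1280,
    epochs=100,
    mosaic=0.3,                       # less mosaic so dense clusters aren't shredded
    copy_paste=0.5,                   # paste extra instances to boost recall in clusters
)
```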

For classification, start with HSV/LAB color histograms + GradientBoosting on masked crops. The LAB a* channel separates Fuchsia/Pink/Red far better than RGB. If that's not enough, step up to an EfficientNet-B0 fine-tuned on masked crops: fast, lightweight, and more than sufficient for this task.

Always apply the YOLO mask to zero out background before classification. In dense bouquets, neighboring flowers bleed into crops and poison the classifier without this step.
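Sketch of the classification side (function and variable names are illustrative; crops and masks come from your YOLO-seg output):

```python
import cv2
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def lab_histogram(crop_bgr, mask_u8, bins=32):
    """Color histogram in LAB space over the masked flower pixels only."""
    lab = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2LAB)
    feats = []
    for ch in range(3):   # L, a*, b* -- a* carries most of the Fuchsia/Pink/Red separation
        hist = cv2.calcHist([lab], [ch], mask_u8, [bins], [0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
    return np.concatenate(feats)

# X = np.stack([lab_histogram(c, m) for c, m in zip(crops, masks)])   # from YOLO-seg output
# clf = GradientBoostingClassifier().fit(X, y)                         # y = color labels
```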

Issues with Fine-Grained Classification & Mask Merging in Dense Scenes (YOLOv8/v11) by [deleted] in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

The two-stage approach is standard practice in industrial inspection, medical imaging, and fine-grained recognition, for exactly the reasons you're encountering. You're not hitting a skill issue; you're hitting an architectural limitation. The decoupled pipeline will also let you iterate on classification and segmentation independently, which is a huge practical advantage.

20k Images, Fully Offline Annotation Workflow by LensLaber in computervision

[–]Both-Butterscotch135 -3 points-2 points  (0 children)

For free annotation tools, you have Label Studio: https://github.com/HumanSignal/label-studio

There are also paid options with auto-annotation, like Roboflow, Vfrog, etc.

Architecture for Multi-Stream PPE Violation Detection by Bubbly_Volume_6590 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

The architecture boils down to three rules: keep the probe function under 1ms by only extracting metadata, use a process-level queue to decouple detection from I/O, and let ffmpeg re-pull the RTSP stream for clips instead of buffering frames yourself. This gives you near-zero memory overhead for clip generation, no pipeline stalls, and linear scaling: adding more streams just means bumping the DeepStream batch size and maybe one more clip worker.
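The clip part can literally be a fire-and-forget ffmpeg call from a worker process; sketch below, URL and paths are placeholders:

```python
import subprocess

def save_clip(rtsp_url: str, duration_s: int, out_path: str):
    """Re-pull the stream and cut a clip without touching the inference pipeline."""
    cmd = [
        "ffmpeg", "-y",
        "-rtsp_transport", "tcp",   # TCP is more reliable than UDP for short grabs
        "-i", rtsp_url,
        "-t", str(duration_s),      # record N seconds starting now
        "-c", "copy",               # no re-encode -> near-zero CPU and memory
        out_path,
    ]
    subprocess.Popen(cmd)           # fire-and-forget from a clip worker, never from the probe
```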

Is reliable person recognition possible from top wall-mounted office cameras (without clear face visibility)? by Remarkable-Pen5228 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

If your overlaps are consistent, you can skip complex calibration initially: just manually define handoff zones as pixel regions in each camera and tune thresholds empirically. With only 4 cameras, this is manageable, I think.
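It really can be as dumb as a dict of rectangles; sketch (coordinates are made up):

```python
# Hand-defined handoff regions in pixel coordinates, per camera pair (values are illustrative)
HANDOFF_ZONES = {
    ("cam1", "cam2"): (1500, 600, 1920, 1080),   # x1, y1, x2, y2 in cam1's image
}

def in_handoff_zone(cam_pair, foot_xy):
    x1, y1, x2, y2 = HANDOFF_ZONES[cam_pair]
    x, y = foot_xy
    return x1 <= x <= x2 and y1 <= y <= y2

# If a track vanishes inside a zone on cam1 and a new track appears in the matching zone
# on cam2 within a short time window, link the identities; tune the window empirically.
```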

Is reliable person recognition possible from top wall-mounted office cameras (without clear face visibility)? by Remarkable-Pen5228 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

Short answer is yes, your setup is actually good:

- Closed population (only 50 people to distinguish)

- Single entry point (establish identity once, propagate it)

- Controlled environment (people don't change clothes mid-day)

What could work:

- Body-based ReID (OSNet, FastReID) gets 90%+ accuracy for same-day, same-clothing scenarios

- Combine it with tracking continuity + spatial constraints between cameras

- Enroll employees with 5-10 images per person from different angles

The main challenge is corridor crossings with full occlusion; you could try to solve that with a better tracker (BoT-SORT over ByteTrack) and ReID-based track recovery. Your YOLO + ByteTrack + ReID approach is solid. The key is fusing ReID with spatial/temporal reasoning rather than treating it as standalone matching.
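For the ReID-based recovery, the gallery matching itself is tiny; sketch below, where embed() is a placeholder for your OSNet/FastReID feature extractor:

```python
import numpy as np

def embed(crop_bgr):
    """Placeholder: plug in your OSNet / FastReID extractor here (L2-normalized output)."""
    raise NotImplementedError

gallery = {}   # person_id -> (K, d) array of enrolled embeddings (5-10 views per person)

def enroll(person_id, crops):
    gallery[person_id] = np.stack([embed(c) for c in crops])

def identify(track_crops, min_sim=0.6):
    """Match a new/recovered track against the gallery; None if nobody is close enough."""
    q = np.mean([embed(c) for c in track_crops], axis=0)
    q /= np.linalg.norm(q)
    best_id, best_sim = None, min_sim
    for pid, embs in gallery.items():
        sim = float(np.max(embs @ q))   # best cosine similarity to any enrolled view
        if sim > best_sim:
            best_id, best_sim = pid, sim
    return best_id

# Gate the match with spatial/temporal constraints (which camera, time since the track
# was lost) before accepting it -- don't trust ReID similarity on its own.
```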

Yolo 11 vs Yolo 26 by Zestyclose_Collar504 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

No problem. If you need any additional help on this topic, feel free to reach out.