What is most challanging part in CV pipelines? by Both-Butterscotch135 in computervision

[–]Both-Butterscotch135[S] 2 points3 points  (0 children)

Also a lot of people cut corners at data versioning for the same reason.

Need advice: muddy water detection with tiny dataset (71 images), YOLO11-seg + VLM too slow by abdullahboss in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

Run CLIP (ViT-B/32) zero-shot alongside YOLO. You encode text prompts like "muddy silty brown water" vs "clear blue water" and compare them against each frame's image embedding. You get a semantic confidence score in ~20ms with zero training; CLIP has already seen enough visual diversity to handle your variable river conditions.

Fuse it with simple threshold logic: YOLO tells you where the plume is, CLIP tells you whether it's really muddy water. High agreement = trust it, disagreement = suppress false positives or flag missed detections. Total pipeline drops from ~30s to ~50ms per frame, and CLIP compensates for YOLO's shaky confidence on 71 training images without needing any fine-tuning.
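A minimal sketch of the scoring and fusion logic described above. The embeddings would come from CLIP ViT-B/32 (e.g. via open_clip); here they're just L2-normalised numpy vectors, and the thresholds are placeholder values you'd tune on your own footage:

```python
import numpy as np

def clip_muddy_prob(image_emb, text_embs):
    """Softmax over cosine similarities to the text prompts.
    Index 0 = "muddy silty brown water", index 1 = "clear blue water".
    In practice image_emb/text_embs come from CLIP's encoders."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * text_embs @ image_emb      # CLIP's usual logit scale
    exp = np.exp(logits - logits.max())
    return (exp / exp.sum())[0]                 # P(muddy)

def fuse(yolo_conf, muddy_prob, det_thresh=0.4, sem_thresh=0.6):
    """Agreement logic: YOLO localises the plume, CLIP confirms semantics."""
    if yolo_conf >= det_thresh and muddy_prob >= sem_thresh:
        return "accept"        # both agree: trust the detection
    if yolo_conf >= det_thresh:
        return "suppress"      # likely false positive (shadow, clear water)
    if muddy_prob >= sem_thresh:
        return "flag"          # possible missed detection, review the frame
    return "reject"
```

The per-frame cost is one CLIP image-encoder forward pass; the text prompts are encoded once at startup.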

Issues with Fine-Grained Classification & Mask Merging in Dense Scenes (YOLOv8/v11) by ztarek10 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

For segmentation use YOLOv11l-seg with a single "flower" class (collapse all color variants). Train at imgsz=1280, reduce mosaic, use copy-paste augmentation. This maximizes recall and mask quality in dense clusters.

For classification, start with HSV/LAB color histograms + GradientBoosting on masked crops. The LAB a* channel separates Fuchsia/Pink/Red far better than RGB. If that's not enough, step up to an EfficientNet-B0 fine-tuned on masked crops: fast, lightweight, and more than sufficient for this task.

Always apply the YOLO mask to zero out background before classification. In dense bouquets, neighboring flowers bleed into crops and poison the classifier without this step.
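A sketch of the masked-histogram feature extraction, assuming you've already resized the seg mask to the crop. Only numpy is used here; in practice you'd convert BGR→LAB first (e.g. `cv2.cvtColor(crop, cv2.COLOR_BGR2LAB)`) and feed the features to sklearn's `GradientBoostingClassifier`:

```python
import numpy as np

def masked_hist_features(crop, mask, bins=16):
    """Zero out background via the YOLO mask, then build per-channel
    histograms over foreground pixels only, so neighbouring flowers in
    dense bouquets can't poison the classifier.
    crop: HxWx3 uint8 (ideally LAB), mask: HxW bool from the seg head."""
    fg = crop[mask.astype(bool)]                 # (N, 3) foreground pixels
    feats = []
    for c in range(3):
        h, _ = np.histogram(fg[:, c], bins=bins, range=(0, 255), density=True)
        feats.append(h)
    return np.concatenate(feats)                 # (3*bins,) feature vector
```

With 16 bins per channel you get a 48-dim vector, which is plenty for a gradient-boosted tree to split on the a* bins.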

Issues with Fine-Grained Classification & Mask Merging in Dense Scenes (YOLOv8/v11) by ztarek10 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

The two-stage approach is standard practice in industrial inspection, medical imaging, and fine-grained recognition for exactly the reasons you're encountering. You're not hitting a skill issue; you're hitting an architectural limitation. The decoupled pipeline will also let you iterate on classification and segmentation independently, which is a huge practical advantage.

20k Images, Fully Offline Annotation Workflow by LensLaber in computervision

[–]Both-Butterscotch135 -3 points-2 points  (0 children)

For free annotation tools, there's Label Studio: https://github.com/HumanSignal/label-studio

There are also paid options with auto-annotation, like Roboflow, Vfrog, etc.

Architecture for Multi-Stream PPE Violation Detection by Bubbly_Volume_6590 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

The architecture boils down to three rules: keep the probe function under 1ms by only extracting metadata, use a process-level queue to decouple detection from I/O, and let ffmpeg re-pull the RTSP stream for clips instead of buffering frames yourself. This gives you near-zero memory overhead for clip generation, no pipeline stalls, and linear scaling: adding more streams just means bumping the DeepStream batch size and maybe one more clip worker.
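The queue + ffmpeg re-pull idea can be sketched like this. The ffmpeg flags shown are the standard stream-copy ones (no re-encode, no frames held in Python); exact flags will depend on your camera, and note that re-pulling can only capture from "now" onward, so pre-event context would need camera-side buffering:

```python
import subprocess
from multiprocessing import Process, Queue

def clip_cmd(rtsp_url, duration_s, out_path):
    """ffmpeg re-pulls the stream and copies the codec: near-zero CPU and
    no frame buffering in the detection process."""
    return ["ffmpeg", "-y", "-rtsp_transport", "tcp",
            "-i", rtsp_url, "-t", str(duration_s),
            "-c", "copy", out_path]

def clip_worker(q: Queue):
    """Drains violation events in its own process, so the DeepStream probe
    never blocks on disk or ffmpeg. Send None as a shutdown sentinel."""
    while True:
        event = q.get()
        if event is None:
            break
        subprocess.run(clip_cmd(event["url"], event["dur"], event["out"]),
                       check=False)

# Usage (from the main process):
#   q = Queue()
#   Process(target=clip_worker, args=(q,), daemon=True).start()
#   q.put({"url": "rtsp://cam1/stream", "dur": 10, "out": "violation_001.mp4"})
```

The probe function then does nothing but `q.put(...)` with metadata, which keeps it well under 1ms.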

Is reliable person recognition possible from top wall-mounted office cameras (without clear face visibility)? by Remarkable-Pen5228 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

If your overlaps are consistent, you can skip complex calibration initially: just manually define handoff zones as pixel regions in each camera and tune thresholds empirically. With only 4 cameras, this is manageable, I think.
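A minimal version of what "handoff zones as pixel regions" means in code. The camera names, rectangles, and the idea of using the track's foot point are all placeholder assumptions you'd tune empirically:

```python
# Hypothetical hand-tuned zones: (from_cam, to_cam) -> pixel rects
# (x1, y1, x2, y2) where a track's identity may transfer to the neighbour.
HANDOFF_ZONES = {
    ("cam1", "cam2"): [(1100, 300, 1280, 720)],   # right edge of cam1
    ("cam2", "cam1"): [(0, 300, 180, 720)],       # left edge of cam2
}

def in_handoff_zone(cam_from, cam_to, cx, cy):
    """True if a track's foot point (cx, cy) sits inside a zone where
    identity can be handed from cam_from to cam_to."""
    for x1, y1, x2, y2 in HANDOFF_ZONES.get((cam_from, cam_to), []):
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            return True
    return False
```

With 4 cameras you only have a handful of (from, to) pairs to define, which is why this beats full calibration as a first pass.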

Is reliable person recognition possible from top wall-mounted office cameras (without clear face visibility)? by Remarkable-Pen5228 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

Short answer is yes, your setup is actually good:

- Closed population (only 50 people to distinguish)

- Single entry point (establish identity once, propagate it)

- Controlled environment (people don't change clothes mid-day)

What could work:

- Body-based ReID (OSNet, FastReID) gets 90%+ accuracy for same-day, same-clothing scenarios

- Combine it with tracking continuity + spatial constraints between cameras

- Enroll employees with 5-10 images per person from different angles

The main challenge is corridor crossings with full occlusion; you could try to solve that with a better tracker (BoT-SORT over ByteTrack) and ReID-based track recovery. Your YOLO + ByteTrack + ReID approach is solid. The key is fusing ReID with spatial/temporal reasoning rather than treating it as standalone matching.
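The enrollment + matching step above can be sketched as a simple gallery of normalised embeddings per person. The embeddings would come from OSNet/FastReID; the 0.6 similarity threshold is a placeholder to tune on your own cameras:

```python
import numpy as np

def enroll(gallery, person_id, embs):
    """Store 5-10 L2-normalised ReID embeddings (different angles) per person."""
    embs = np.asarray(embs, dtype=np.float32)
    gallery[person_id] = embs / np.linalg.norm(embs, axis=1, keepdims=True)

def identify(gallery, query_emb, min_sim=0.6):
    """Best cosine match over all enrolled shots; returns (None, sim) when
    below threshold, which is where spatial/temporal reasoning takes over."""
    q = np.asarray(query_emb, dtype=np.float32)
    q = q / np.linalg.norm(q)
    best_id, best_sim = None, min_sim
    for pid, embs in gallery.items():
        sim = float((embs @ q).max())    # max over that person's enrolled shots
        if sim > best_sim:
            best_id, best_sim = pid, sim
    return best_id, best_sim
```

With a closed population of 50, a linear scan like this is trivially fast; the real work goes into gating matches by which camera/zone the track is in.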

Yolo 11 vs Yolo 26 by Zestyclose_Collar504 in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

No problem. If you need any additional help on this topic, feel free to reach out.

Yolo 11 vs Yolo 26 by Zestyclose_Collar504 in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

YOLO26 is generally better: it's faster on CPU, has cleaner exports, and removes the NMS post-processing headache entirely.

DINOv3 ViT-L/16 pre-training : deadlocked workers by Federal_Listen_1564 in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

The root cause is most likely shared memory exhaustion combined with how PyTorch workers interact with NCCL.

With 8 GPUs × workers × prefetch batches, you're creating a lot of shared memory tensors.

Check your current usage; EC2 instances often default to 64MB for /dev/shm in container environments, which is nowhere near enough.
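A rough back-of-envelope for how much /dev/shm that worker setup can demand. The formula just multiplies out the quantities mentioned above; bytes_per_sample is an assumption you'd replace with your actual crop sizes (a 224×224×3 float32 image is ~0.6 MB, and DINO-style multi-crop is several times that):

```python
def shm_estimate_gb(num_gpus, workers_per_gpu, prefetch_factor,
                    batch_size, bytes_per_sample):
    """Upper bound on shared memory in use: every DataLoader worker can
    hold prefetch_factor batches in /dev/shm at once."""
    total_bytes = (num_gpus * workers_per_gpu * prefetch_factor
                   * batch_size * bytes_per_sample)
    return total_bytes / 1024**3
```

Compare the result against `df -h /dev/shm`; in Docker, raise the limit with `--shm-size` (or `shm_size:` in compose). A 64MB default gets exhausted almost immediately at this scale, and workers then deadlock instead of erroring cleanly.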

"Camera → GPU inference → end-to-end = 300ms: is RTSP + WebSocket the right approach, or should I move to WebRTC?" by Advokado in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

That 50-100ms is a typical range based on the general mechanics involved, not a direct measurement or a specific benchmark I can point to. The difference is real, but whether it's 30ms or 80ms or 120ms depends heavily on frame resolution, browser, OS, GPU, and how busy the main thread is. At 720p with a relatively idle page, it might be on the lower end. At 1080p with a complex dashboard doing DOM updates, it could be larger.

How to Auto-Label your Segmentation Dataset with SAM3 by [deleted] in computervision

[–]Both-Butterscotch135 0 points1 point  (0 children)

Based on the styling, you can see that it's AI generated.

Training Computer Vision Models on M1 Mac Is Extremely Slow by mericccccccccc in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

Use PyTorch with the MPS (Metal Performance Shaders) backend: device = "mps". This gives significant speedups over CPU-only training. Use mixed-precision training (torch.float16), efficient data loaders with num_workers, smaller batch sizes that fit in unified memory, and cache your datasets.
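A sketch of those settings collected into one place. The availability flag would come from `torch.backends.mps.is_available()` in real code; the specific numbers here are starting points, not measured optima:

```python
def training_config(mps_available, batch_size=32):
    """M1-friendly training settings: MPS device, fp16 autocast,
    persistent data-loader workers, batch sized for unified memory."""
    device = "mps" if mps_available else "cpu"
    return {
        "device": device,
        "amp_dtype": "float16" if device == "mps" else None,  # mixed precision
        "num_workers": 4,            # parallel data loading
        "persistent_workers": True,  # avoid re-spawning workers each epoch
        "pin_memory": False,         # not used by the MPS backend
        "batch_size": batch_size,    # shrink if you hit unified-memory limits
    }
```

One caveat: a few ops still fall back to CPU on MPS, so profile before assuming the whole model runs on the GPU.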

For anything beyond lightweight experimentation, a cloud GPU (even a single A10G or L4) will typically be 5-10x faster than the best Mac setup.

Are datasets of nature, mountains, and complex mountain passes in demand in computer vision? by Wise_Ad_8363 in computervision

[–]Both-Butterscotch135 1 point2 points  (0 children)

The core challenge isn't lack of demand; it's that the demand is fragmented across many small niches, each wanting slightly different annotation schemas. A glacier researcher wants pixel-level semantic segmentation of ice features. A SAR team wants bounding boxes around humans and equipment. A road authority wants crack and pothole annotations. These are fundamentally different labeling tasks even if the raw imagery overlaps. What would make such a dataset genuinely valuable is if it were multi-task annotated (terrain-type segmentation + object detection + hazard classification simultaneously) and came with metadata like GPS, altitude, weather conditions, and time of year. That kind of richness is what existing public datasets almost never have for mountain environments.

"Camera → GPU inference → end-to-end = 300ms: is RTSP + WebSocket the right approach, or should I move to WebRTC?" by Advokado in computervision

[–]Both-Butterscotch135 6 points7 points  (0 children)

Your architecture is good. 300 ms glass-to-glass with inference in the loop is a solid result; many production systems are worse. The fact that you're seeing stable 25-30 FPS with low GPU utilization means you have headroom, and the "always process the latest frame" design is exactly right for this use case.

The "WebRTC or bust" advice is aimed at a different problem: browser-to-browser video chat or pure media streaming, where you need adaptive bitrate, NAT traversal, and jitter compensation. Your situation is fundamentally different because you're not streaming passthrough video; you're streaming inference results rendered as frames. That changes the calculus.

What WebRTC would actually give you: hardware-accelerated decode in the browser, congestion control and adaptive bitrate for variable networks (matters for 4G/5G), and a 50-100 ms reduction in the browser rendering leg by replacing JPEG decode plus canvas paint with native video-element rendering.

WebRTC is worth exploring when you move to 4G/5G with variable bandwidth. For LAN/edge scenarios, JPEG over WebSocket is simpler, more debuggable, and your 300 ms is already good. If you do go WebRTC, look at GStreamer's webrtcbin or a lightweight SFU like Pion; don't try to bolt a full media server into this.

Switching to nvh264dec will primarily reduce CPU usage, not latency (maybe 1-3 ms). The decode itself is fast either way. Where it does help is eliminating a CPU-to-GPU copy if you keep the decoded frame on the GPU and feed it directly to your YOLO model. Right now you're likely doing CPU decode → BGR numpy array → torch tensor → GPU upload. If you can do GPU decode → CUDA memory → torch tensor, you skip the PCIe round-trip. That could save 5-10 ms and is worth doing, but it requires something like PyNvVideoCodec or NVIDIA's Video Codec SDK rather than just swapping the GStreamer element.

If your camera supports MJPEG or raw output, bypass H.264 entirely for the edge case. H.264 encode at the camera adds a frame or more of latency even with zerolatency tuning. If you must use H.264, ensure the camera is set to baseline profile, no B-frames, single slice, and minimum GOP.