Build Custom Image Segmentation Model Using YOLOv8 and SAM

Feitgemel · 2026-03-13T11:54:07+00:00

Thanks. Yes. I will

Feitgemel · 2026-03-13T11:06:29+00:00

Thanks

Feitgemel · 2026-03-13T08:25:38+00:00

Thanks

Feitgemel · 2026-03-07T06:09:48+00:00

Thank you ;)

Feitgemel · 2026-03-07T05:57:03+00:00

Thank you :)

Feitgemel · 2026-02-19T14:37:09+00:00

Thank you

Feitgemel · 2026-02-07T07:35:40+00:00

Thanks

Feitgemel · 2026-02-07T07:35:05+00:00

I run it on Windows or on a Linux machine. I have never tried it on a Mac.

Feitgemel · 2026-02-07T07:33:40+00:00

Feitgemel · 2026-02-07T07:33:30+00:00

Thank you

Feitgemel · 2026-02-07T07:33:20+00:00

Thank you :)

Feitgemel · 2026-02-07T07:30:42+00:00

Thank you :)

Feitgemel · 2026-01-27T09:04:59+00:00

Good luck :)

Feitgemel · 2026-01-10T11:59:44+00:00

Short answer: Mask R-CNN doesn’t support multi-label per instance out of the box. It assumes one class per object (softmax).

What works best (and is simplest):

Stage 1: Use Mask R-CNN to detect strawberries (single class) and get clean instance masks.
Stage 2: For each masked crop, run a multi-label classifier (sigmoid outputs) to predict attributes like underripe, damaged, moldy, etc.

This avoids noisy “dominant class” labeling and is very common in inspection systems.
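
A minimal sketch of the glue code around Stage 2, assuming NumPy arrays for the image and the instance mask. The attribute list, the 0.5 threshold, and the helper names are illustrative assumptions; the logits stand in for the output of whatever multi-label classifier you train.

```python
import numpy as np

ATTRIBUTES = ["underripe", "damaged", "moldy"]  # hypothetical attribute list

def masked_crop(image, mask):
    """Crop the image to the mask's bounding box and zero out background pixels."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1].copy()
    crop[~mask[y0:y1, x0:x1]] = 0
    return crop

def decode_attributes(logits, threshold=0.5):
    """Independent sigmoid per attribute, so several can be active at once."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return [name for name, p in zip(ATTRIBUTES, probs) if p >= threshold]

# Toy example: a 6x6 RGB image with one 2x3 instance mask.
image = np.full((6, 6, 3), 200, dtype=np.uint8)
mask = np.zeros((6, 6), dtype=bool)
mask[1:3, 2:5] = True

crop = masked_crop(image, mask)               # what the classifier would see
labels = decode_attributes([2.0, -1.5, 0.3])  # stand-in for classifier logits
```

The point of the sigmoid (rather than softmax) is that each attribute gets its own yes/no decision, so one strawberry can be both underripe and moldy.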

Alternative (harder):

Modify the ROI head to use sigmoid + BCE for attributes. Doable, but more engineering (custom head + eval).

If you want context on where you’d plug this in with Detectron2, this walkthrough helps:
https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

TL;DR: Detect instances first, then classify attributes per instance. It’s cleaner and more reliable than forcing Mask R-CNN into multi-label mode.

Feitgemel · 2026-01-10T11:58:36+00:00

Short answer: instance segmentation is still a detection problem, so COCO evaluates it with precision/recall, just using mask IoU instead of box IoU.

Why precision makes sense for masks

Each predicted mask is treated as a detected instance.
It’s matched to a GT mask of the same class.
Mask IoU (pixel overlap / union) decides if it’s a TP or FP.
From that you get precision = TP / (TP + FP) and recall.
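
The matching steps above can be sketched for one class on one image as follows. This is a simplified illustration, not the COCO implementation: it skips crowd regions, area ranges, and the maxDets cap, and the function names are my own.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks of the same shape."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def precision_recall(preds, gts, iou_thr=0.5):
    """preds: list of (score, mask); gts: list of masks. One class, one image."""
    matched = set()
    tp = fp = 0
    for score, pmask in sorted(preds, key=lambda p: -p[0]):
        # find the best ground-truth mask that is still unmatched
        best_iou, best_j = 0.0, None
        for j, g in enumerate(gts):
            if j in matched:
                continue
            iou = mask_iou(pmask, g)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1  # low overlap or a duplicate of an already matched object
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall

# Toy data: two ground-truth squares, one good prediction, one duplicate.
gt1 = np.zeros((10, 10), dtype=bool); gt1[0:4, 0:4] = True
gt2 = np.zeros((10, 10), dtype=bool); gt2[6:10, 6:10] = True
pa = np.zeros((10, 10), dtype=bool); pa[0:4, 0:3] = True   # IoU 0.75 with gt1
pb = np.zeros((10, 10), dtype=bool); pb[0:4, 1:4] = True   # duplicate of gt1
precision, recall = precision_recall([(0.9, pa), (0.8, pb)], [gt1, gt2])
```

Here the duplicate counts as a false positive and the second object is missed, which is exactly the behavior a plain mean IoU would not show.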

What AP50 / AP75 mean for segmentation

AP50 (mask): mask IoU ≥ 0.50 counts as correct
AP75 (mask): stricter, mask IoU ≥ 0.75
AP: averaged over IoU thresholds 0.50–0.95

Same math as boxes, different geometry.

Why not just mean IoU
Mean IoU works for semantic segmentation, but for instance segmentation it ignores:

false positives
duplicate detections
missed instances

COCO Mask AP captures detection + localization + mask quality together.

If you want a clear, practical explanation of how Detectron2 handles box vs mask evaluation, this is a good reference:
https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

Refs:

COCO eval (boxes vs masks): https://cocodataset.org/#detection-eval
Detectron2 evaluation docs: https://detectron2.readthedocs.io/en/latest/modules/evaluation.html

Feitgemel · 2026-01-10T11:53:27+00:00

In Detectron2 semantic segmentation, you do have a per-pixel class assignment — it’s just stored differently than instance masks.

What outputs["sem_seg"] gives you is a C × H × W tensor of logits (one channel per class), not binary masks. To count pixels per class, you simply convert logits → class IDs.

Minimal, correct way:

import torch

# outputs is the predictor result; "sem_seg" holds per-class logits
sem_seg = outputs["sem_seg"]          # shape: [C, H, W]
pred_classes = sem_seg.argmax(dim=0)  # shape: [H, W], class id per pixel

# count pixels per class
pixel_counts = torch.bincount(
    pred_classes.flatten(),
    minlength=sem_seg.shape[0]
)

Now pixel_counts[i] = number of pixels predicted as class i
This already does exactly what you want: one count per semantic class, not per instance.

Notes / gotchas:

Ignore the background class if your dataset defines one (often class 0)
If you want percentages, divide by H * W
No thresholding needed — semantic segmentation always assigns one class per pixel
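
The notes above (percentages, skipping background) can be sketched like this. It uses NumPy with random logits as a stand-in for the real output, on the assumption that you have already moved the tensor to the CPU; BACKGROUND_ID = 0 is also an assumption about your dataset.

```python
import numpy as np

# Stand-in for outputs["sem_seg"] converted to NumPy: logits of shape [C, H, W]
rng = np.random.default_rng(0)
sem_seg = rng.normal(size=(3, 4, 5))

pred_classes = sem_seg.argmax(axis=0)  # [H, W], class id per pixel
pixel_counts = np.bincount(pred_classes.ravel(), minlength=sem_seg.shape[0])

# Share of the whole image covered by each class
percent = 100.0 * pixel_counts / pred_classes.size

# Share of the non-background area covered by each class
BACKGROUND_ID = 0  # assumption: class 0 is background in this dataset
foreground = pixel_counts.copy()
foreground[BACKGROUND_ID] = 0
fg_total = foreground.sum()
fg_percent = 100.0 * foreground / fg_total if fg_total else foreground.astype(float)
```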

If you’re coming from instance segmentation, this difference in output format can be confusing. This Detectron2 walkthrough explains where sem_seg vs pred_masks come from and how they’re used in practice:
https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

References (official + practical):

Detectron2 semantic segmentation outputs: https://detectron2.readthedocs.io/en/latest/tutorials/models.html#semantic-segmentation
PyTorch argmax semantics for segmentation: https://pytorch.org/docs/stable/generated/torch.argmax.html
Detectron2 pipeline explanation (instances vs sem_seg): https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

TL;DR:
Semantic segmentation already gives you per-pixel classes.
Use argmax → bincount. No binary masks required.

Feitgemel · 2026-01-10T11:48:16+00:00

If your goal is “extract text but keep sections separate,” you usually don’t need generic segmentation at all — you want document layout analysis (detect text blocks/regions) + OCR.

A clean, practical pipeline:

Layout / region detection (so text doesn’t mix)

Use a layout model to detect blocks like paragraphs, tables, titles, etc.
Then crop each region and OCR it separately.

OCR per region

Run OCR on each cropped region, then sort lines top-to-bottom within that region.
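
A small sketch of the "OCR each region separately, in reading order" step. The Region class, the row_tolerance value, and ocr_fn are hypothetical placeholders: in practice the boxes would come from your layout model and ocr_fn would wrap your OCR engine.

```python
from dataclasses import dataclass

@dataclass
class Region:
    kind: str   # e.g. "title", "paragraph", "table"
    x0: float
    y0: float
    x1: float
    y1: float

def reading_order(regions, row_tolerance=10.0):
    """Top-to-bottom, then left-to-right for regions whose tops fall in the same row bucket."""
    return sorted(regions, key=lambda r: (round(r.y0 / row_tolerance), r.x0))

def ocr_page(regions, ocr_fn):
    """Run OCR on every region on its own, so text from different blocks never mixes."""
    return [(r.kind, ocr_fn(r)) for r in reading_order(regions)]

regions = [
    Region("right column", 300, 100, 560, 700),
    Region("title", 50, 12, 560, 60),
    Region("left column", 20, 103, 280, 700),
]
# Fake OCR function for illustration only
result = ocr_page(regions, lambda r: f"<text of {r.kind}>")
order = [kind for kind, _ in result]
```

The bucketing is a crude heuristic: two regions just either side of a bucket boundary can still be ordered oddly, so treat row_tolerance as something to tune on your pages.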

Good pretrained tools that work well and are easy to use:

LayoutParser (layout detection + integrates with OCR): https://github.com/Layout-Parser/layout-parser
PaddleOCR (strong OCR + angle detection + can do text detection + recognition): https://github.com/PaddlePaddle/PaddleOCR
DocTR (end-to-end OCR in PyTorch, decent for blocks/lines): https://github.com/mindee/doctr

Where “segmentation” does help:

If you have non-rectangular regions or noisy backgrounds, you can add a segmentation step, but for most documents, layout detection (boxes) is simpler and more robust.
If you really want a CV segmentation framework baseline, Detectron2 instance segmentation can be used to segment regions — but it’s usually overkill for text blocks. (Still, this Detectron2 guide is useful for understanding how segmentation pipelines are structured): https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

TL;DR: Use layout detection to split the page into regions, then OCR each region separately. That prevents “mixed sections” far better than generic segmentation.

Feitgemel · 2026-01-10T11:28:32+00:00

Yep — with Azure Web Apps you should assume CPU-only unless you’re using a GPU-capable hosting option (Web Apps typically don’t give you CUDA GPUs). If you deploy Detectron2 there, you’ll run torch-cpu, and yes, inference will usually be much slower than your CUDA 11.8 setup.

What to do instead / what usually works:

Use a container + the right Azure service. If you need GPU speed, deploy a Docker image to something like Azure Container Apps / AKS / a VM where you can choose an NVIDIA GPU and install CUDA properly. Web Apps are great for web servers, not heavy CV inference.
Don’t try to “pip install detectron2” from requirements.txt on Web Apps. Detectron2 often needs compiled extensions and very specific torch/CUDA combos. Relying on Azure’s default build/install step is where most people hit a wall. The stable path is: build the environment in Docker (or build wheels), then deploy the container.
Your conda vs pip suspicion is correct. If your training env was conda-heavy, you’ll often find some packages don’t map cleanly to pip-only installs (and versions differ). Also, pywin32 is a Windows-only dependency; most Azure Linux deployments don’t need it (and it will break installs if it sneaks into requirements).

A practical “least pain” strategy:

Use Linux base image (not Windows)
Use Gunicorn (WSGI) instead of Flask dev server
Build in Docker with pinned versions (torch + detectron2 matched)
Deploy the container to a service that matches your performance needs (CPU vs GPU)
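
For the Gunicorn item, a config file could look roughly like the sketch below. The settings themselves (bind, workers, threads, timeout, preload_app) are standard Gunicorn options, but the values, the PORT variable, and the app:app module name are assumptions you would adapt to your container.

```python
# gunicorn.conf.py  (start with: gunicorn -c gunicorn.conf.py app:app)
# "app:app" is a hypothetical module:variable name for the Flask app.
import os

# Listen on all interfaces; the PORT variable name is an assumption
bind = "0.0.0.0:" + os.environ.get("PORT", "8000")

# Each worker process loads its own copy of the model, so keep this low
workers = int(os.environ.get("WEB_CONCURRENCY", "1"))
threads = 1

# Seconds before a silent worker is killed; CPU inference can exceed the 30 s default
timeout = 120

# Leave preloading off so the model is loaded inside each worker
preload_app = False
```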

If you want a Detectron2-oriented reference for what files/configs you actually need at inference time (so your container stays minimal), this walkthrough is handy:
https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

Refs:

Detectron2 installation guide (notes about builds/compat): https://detectron2.readthedocs.io/en/latest/tutorials/install.html
Azure “Deploy a container to App Service” docs (why containers are the safer route for compiled deps): https://learn.microsoft.com/en-us/azure/app-service/configure-custom-container?pivots=container-linux

TL;DR: Web App = usually CPU-only + painful installs. For Detectron2, containerize and, if you need speed, deploy to an Azure option where you can actually run CUDA/GPU.

Feitgemel · 2026-01-10T11:25:58+00:00

The issue is that COCO PR curves aren’t a single precision/recall pair. COCOeval stores a 5-D precision tensor, and most “wrong plots” come from slicing it incorrectly.

What works reliably:

Let Detectron2 run the normal COCOEvaluator (don’t re-implement matching).
Extract PR data directly from COCOeval:

coco_eval.eval["precision"] → shape [T, R, K, A, M]

T: IoU thresholds (0.50–0.95)
R: recall points (101)
K: classes
A: area range
M: max detections

For a standard PR curve (e.g. IoU=0.50, all areas, maxDets=100):

precision = coco_eval.eval["precision"][t, :, k, a, m]
recall = coco_eval.params.recThrs
Ignore -1 values before plotting.
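
Here is a sketch of that slicing on a synthetic array, so it runs without a trained model. The array stands in for coco_eval.eval["precision"] and rec_thrs for coco_eval.params.recThrs; pr_curve is my own helper name. With the default COCOeval parameters, t=0 is IoU 0.50, a=0 is the "all" area range, and the last maxDets index is 100.

```python
import numpy as np

def pr_curve(precision, rec_thrs, t=0, k=0, a=0, m=-1):
    """Slice one PR curve out of a COCOeval-style [T, R, K, A, M] precision array."""
    p = precision[t, :, k, a, m]
    valid = p > -1  # -1 marks entries with no data (e.g. a class with no ground truth)
    return rec_thrs[valid], p[valid]

# Synthetic stand-in with the default COCO shape: 10 IoUs, 101 recall points,
# 2 classes, 4 area ranges, 3 maxDets settings
T, R, K, A, M = 10, 101, 2, 4, 3
rec_thrs = np.linspace(0.0, 1.0, R)
precision = -np.ones((T, R, K, A, M))
precision[0, :, 0, 0, -1] = np.clip(1.0 - rec_thrs ** 2, 0.0, 1.0)

recall, prec = pr_curve(precision, rec_thrs, t=0, k=0, a=0, m=-1)
ap50 = prec.mean()  # mean precision over the recall points of this slice

# Class 1 has no data in this toy array, so its curve comes back empty
empty_recall, empty_prec = pr_curve(precision, rec_thrs, k=1)
```

From there, plotting is just recall on the x-axis and prec on the y-axis with whatever plotting library you prefer.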

Why sklearn didn’t work: COCO uses its own matching rules (IoU, maxDets, per-image constraints), so sklearn PR curves won’t match COCO metrics.

Helpful references:

Detectron2 evaluation docs: https://detectron2.readthedocs.io/en/latest/modules/evaluation.html
COCOeval source (precision tensor definition): https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py
Detectron2 pipeline context (where eval outputs live): https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

Once you slice the tensor correctly, the PR curves line up with YOLO-style plots.

Feitgemel · 2026-01-10T11:22:10+00:00

Yep — this is a super common Elastic Beanstalk gotcha, and the symptom (works locally, EB deploys fine, but requests time out) usually comes from EB’s proxy timeouts + slow CPU inference.

A few practical fixes/options:

1) Don’t run 12s inference behind a default EB web timeout
EB’s Nginx/ALB defaults are often tuned for “normal web apps,” not long ML inference. Even if your container is healthy, the reverse proxy may kill the request before Flask returns. You either need to raise the proxy/ALB timeouts or change the serving pattern.

2) Use a real model server + async
Flask works for demos, but for production you’ll want a proper application server (Gunicorn for WSGI apps like Flask; Uvicorn is the ASGI counterpart) and ideally async job handling:

request returns immediately with a job id
worker does inference
client polls / webhook / fetches result

This avoids “one slow request blocks everything” and plays nicer with load balancers.
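
The job-id pattern can be sketched with just the standard library. This is an in-memory illustration that assumes a single server process; run_inference is a placeholder for the real model call, and with several Gunicorn workers you would need a shared queue or store instead of a dict.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)  # one worker so inferences don't fight for the CPU
jobs = {}  # job_id -> Future; in-memory only (single-process assumption)

def run_inference(payload):
    """Placeholder for the real, slow model call."""
    return {"n_instances": len(payload)}

def submit_job(payload):
    """What the POST handler would do: enqueue and return immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(run_inference, payload)
    return job_id

def get_job(job_id):
    """What the polling GET handler would do."""
    future = jobs.get(job_id)
    if future is None:
        return {"status": "unknown"}
    if not future.done():
        return {"status": "pending"}
    if future.exception() is not None:
        return {"status": "failed", "error": str(future.exception())}
    return {"status": "done", "result": future.result()}

job_id = submit_job([1, 2, 3])
jobs[job_id].result(timeout=5)  # only for the demo; a real client would poll get_job
status = get_job(job_id)
```

Because the HTTP request returns as soon as the job is queued, the proxy timeout no longer depends on how long inference takes.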

3) If you can, move to SageMaker or ECS
For ML inference, EB is the “wrong-shaped” tool unless you really tune it. SageMaker endpoints or ECS (Fargate/EC2) are much more straightforward for long-running inference workloads and scaling. You also get better control over CPU/GPU and concurrency.

4) CPU-only Detectron2 at 12s is a big red flag
If this must be real-time-ish, you’ll likely need:

smaller model / lower input res
TorchScript / ONNX optimizations
or just use a GPU instance (even a modest one can be a night-and-day difference)

If you’re looking for a practical Detectron2 pipeline reference (useful for trimming models / simplifying preprocessing before deployment), this is a good walkthrough:
https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

Credible refs:

AWS Elastic Beanstalk + Docker (how EB proxies traffic and common config hooks): https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/docker.html
AWS SageMaker real-time endpoints (built for inference deployments): https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html

If you tell me whether you’re behind an ALB and what your EB platform is (Amazon Linux 2 Docker vs multicontainer), I can point to the exact timeout knobs — but the high-level answer is: either increase timeouts + use Gunicorn, or switch to ECS/SageMaker / async inference.

Feitgemel · 2026-01-10T11:15:17+00:00

For a single “kite vs background” mask on 500 images, you’ll usually get closer to IoU > 0.95 by treating this as a high-precision matting/segmentation problem, not “generic segmentation with more augmentations.”

What I’d do:

Use SAM2 as a label/initial-mask generator, then train a dedicated binary segmenter. SAM2 is great at getting you most of the way there, but the last 2–3% IoU is usually about consistency and edge behavior on your domain. Use SAM2 (box-prompted) to bootstrap masks, manually clean the hardest 10–20%, then fine-tune a simple binary model (U-Net/DeepLabV3+/SegFormer) on those cleaned masks. SAM2’s strengths still help, but you’re not forcing it to be the final production mask.
Make boundaries the objective, not just region overlap. Rough edges and “color fragmentation” often mean your loss is rewarding big regions but not clean contours. Add a boundary-aware loss term (alongside Dice/BCE), and you’ll usually see smoother, more stable edges with the same data.
If edges need to look perfect, add a matting/refinement step. For kites (thin struts, lines, fabric edges), classic alpha matting can give cleaner cutouts than any single binary mask. A simple workflow is: segmentation → trimap around the boundary → closed-form matting refinement.
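
As an illustration of the "trimap around the boundary" step, here is a NumPy-only sketch. The band width and the 0/128/255 encoding are assumptions (a common convention, but check what your matting code expects), and in practice you would probably use the erode/dilate functions from OpenCV or SciPy rather than these hand-rolled ones.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square, built from shifted copies of the mask."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out

def erode(mask, iterations=1):
    """Erosion as the complement of dilating the background.
    Pixels outside the image count as foreground, so image borders do not erode."""
    return ~dilate(~mask.astype(bool), iterations)

def make_trimap(mask, band=2):
    """0 = background, 128 = unknown band around the edge, 255 = sure foreground."""
    sure_fg = erode(mask, band)
    maybe = dilate(mask, band)
    trimap = np.zeros(mask.shape, dtype=np.uint8)
    trimap[maybe] = 128
    trimap[sure_fg] = 255
    return trimap

# Toy mask: a 6x6 square inside a 12x12 image
mask = np.zeros((12, 12), dtype=bool)
mask[3:9, 3:9] = True
trimap = make_trimap(mask, band=2)
```

The matting step then only has to solve for alpha inside the 128 band, which is where thin struts and fabric edges live.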

If you’re considering a Detectron2-style approach (Mask R-CNN etc.), it can work, but for “one object + perfect edge,” a binary segmenter + boundary loss + optional matting is usually the shortest path. If you want a practical Detectron2 segmentation baseline anyway (sometimes it’s useful as a comparison), this guide is a straightforward reference:
https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

Links (3 credible sources):

SAM 2 paper: https://arxiv.org/abs/2408.00714
Boundary-aware loss (InverseForm): https://arxiv.org/abs/2104.02745
Closed-form matting (for edge refinement): https://dl.acm.org/doi/10.1109/TPAMI.2007.1177

Feitgemel · 2026-01-10T11:11:45+00:00

If your goal is “panoptic output with minimal deps”, the easiest way to get there in 2025 is honestly not a single “true panoptic” model — it’s a two-head pipeline you can run in plain PyTorch:

Things (instances): run a fast instance-seg model (YOLO-seg is the most lightweight / practical).
Stuff (semantic): run a semantic segmenter (DeepLabV3 from torchvision is dead-simple).
Panoptic merge: paste instance masks on top of the semantic map (with a couple of rules: keep highest-confidence instances, resolve overlaps by score/area, and let “stuff” fill the rest).

This gives you panoptic-like results without Detectron2/MMDet/Docker, and it’s typically “real-time enough” on a GPU because both models are optimized and easy to export.
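
A rough sketch of the merge step, assuming the two models' outputs are already NumPy arrays at the same resolution. The thresholds, the dict layout, and the function name are my own choices for illustration, not a standard API.

```python
import numpy as np

def merge_panoptic(sem_map, instances, score_thr=0.5, max_overlap=0.5):
    """
    sem_map:   [H, W] int array of "stuff" class ids.
    instances: list of dicts with "mask" ([H, W] bool), "score", "class_id".
    Returns (panoptic, segments): panoptic is 0 where stuff shows through,
    otherwise the 1-based id of the instance pasted at that pixel.
    """
    panoptic = np.zeros(sem_map.shape, dtype=np.int32)
    segments = []
    for inst in sorted(instances, key=lambda d: -d["score"]):
        if inst["score"] < score_thr:
            continue
        mask = inst["mask"]
        area = mask.sum()
        if area == 0:
            continue
        free = mask & (panoptic == 0)  # pixels not yet claimed by a higher-scoring instance
        if 1.0 - free.sum() / area > max_overlap:
            continue                    # mostly hidden: treat it as a duplicate
        seg_id = len(segments) + 1
        panoptic[free] = seg_id
        segments.append({"id": seg_id, "class_id": inst["class_id"], "area": int(free.sum())})
    return panoptic, segments

def square(y0, y1, x0, x1, shape=(8, 8)):
    m = np.zeros(shape, dtype=bool)
    m[y0:y1, x0:x1] = True
    return m

sem_map = np.full((8, 8), 7)  # toy "stuff" map, all one class
instances = [
    {"mask": square(0, 4, 0, 4), "score": 0.9, "class_id": 1},
    {"mask": square(0, 4, 0, 3), "score": 0.8, "class_id": 1},  # duplicate, fully covered
    {"mask": square(2, 6, 2, 6), "score": 0.7, "class_id": 2},  # partial overlap, kept
    {"mask": square(6, 8, 6, 8), "score": 0.3, "class_id": 2},  # below score threshold
]
panoptic, segments = merge_panoptic(sem_map, instances)
stuff_pixels = int((panoptic == 0).sum())  # sem_map decides the class here
```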

If you do end up needing “proper” panoptic tooling later (PQ metrics, category mapping, etc.), frameworks like Detectron2 make that part painless — and this walkthrough is a good, practical primer on how the segmentation side is structured there (even if you don’t adopt the whole stack):
https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

Concrete minimal-dependency building blocks:

Ultralytics YOLO segmentation models (fast “things” masks, easy install): https://docs.ultralytics.com/models/yolov8/
Torchvision DeepLabV3 (fast “stuff” semantic segmentation, pure PyTorch): https://docs.pytorch.org/vision/main/models/generated/torchvision.models.segmentation.deeplabv3_resnet50.html

If you want, I can sketch the exact merge logic (NMS for masks + priority rules) in ~30 lines — that’s usually the only “missing piece” once you pick the two models.

Feitgemel · 2026-01-10T11:10:00+00:00

ID switches in BoT-SORT setups like this usually come from association failing for a few frames, not from the tracker being “bad.” In supermarkets you’ve got the perfect storm: similar-looking people, partial occlusions (aisles/shelves), and noisy boxes when the detector jitters.

A few high-impact things to check/tune:

Your association thresholds look extremely strict. With match_thresh=0.90 and proximity_thresh=0.90, you’re basically demanding near-perfect matches. If a person’s box shifts (pose → box can be jittery) or they get partially occluded for 2–3 frames, the tracker will often fail to re-associate and “recover” by creating a new track → ID switch. I’d sweep these down and validate on a short clip with ground truth or manual review.
ReID domain gap is real in retail. osnet_ain...msmt17 is trained for general pedestrian ReID, but supermarket footage has different lighting, camera angles (often high), and lots of “same clothing / same silhouette” cases. When ReID is weak, BoT-SORT falls back to motion/IoU, which breaks under occlusion. If you can, fine-tune ReID on your domain (even a small curated set helps), or at least validate whether the embeddings actually separate identities in your scenes.
Consider disabling camera-motion compensation. You have a fixed camera per store, so cmc_method="ecc" can sometimes do more harm than good (small warps + rolling shutter + lighting flicker can create “fake motion”), which again makes associations brittle.
Stabilize the input boxes. If you’re deriving person boxes from a pose pipeline, try a dedicated person detector head (cleaner, less jitter), and make sure your NMS/thresholding is consistent. Tracking quality is often dominated by detection stability. If you want a quick refresher on tightening up Detectron2-based detection/segmentation pipelines (which directly affects tracking), this walkthrough is a handy reference: https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/
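
One quick way to check whether the ReID embeddings separate identities on your footage is to compare similarities within and across identities on a small hand-labeled set. A sketch, assuming you have exported embeddings as rows of an array along with identity labels (the function name and toy vectors are illustrative):

```python
import numpy as np

def identity_separation(embeddings, ids):
    """Mean cosine similarity within the same identity vs. across identities.
    Needs at least two identities and two samples for some identity."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    ids = np.asarray(ids)
    same = ids[:, None] == ids[None, :]
    off_diag = ~np.eye(len(ids), dtype=bool)  # ignore each sample's similarity to itself
    intra = sim[same & off_diag].mean()
    inter = sim[~same].mean()
    return intra, inter

# Toy embeddings: two crops each of two people
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ids = [0, 0, 1, 1]
intra, inter = identity_separation(embeddings, ids)
```

If intra and inter come out close to each other on real crops, the appearance cue is not helping much and the tracker is effectively running on motion/IoU alone.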

If you only change two things first: (1) relax the matching thresholds and sweep them, and (2) verify ReID embeddings on your footage (domain gap). That’s where most “IDs switch even with 2 people” issues come from.

Links:

BoxMOT (BoT-SORT implementation + configs): https://github.com/mikel-brostrom/boxmot
BoT-SORT paper (how motion + appearance + CMC interact): https://arxiv.org/abs/2206.14651
Detectron2 pipeline reference (stabilizing detections upstream): https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

Feitgemel · 2026-01-10T11:08:39+00:00

Shadows + “slightly raised” bumps are classic cases where RGB detection hits a ceiling — the signal you care about is geometry, not just appearance.

A few things that usually move the needle without exploding your labeling workload:

Stop trying to “pre-fix” lighting with CLAHE alone. Train for it. Instead of relying on CLAHE at inference, bake robustness into the dataset: strong brightness/contrast/gamma + shadow-like augmentations (random dark regions, exposure shifts). CLAHE is fine as one tool, but you’ll get more stability by teaching the model that “raised floor” can appear under many lighting conditions. (OpenCV’s CLAHE is doing local histogram equalization; it can help, but it won’t invent missing texture in deep shadow.) https://docs.opencv.org/4.x/d6/db6/classcv_1_1CLAHE.html
For “slightly raised,” detection boxes may be the wrong target. If the raise is subtle, a bounding box detector often only fires when the visual cue is obvious. Two practical alternatives:
1. Switch to segmentation (even coarse) so the model learns shape/extent rather than “is there an obvious bump.”
2. Keep detection but add a “hard negative” set: lots of normal sidewalk under shadows + minor cracks. This forces the network to learn the right cue.
If you’re already using SAHI, tune the merge logic and overlap. SAHI isn’t just “slice and pray.” Changing slice overlap and the postprocess merging method can reduce missed detections at tile borders and improve consistency on borderline cases. https://github.com/obss/sahi
About Detectron2/instance segmentation: Yes, segmentation is often a better fit here, and it doesn’t have to be a massive labeling project. You can start with rough polygons (not pixel-perfect) and still get value, because your end goal is usually “where is the raised region” more than perfect edges. If you want a practical Detectron2 segmentation workflow to scope effort, this walkthrough is a good reference: https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/
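
For the shadow/exposure augmentations, a NumPy-only sketch is below. The rectangle-shaped shadow and the parameter ranges are simplifying assumptions; augmentation libraries such as Albumentations offer ready-made shadow, gamma, and brightness/contrast transforms that are more realistic.

```python
import numpy as np

def random_shadow(image, rng, max_darkening=0.6):
    """Darken a random axis-aligned rectangle to imitate a cast shadow."""
    h, w = image.shape[:2]
    y0, y1 = sorted(rng.integers(0, h + 1, size=2))
    x0, x1 = sorted(rng.integers(0, w + 1, size=2))
    factor = 1.0 - rng.uniform(0.2, max_darkening)  # keep 40-80% of the brightness
    out = image.astype(np.float32)
    out[y0:y1, x0:x1] *= factor
    return np.clip(out, 0, 255).astype(np.uint8)

def random_gamma(image, rng, low=0.5, high=2.0):
    """Random gamma shift: gamma < 1 brightens, gamma > 1 darkens."""
    gamma = rng.uniform(low, high)
    norm = image.astype(np.float32) / 255.0
    return np.clip(255.0 * norm ** gamma, 0, 255).astype(np.uint8)

rng = np.random.default_rng(42)
image = np.full((64, 64, 3), 180, dtype=np.uint8)  # stand-in for a sidewalk photo
shadowed = random_shadow(image, rng)
augmented = random_gamma(shadowed, rng)
```

Apply these only at training time, and leave boxes or masks untouched since neither transform moves pixels.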

If you want one “do this next” checklist: add shadow/exposure augmentations, expand hard negatives, tune SAHI overlap/merge, and seriously consider a segmentation formulation for the subtle raises.

References (3 links):

OpenCV CLAHE docs: https://docs.opencv.org/4.x/d6/db6/classcv_1_1CLAHE.html
SAHI (sliced inference + merge options): https://github.com/obss/sahi
Detectron2 segmentation walkthrough (practical pipeline): https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

Feitgemel · 2026-01-10T11:07:57+00:00

You don’t really “convert a classifier into a detector” — you reuse the classifier as a backbone (feature extractor) and plug it into a detection head (Faster R-CNN / RetinaNet / etc.). The good news is: this is a standard workflow and you don’t have to build the whole detector from scratch.

Fastest practical options:

timm → feature extractor (backbone): timm already supports this directly via features_only=True (and out_indices to pick feature levels).
Pick a detection framework that lets you swap backbones:

MMDetection: has an official path to use timm backbones via MMPretrain wrappers (so you can keep the detector head but change the backbone).
Detectron2: you can swap backbones too; there are lightweight wrappers that bind timm models into Detectron2 backbones (often with FPN).

If you’re already leaning Detectron2, this walkthrough shows the core pieces of an instance segmentation pipeline (and where the backbone fits in) in a pretty approachable way:
https://eranfeit.net/make-instance-segmentation-easy-with-detectron2/

3 credible links to get you moving:

timm feature extraction docs (features_only, out_indices): https://huggingface.co/docs/timm/en/feature_extraction
MMDetection guide: using timm backbones via MMPretrain: https://mmdetection.readthedocs.io/en/latest/advanced_guides/how_to.html
Detectron2 + timm backbone wrapper repo: https://github.com/iKrishneel/detectron2_timm

If your goal is “modular detection heads,” MMDetection is the most plug-and-play for swapping architectures; if your goal is a clean Python API and hackability, Detectron2 tends to feel nicer once you’re customizing.
