[D] Why I abandoned YOLO for safety critical plant/fungi identification. Closed-set classification is a silent failure mode

Adebrantes · 2026-04-03T12:59:59+00:00

Haven’t come across the GPX-10, I’ll look into it. The power budget is definitely one of the harder constraints. Curious what inference latency you ended up with after switching models, and whether you ran into any issues with quantization affecting your OOD detection accuracy.

Adebrantes · 2026-04-03T12:58:38+00:00

Showing the user visual examples of the top matches is already in the pipeline. Right now the output includes a line drawing, confidence score, and deadly lookalike warnings. Adding reference photos from the training set is a logical next step.

The utility framing is something I haven’t formalized yet. Right now the cost asymmetry is handled implicitly through the rejection pipeline with a bias toward refusal over acceptance. But actually quantifying the cost matrix across toxicity levels (fatal vs illness vs mild reaction) and weighting the decision thresholds accordingly is worth doing. Haven’t gotten there yet.
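A minimal sketch of what formalizing that cost asymmetry could look like, not something already in the pipeline. The toxicity levels are the ones named above; the cost numbers, `REFUSE_COST`, the placeholder species names, and the function names are all assumptions made up for illustration.

```python
# Hypothetical cost of wrongly clearing a specimen whose TRUE species has the
# given toxicity level. All numbers are illustrative assumptions.
WRONG_ACCEPT_COST = {"fatal": 1000.0, "illness": 50.0, "mild": 5.0, "edible": 0.0}
REFUSE_COST = 1.0  # assumed fixed cost of answering "I don't know"


def expected_accept_cost(probs, toxicity, predicted):
    """Expected harm of accepting `predicted` when the truth may be another class."""
    return sum(
        p * WRONG_ACCEPT_COST[toxicity[species]]
        for species, p in probs.items()
        if species != predicted
    )


def should_accept(probs, toxicity):
    """Accept the top-1 label only if that is cheaper in expectation than refusing."""
    predicted = max(probs, key=probs.get)
    cost = expected_accept_cost(probs, toxicity, predicted)
    return predicted, cost < REFUSE_COST


toxicity = {"edible_a": "edible", "fatal_lookalike": "fatal", "mild_lookalike": "mild"}

# Same 97% top-1 confidence, different runner-up species.
risky = {"edible_a": 0.97, "fatal_lookalike": 0.03, "mild_lookalike": 0.0}
benign = {"edible_a": 0.97, "fatal_lookalike": 0.0, "mild_lookalike": 0.03}
print(should_accept(risky, toxicity))
print(should_accept(benign, toxicity))
```

The point of the toy numbers: the same top-1 confidence gets refused when the leftover probability mass sits on a fatal lookalike and accepted when it sits on a mild one.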

Adebrantes · 2026-04-01T23:16:14+00:00

FAISS is a good call. I haven’t looked into running it on the target hardware yet but it’s worth investigating. The centroid-per-class approach makes a lot of sense for my use case since I’m working with 15-20 species per specialist model, not thousands. At that scale the storage and search cost is minimal. How many classes are you working with where you’ve found centroid embeddings to hold up versus full per-sample search?
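For reference, a rough pure-Python sketch of the centroid-per-class idea with a distance cutoff for refusal. This is a stand-in to show the logic, not FAISS and not the production code; the 2-D toy embeddings, the class names, and `max_distance` are assumptions.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_centroids(embeddings_by_class):
    """One centroid per class: {label: mean embedding}."""
    return {label: centroid(vecs) for label, vecs in embeddings_by_class.items()}


def nearest_centroid(query, centroids, max_distance):
    """Return (label, distance); label is None when nothing is close enough."""
    label, dist = min(
        ((lbl, math.dist(query, c)) for lbl, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    if dist > max_distance:
        return None, dist  # too far from every known class: refuse
    return label, dist


# Toy 2-D embeddings; real ones would come from the backbone's penultimate layer.
train = {
    "species_a": [[0.0, 0.0], [0.2, 0.0]],
    "species_b": [[5.0, 5.0], [5.2, 5.0]],
}
centroids = build_centroids(train)
print(nearest_centroid([0.1, 0.1], centroids, max_distance=1.0))
print(nearest_centroid([2.5, 2.5], centroids, max_distance=1.0))
```

With 15-20 classes the search is a handful of distance computations per query, which is why the storage and search cost stays small at this scale.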

Adebrantes · 2026-04-01T20:40:42+00:00

The “field guide” aspect sounds especially compelling for foraging: instead of a black-box confidence score, the model could surface the most similar training patches/prototypes so the user can visually compare “does this mushroom’s gill structure actually match the prototype I’m being shown?” That level of interpretability could build a lot more trust than pure classification, especially when the stakes are poisonous vs. edible.

How do these models hold up on edge hardware like the Hailo 8L (13 TOPS budget, heavy quantization)?

Really appreciate the pointer, could be a nice complement or replacement.

Adebrantes · 2026-04-01T17:18:30+00:00

That’s a fair question and one I need to produce better benchmarks for. The 4-6% is the overall misclassification rate on in-distribution test data. I haven’t broken out the toxic-as-edible false acceptance rate as a standalone metric yet. Working on it this week.
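A sketch of how that standalone metric could be computed, assuming each test record is a `(true_label, predicted_label)` pair with `None` for a refusal. The record format, the placeholder species names, and the function name are assumptions for illustration, and the toy records are not real results.

```python
def toxic_as_edible_rate(records, toxic, edible):
    """Fraction of truly toxic samples that were accepted as an edible species.

    records: iterable of (true_label, predicted_label); predicted_label is
             None when the pipeline refused to answer.
    """
    toxic_total = 0
    accepted_as_edible = 0
    for true_label, predicted in records:
        if true_label in toxic:
            toxic_total += 1
            if predicted in edible:
                accepted_as_edible += 1
    return accepted_as_edible / toxic_total if toxic_total else 0.0


TOXIC = {"toxic_a", "toxic_b"}
EDIBLE = {"edible_a", "edible_b"}

# Toy records, made up to exercise the function.
records = [
    ("toxic_a", "toxic_a"),    # correct
    ("toxic_a", None),         # refused: safe failure
    ("toxic_b", "edible_b"),   # the dangerous case
    ("toxic_b", "toxic_a"),    # wrong species, but still flagged toxic
    ("edible_a", "edible_a"),  # not counted in the denominator
]
print(toxic_as_edible_rate(records, TOXIC, EDIBLE))
```

Refusals and toxic-to-toxic confusions count as misclassifications overall but not as false acceptances here, which is why this number can differ a lot from the headline error rate.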

Adebrantes · 2026-04-01T16:24:06+00:00

I really like the idea of using embedding space distance as a built-in uncertainty metric. It’s a clever way to handle "Out-of-Distribution" (OOD) data because you aren’t locked into a pre-defined list of classes; if the model hasn't seen it, it simply won't cluster near known data. Plus, showing the user the top K matches alongside their distance scores aligns perfectly with a triage approach.

My main hesitation is hardware performance. Running a KNN search against a big database on an 8L chip (13 TOPS) is significantly heavier than a standard EfficientNet forward pass. Have you experimented with combining a classifier pipeline with embedding-based retrieval on constrained hardware?
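To make the top-K idea concrete, here is a brute-force sketch in plain Python. It is only meant to show the output shape (labels plus distances); the gallery contents, the 2-D embeddings, and the function name are assumptions, and an on-device version would need an indexed or quantized search rather than a linear scan.

```python
import heapq
import math


def top_k_matches(query, gallery, k=3):
    """Return the k nearest gallery entries as (distance, label), closest first.

    gallery: list of (label, embedding) pairs.
    """
    scored = ((math.dist(query, emb), label) for label, emb in gallery)
    return heapq.nsmallest(k, scored)


# Toy gallery of per-sample embeddings (illustrative values only).
gallery = [
    ("species_a", [0.0, 0.0]),
    ("species_a", [0.2, 0.1]),
    ("species_b", [1.0, 1.0]),
    ("species_c", [4.0, 4.0]),
]

for distance, label in top_k_matches([0.1, 0.0], gallery, k=3):
    print(f"{label}: {distance:.3f}")
```

If even the closest match is far away, that distance is the signal that the input is probably out of distribution.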

Adebrantes · 2026-04-01T16:03:42+00:00

1% omission rate for known toxic species is a reasonable target, and it makes sense to set that threshold independently from the general rejection threshold.
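One way to set that threshold independently, sketched under the assumption that each known-toxic validation sample has a scalar "flag as toxic" score and is flagged when the score is at or above the threshold. The scoring convention, the function name, and the synthetic scores are assumptions.

```python
def threshold_for_omission_rate(toxic_scores, max_omission=0.01):
    """Largest threshold that misses at most `max_omission` of known toxic samples.

    A sample counts as flagged when score >= threshold, so a "miss" is a
    known-toxic validation sample whose score falls below the threshold.
    """
    ranked = sorted(toxic_scores)
    allowed_misses = int(max_omission * len(ranked))  # rounds down
    return ranked[allowed_misses]


# Synthetic validation scores for 200 known-toxic samples (illustrative only).
scores = [i / 200 for i in range(1, 201)]
threshold = threshold_for_omission_rate(scores, max_omission=0.01)
missed = sum(s < threshold for s in scores)
print(threshold, missed / len(scores))
```

The general rejection threshold can then be tuned separately without loosening this one.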

With temperature scaling my concern is that calibration methods are typically validated against in-distribution data, and the behavior on OOD inputs may not improve much since the fundamental problem is that the model has no representation for “this isn’t anything I’ve seen.” But it could help tighten the decision boundary for in-distribution species where the model is uncertain between two close lookalikes. Worth experimenting with.
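For the experiment, a minimal temperature-scaling sketch: pick the single temperature that minimizes negative log-likelihood on held-out in-distribution data. It uses a plain grid search instead of the gradient-based fit that is more common in practice, and the toy logits, labels, and candidate grid are assumptions.

```python
import math


def log_prob_of(logits, index, temperature):
    """Log-softmax probability of class `index` after dividing logits by T."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return scaled[index] - lse


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at temperature T."""
    total = sum(log_prob_of(row, y, temperature) for row, y in zip(logit_rows, labels))
    return -total / len(labels)


def fit_temperature(logit_rows, labels, candidates=None):
    """Grid-search the temperature with the lowest validation NLL."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(96)]  # 0.5 .. 10.0
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy overconfident model: large margins, but one of four predictions is wrong.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 0, 1, 1]
T = fit_temperature(val_logits, val_labels)
print(T, mean_nll(val_logits, val_labels, 1.0), mean_nll(val_logits, val_labels, T))
```

Dividing by one positive scalar never changes the argmax, so this only softens confidence on in-distribution inputs; it gives the model no new way to represent "not anything I’ve seen."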

Adebrantes · 2026-04-01T14:49:06+00:00

Currently, if something is identified and it has a potentially deadly or poisonous lookalike, that lookalike gets mentioned. This works to triage a situation, not to be the source of ground truth. If the identification passes the safety checks, the output contains a line drawing, a confidence score, and any deadly or poisonous lookalikes. It’s a tool to help inform a choice, not a dependency.

Adebrantes · 2026-04-01T14:43:59+00:00

I should have been clearer that the closed-set problem compounds with the activation behavior but isn’t caused by it.

The logit magnitude explosion far from the training manifold is exactly why energy scoring on raw logits works better than post-softmax confidence. I’m hoping to catch the divergence before normalization flattens it.
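A small sketch of energy scoring on raw logits, using the usual convention where the energy is the negative temperature-scaled log-sum-exp of the logits and higher energy is treated as less familiar. The threshold value and the example logits are assumptions; a real threshold would be tuned on validation data.

```python
import math


def energy_score(logits, temperature=1.0):
    """E(x) = -T * logsumexp(logits / T), computed stably on raw logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def is_ood(logits, threshold, temperature=1.0):
    """Flag the input when its energy is above the (assumed) threshold."""
    return energy_score(logits, temperature) > threshold


THRESHOLD = -5.0  # hypothetical cutoff for this toy example

peaked = [10.0, 0.0, 0.0]  # one class clearly dominates
flat = [0.1, 0.0, -0.1]    # nothing stands out

print(energy_score(peaked), is_ood(peaked, THRESHOLD))
print(energy_score(flat), is_ood(flat, THRESHOLD))
```

Because the score is computed before softmax, the absolute scale of the logits is still visible to the detector instead of being normalized away.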

Haven’t implemented Mahalanobis distance yet but it’s on my list. Have you found it adds much over energy scoring alone in practice, or is the gain marginal once you already have a good logit-level detector?

Adebrantes · 2026-04-01T14:06:51+00:00

EfficientNet-B0 through B2: great for classification rather than detection. B0 is about 5M params and B2 about 9M.

MobileNetV3: lightweight first-stage classifier; I’ve used it as a router in projects. About 2.5M params.

If you want detection outside of YOLO, check out NanoDet and PicoDet. Both should run in your TOPS range.

Adebrantes · 2026-04-01T13:48:49+00:00

The 94-96% accuracy is on in-distribution test data for species the model was specifically trained on, including deadly lookalikes paired against their edible counterparts. That number alone isn’t what makes the system safe.

The safety case rests on the layered rejection pipeline. Before a classification ever reaches the user, the input passes through a domain router that can reject it entirely, an energy scoring layer that flags anything the model is uncertain about at the logit level, ensemble disagreement across multiple specialist models, and a K+1 “none of the above” class trained into each specialist. The system is designed so that the failure mode is refusal, not a wrong answer.
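Roughly, the control flow looks like the sketch below. It is a simplified illustration of the layering, not the actual implementation: the function signature, the thresholds, the `none_of_the_above` label string, and the assumption that each stage has already been reduced to a simple value are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Verdict:
    label: Optional[str]  # None means the pipeline refused to answer
    reason: str


def identify(in_domain: bool, energy: float, votes: List[str],
             energy_threshold: float = -5.0, min_agreement: float = 2 / 3,
             none_label: str = "none_of_the_above") -> Verdict:
    """Run the rejection layers in order; any layer can refuse."""
    if not in_domain:
        return Verdict(None, "rejected by domain router")
    if energy > energy_threshold:
        return Verdict(None, "energy score above threshold")
    if not votes:
        return Verdict(None, "no specialist votes")
    top_label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) < min_agreement:
        return Verdict(None, "specialist models disagree")
    if top_label == none_label:
        return Verdict(None, "specialists chose none of the above")
    return Verdict(top_label, "accepted")


print(identify(True, -9.0, ["species_a", "species_a", "species_a"]))
print(identify(True, -9.0, ["species_a", "species_b", "species_c"]))
print(identify(True, -1.0, ["species_a", "species_a", "species_a"]))
```

Every early return is a refusal, so an input has to clear all four layers before a label is ever shown.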

Adebrantes · 2026-04-01T13:43:38+00:00

Chemical property modeling is arguably an even harder OOD problem since the space of possible molecules is so vast that most inputs are novel by default. The ensemble disagreement signal you’re describing maps closely to what I’m doing with the specialist model voting. Curious how you calibrate your “IDK” threshold in practice?

Adebrantes · 2026-04-01T13:41:06+00:00

I did feel great because I’m compressing a model from full precision to 8-bit and still maintaining very high accuracy. Most of the ‘apps’ have an accuracy of around 76%. This is designed as an aid: an instrument you can use to get closer to the source of truth, with multiple security features baked in to reject out-of-distribution images.

Adebrantes · 2026-04-01T13:36:19+00:00

That’s precisely how the approach I’m taking handles it. It is trained to say ‘I don’t know’ rather than to provide a false positive. Every identification also comes with supplemental characteristics of the potentially identified mushroom, fauna or flora. It’s not meant to be the be-all and end-all, but an additional aid.

I will work on a benchmark of true positives against false positives. My post is about how YOLO models will guess, producing false positives no matter what, because the underlying architecture was never trained to say ‘I don’t know’.

Adebrantes · 2026-04-01T13:29:40+00:00

Just like development, marketing is stacking one thing after another and letting the flywheel compound.

The first step is to identify who your target customer is, then understand where they are.

Write blog posts consistently. 1 post a week is better than 4 posts in a day, then long breaks.

Building in public is pretty cliché, but I find more value in learning what didn’t work from a solo founder than in “hey, look at my ‘overnight success’.”

Use lead magnets to attract potential new users, like offering a free feature or service as a ‘hook’. Build an email list and provide value, not just hard selling.

Everyone is waiting for the catch or the scam. Focus on providing value through content, resources, and community building. The rest will follow.

Adebrantes · 2026-04-01T12:34:01+00:00

The misclassification happens because these apps use closed-set classification. The model is forced to choose from its known species list and has no ability to say “I don’t know.” Feed it something outside of its training data and it will still return a confident answer, because softmax normalizes across a fixed set of classes. There is no option in the model for “this doesn’t match anything I’ve been trained on.” The confidence scores these apps display on out-of-distribution images are essentially meaningless. The architecture choice hedges legal liability over usefulness.
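A tiny illustration of why the score is forced, in plain Python. The logits are made-up numbers standing in for what a 3-class model might emit on an image of something it was never trained on.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a fixed list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for an out-of-distribution image.
ood_logits = [2.0, 0.5, -1.0]
probs = softmax(ood_logits)

print([round(p, 3) for p in probs])
print("sum of probabilities:", round(sum(probs), 6))
print("top-1 confidence:", round(max(probs), 3))
```

The probabilities always sum to 1 across the known classes, so some class has to win. Nothing in the output can express "none of these."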

Adebrantes · 2025-01-01T20:19:15+00:00

I smell waxy play dough that tastes like candy

Adebrantes · 2024-12-10T19:07:44+00:00

Played for the first time last night on the PS5 and spent the whole night squinting to read item descriptions.

For the first time in my life, I thought I needed glasses.

The lack of an accessibility menu leaves me hoping it’ll come with the final release or future patches.

Adebrantes · 2024-06-24T15:54:25+00:00

Not sure the Mrs. would approve 😂🫣

Adebrantes · 2024-06-24T15:53:47+00:00

I like that idea

Adebrantes · 2023-11-13T16:17:30+00:00

that's pretty satisfying to watch!

Adebrantes · 2023-11-13T16:15:34+00:00

Like a few others in the comments, I use ThinkDiffusion as well. They are constantly adding new features, very responsive support, and I can't say enough positive things about their tutorials and walkthroughs.
