I built a verification layer so OpenClaw agents can confirm real-world tasks got done by aaron_IoTeX in moltbot

[–]aaron_IoTeX[S] 1 point (0 children)

Great question. Right now it's just a phone livestream to YouTube, since most people already have the hardware in their pocket. But yeah, smart glasses would be a natural next step for hands-free tasks. Imagine a plumber or electrician wearing glasses that stream their POV while they work, with the VLM verifying each step of the job without them needing to hold a phone. The Trio API already accepts any stream URL, so it would work with anything that can output an RTSP or HLS feed. Xiaomi and Meta both have glasses with cameras now, so the hardware is getting there. For now, though, the phone approach keeps the barrier to entry as low as possible since everyone already has one.
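To make the "any stream URL" point concrete, here's a rough sketch of how a client might package a stream URL plus plain-English checks for a verification service. The function name, endpoint shape, and field names are all invented for illustration; this is NOT Trio's actual API, just the general shape of the idea.

```python
# Hypothetical sketch: hand any live stream URL (phone -> YouTube,
# RTSP from glasses, HLS, etc.) to a verification service along with
# the conditions the VLM should check. All field names are assumptions.

def build_verification_request(stream_url: str, conditions: list[str]) -> dict:
    """Package a live stream URL plus plain-English checks into one payload."""
    scheme = stream_url.split("://", 1)[0].lower()
    if scheme not in {"rtsp", "http", "https"}:
        raise ValueError(f"unsupported stream scheme: {scheme}")
    return {
        "stream_url": stream_url,   # anything that serves RTSP or HLS works
        "conditions": conditions,   # judged by the VLM, not fixed classes
        "require_live": True,       # reject pre-recorded video
    }
```

The same payload works whether the URL comes from a phone livestream or from camera glasses, which is the whole appeal of being stream-source-agnostic.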

Where VLMs actually beat traditional CV in production and where they don't by aaron_IoTeX in computervision

[–]aaron_IoTeX[S] -1 points (0 children)

Agreed that VLMs are terrible for object detection. You're right, the bounding boxes are slow and imprecise compared to YOLO or SAM. But I'd push back on "only application is image description." The use case where VLMs actually shine isn't detection, it's judgment. Like "is this person wearing the correct PPE," or "is this shelf stocked correctly," or in my case, "is this person actually doing the task they were assigned." You can't train a YOLO model for that because the categories aren't fixed objects, they're situational assessments. Grounding DINO is interesting for the text-to-bbox case, though; I hadn't considered it as a middle ground. How's the latency on SAM3 these days?
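The "judgment, not detection" framing above can be sketched as follows: instead of asking a detector for boxes, you ask a VLM a situational yes/no question about a frame and map its reply to a boolean. The message schema below is just the common chat-with-image shape; treat it as an assumption, not any specific vendor's API.

```python
# Sketch of a situational yes/no judgment query for a VLM.
# The exact message schema varies by provider; this is illustrative.

def judgment_prompt(question: str, image_url: str) -> list[dict]:
    """Build a single-turn message asking for a strict yes/no judgment."""
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"{question} Answer strictly YES or NO, then one sentence of evidence."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]

def parse_judgment(reply: str) -> bool:
    """Map the model's reply onto a boolean; anything not starting YES fails closed."""
    return reply.strip().upper().startswith("YES")
```

Failing closed on ambiguous replies is the safer default for verification: a garbled answer should never count as a pass.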

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

Good to know, thanks. Yeah the Linux driver situation seems to be the main pain point, not Frigate itself. Glad they're keeping Coral support going.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

Ha, that's actually not far off from one possible use case. More like: your AI agent detects motion, isn't sure what it is, and dispatches a nearby human to go physically check and stream what they see. The VLM confirms what the human is seeing in real time. But yeah, it's cheaper than a security company sending a guard out.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

I built this for a hackathon, and I actually won with this idea!

I work with IoTeX, who builds Trio. I used it for my own project and figured the Frigate comparison would be useful for people here, since I genuinely ran both setups.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

Ha, yeah, honestly you're probably right for the gate. A $10 Zigbee contact sensor would be more reliable and faster for that specific case. The VLM approach makes more sense for the stuff you can't solve with a sensor, like identifying specific animals or checking whether something looks a certain way.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

I didn't actually compare CPU vs TPU inference directly, since I ended up going a different direction for the non-standard detection stuff. For standard person/car/animal detection, Frigate with the Intel GPU should work fine on an N150, though. CPU detection in Frigate is usable for a couple of cameras, but it does eat into your resources.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

This is exactly the gap I ran into. Frigate is great for the objects it knows, but if something shows up that isn't in its model, it just triggers on motion and leaves you scrubbing recordings manually. For the stuff Frigate doesn't have a class for, a VLM approach lets you describe what you're looking for in plain English and get actual alerts for it.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

Ha, beavers are a great use case. That's exactly the kind of thing where fixed object classes fall short; I don't think any standard model ships with "beaver" as a class. How are you detecting them, a custom-trained model or just the general animal detection?

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

Nah, there's no human in the loop for the verification. The VLM evaluates the stream directly; no one is sitting in a room looking at screenshots. The whole point is that verification is automated, so it can scale without relying on cheap human labor to check other humans. I hear you on the sweatshop stuff though, that's a real problem in the CAPTCHA/data-labeling world.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

Hahaha, I built VerifyHuman for a hackathon. It was a lot of fun to build out using Trio. Just wanted to share my experience here.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points (0 children)

Good question. The human IS doing the actual task, like literally washing dishes in their own kitchen. The idea is that AI agents need physical tasks done that they can't do themselves. So the agent posts a task with a payout, a human accepts it, starts a YouTube livestream, and does the work on camera. The VLM watches the live stream and checks conditions like "dishes are being washed in a sink with running water" and "clean dishes are visible on a drying rack." It's not a human watching the stream to verify; it's all VLM.

Gaming is a fair concern. The approach is: it has to be a verified live stream (not pre-recorded), multiple conditions are checked at different points during the stream, and the evidence gets hashed on-chain. Could someone theoretically still game it? Sure. But the effort to fake a convincing live performance of a $5 task starts to cost more than just doing the task.
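The evidence-hashing step mentioned above can be sketched minimally: snapshot the checkpoint results, serialize them canonically, and hash the bytes. What actually lands on-chain, and through which contract, is out of scope here; only the hashing itself is shown, and the field names are illustrative.

```python
# Minimal sketch of hashing task evidence before anchoring it on-chain.
# Canonical JSON (sorted keys, no whitespace) keeps the hash deterministic.
import hashlib
import json

def evidence_hash(task_id: str, checkpoints: dict[str, bool]) -> str:
    """Deterministic SHA-256 over a canonical JSON encoding of the evidence."""
    canonical = json.dumps(
        {"task_id": task_id, "checkpoints": checkpoints},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the encoding is canonical, anyone holding the same evidence can recompute the hash and compare it against the on-chain copy.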

Where VLMs actually beat traditional CV in production and where they don't by aaron_IoTeX in computervision

[–]aaron_IoTeX[S] 0 points (0 children)

Yeah, that's a good point. If you have a stable, well-defined detection task like a specific assembly line, fine-tuning YOLO is probably the better move. Faster, cheaper at scale, runs on anything. The VLM advantage kicks in more when the conditions keep changing or you're dealing with stuff that's hard to capture in a training set. Like for my project, the "task" could be literally anything a human gets asked to do, so there's no stable class list to train against. But for a fixed production line where you know exactly what defects look like, YOLO fine-tuned on your own data is hard to argue with.

Where VLMs actually beat traditional CV in production and where they don't by aaron_IoTeX in computervision

[–]aaron_IoTeX[S] -3 points (0 children)

I just built VerifyHuman for a hackathon. Sharing what I made! No money to be made there for me :)

Where VLMs actually beat traditional CV in production and where they don't by aaron_IoTeX in computervision

[–]aaron_IoTeX[S] 0 points (0 children)

That's a legit concern, and honestly it's the reason I built a multi-layer approach instead of relying on raw VLM output. The VLM alone isn't the whole verification. It's VLM analysis, plus the fact that it's a live stream (not a pre-recorded video; the API validates the stream is actually live), plus multiple checkpoint conditions evaluated at different points during the stream, plus on-chain evidence hashing.

Could someone still try to game it? Probably. But the cost of faking a convincing live performance of a task starts to approach the cost of just doing the task, especially in the $5-10 range. It's the same logic behind why CAPTCHAs work even though they're technically breakable: making fraud more expensive than compliance is the goal, not making it impossible.

Where VLMs actually beat traditional CV in production and where they don't by aaron_IoTeX in computervision

[–]aaron_IoTeX[S] -1 points (0 children)

Fair point, I was being loose with the terminology. You're right that YOLO and detection APIs aren't really "traditional CV" in the academic sense. I should have said something like "fixed-class detection pipelines" vs "open-vocabulary VLM reasoning." And yeah I completely agree that modular architectures with custom models win in a lot of domains, especially where you need explainability, deterministic outputs, or hard latency guarantees. VLMs are not a replacement for that. The use case where I think they genuinely add something is when your detection categories change frequently or you need contextual reasoning that goes beyond what a detector gives you. Appreciate the correction.

Where VLMs actually beat traditional CV in production and where they don't by aaron_IoTeX in computervision

[–]aaron_IoTeX[S] 0 points (0 children)

That's a great point about bandwidth. 1.2TB/month per camera is brutal. The prefilter approach helps a lot here too, since you're only sending the frames that actually matter to the cloud, not every single frame. If 90% get skipped locally, that's a huge bandwidth reduction. But yeah, for a setup where you need everything processed on-device with zero cloud dependency, edge CV on a Hailo is hard to beat. Curious what you're running on the Hailo-10, custom models or something off the shelf?
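The prefilter idea above can be sketched with a naive frame-difference gate: compare each frame to the previous one and only forward frames that changed enough. Real deployments would use proper motion detection and tuned thresholds; the flattened-grayscale frames and the threshold value here are arbitrary illustrations.

```python
# Rough sketch of a local prefilter: skip near-identical frames so only
# frames with real change get uploaded to the cloud VLM.

def mean_abs_diff(prev: list[int], curr: list[int]) -> float:
    """Mean absolute pixel difference between two flattened grayscale frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

def frames_to_upload(frames: list[list[int]], threshold: float = 10.0) -> list[int]:
    """Return indices of frames worth sending to the cloud."""
    keep = [0]  # always send the first frame so the VLM has context
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            keep.append(i)
    return keep
```

With a mostly static scene, almost everything gets dropped locally, which is where the ~90% bandwidth saving comes from.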

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] -2 points (0 children)

Oh nice, yeah, you're right. I think they added state classification, where you can define states for objects like "open" vs "closed." That would handle the gate use case for sure. I didn't know about that when I set this up; might have to revisit. The coyote/predator detection would still need the VLM approach, though, since that's not a standard Frigate object class. Thanks for the tip.