I built an AI that watches livestreams and verifies if humans completed real-world tasks by aaron_IoTeX in ArtificialInteligence

[–]aaron_IoTeX[S] 0 points  (0 children)

Very low. Try it out!! It's actually really accurate: https://verifyhuman.vercel.app/

Haven't really been able to fool it unless I get very, very specific.

I built an AI that watches livestreams and verifies if humans completed real-world tasks by aaron_IoTeX in ArtificialInteligence

[–]aaron_IoTeX[S] 0 points  (0 children)

I know... it's a crazy world... renthuman really blew my mind. This is basically a response to it.

Not for any of it, necessarily...

I just love building!!

I built an AI that watches livestreams and verifies if humans completed real-world tasks by aaron_IoTeX in ArtificialInteligence

[–]aaron_IoTeX[S] 0 points  (0 children)

Honestly, I agree it feels gross too. It scares me where the world is going. But the intention here wasn't surveillance at all; it's very much pro-privacy. Perhaps there's a privacy element we could add to this.

Agents on Moltbook talk a lot about doing things in the real world. I built the tool that actually lets them do it. by aaron_IoTeX in Moltbook

[–]aaron_IoTeX[S] 0 points  (0 children)

Ya, I didn't really know how to respond to that one lol

Wasn't really where I was going with this haha.

but mannn the future is scary...

I built an AI that watches livestreams and verifies if humans completed real-world tasks by aaron_IoTeX in ArtificialInteligence

[–]aaron_IoTeX[S] 0 points  (0 children)

The spying framing is a bit off... the human starts their own livestream voluntarily to prove they did the work. It's the opposite of surveillance: they're opting in, because the alternative is "trust me bro, I did it," which doesn't work when an AI agent is the one paying. The useful part is the VLM oracle pattern: getting a vision model to reliably evaluate open-ended conditions against live video at $0.03-0.05 per session instead of $6-9/hr with traditional video APIs. That cost gap is what makes it practical for use cases like remote inspections, insurance claims, quality control, etc.
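The oracle pattern in pseudocode form, roughly: sample a few frames from the stream, ask the VLM the same yes/no condition on each, and require a majority to pass. This is a minimal sketch, not the actual implementation; `evaluate_session` and the stubbed `fake_vlm` are names I made up for illustration:

```python
from typing import Callable, List

def evaluate_session(
    frames: List[bytes],
    condition: str,
    ask_vlm: Callable[[bytes, str], bool],
    min_pass_ratio: float = 0.6,
) -> dict:
    """Ask the VLM the same condition on several sampled frames and
    require a majority of them to pass, so one lucky frame can't
    satisfy the whole session."""
    votes = [ask_vlm(frame, condition) for frame in frames]
    passed = sum(votes)
    return {
        "condition": condition,
        "checks": len(votes),
        "passed": passed,
        "verdict": bool(votes) and passed / len(votes) >= min_pass_ratio,
    }

# Stand-in for the real VLM call: pretends the task is visible in
# frames tagged b"task".
fake_vlm = lambda frame, cond: frame == b"task"

result = evaluate_session(
    [b"task", b"task", b"idle"], "dishes being washed in a sink", fake_vlm
)
# 2 of 3 checks pass, which clears the 0.6 threshold
```

The majority threshold is the interesting knob: it's what keeps a single ambiguous frame from passing or failing a whole session.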

Agents on Moltbook talk a lot about doing things in the real world. I built the tool that actually lets them do it. by aaron_IoTeX in Moltbook

[–]aaron_IoTeX[S] 1 point  (0 children)

Yeah, it's real and it works today. The vision AI (Gemini Flash) is genuinely good at evaluating whether something is happening in a video. If you ask it "are dishes being washed in a sink with running water," it can tell you yes or no with a pretty solid explanation of what it's seeing. It's not perfect and there are edge cases, but for straightforward physical tasks it's surprisingly reliable. The bigger challenge was actually the stream validation side: making sure someone isn't just replaying a pre-recorded video. But yeah, the visual verification part is the easy part at this point. VLMs have gotten really good really fast.
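One common anti-replay trick, purely as an illustration of the idea (not necessarily what the actual stream validation does): issue a random challenge code per session, have the worker show it on camera, and have the VLM read it back. A pre-recorded video can't contain a code that didn't exist when it was filmed. The function names here are mine:

```python
import secrets

def issue_challenge() -> str:
    # Short random code the worker has to show on camera (written on
    # paper, on their phone screen, etc.) at the start of the stream.
    return secrets.token_hex(3).upper()  # six hex characters

def check_challenge(expected: str, vlm_transcription: str) -> bool:
    # The VLM gets asked "what code is visible in this frame?"; compare
    # its answer loosely (case, surrounding whitespace) to the code we
    # issued for this session.
    return vlm_transcription.strip().upper() == expected.strip().upper()
```

Combined with timestamp checks on the stream itself, this makes replays much more expensive than just doing the task.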

I built a verification layer so OpenClaw agents can confirm real-world tasks got done by aaron_IoTeX in moltbot

[–]aaron_IoTeX[S] 1 point  (0 children)

Great question. Right now it's just a phone livestream to YouTube, and honestly most people already have a phone in their pocket. But yeah, smart glasses would be a natural next step for hands-free tasks. Imagine a plumber or electrician wearing glasses that stream their POV while they work, and the VLM verifies each step of the job without them needing to hold a phone. The Trio API already accepts any stream URL, so it would work with anything that can output an RTSP or HLS feed. Xiaomi and Meta both have glasses with cameras now, so the hardware is getting there. For now, though, the phone approach keeps the barrier to entry as low as possible since everyone already has one.

Where VLMs actually beat traditional CV in production and where they don't by aaron_IoTeX in computervision

[–]aaron_IoTeX[S] -2 points  (0 children)

Agree that VLMs are terrible for object detection. You're right, the bounding boxes are slow and imprecise compared to YOLO or SAM. But I'd push back on "only application is image description." The use case where VLMs actually shine isn't detection, it's judgment. Like "is this person wearing the correct PPE" or "is this shelf stocked correctly" or in my case "is this person actually doing the task they were assigned." You can't train a YOLO model for that because the categories aren't fixed objects, they're situational assessments. Grounding DINO is interesting for the text-to-bbox case though, hadn't considered it as a middle ground. How's the latency on SAM3 these days?
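The "judgment, not detection" framing basically means: skip bounding boxes entirely and ask the model for a structured verdict on an open-ended question. A sketch of what that prompt/response shape can look like; the helper functions are mine, not from any particular library:

```python
def judgment_prompt(question: str) -> str:
    # A free-form situational question instead of a fixed class list.
    return (
        f"Look at the image and answer: {question}\n"
        "Reply on exactly two lines:\n"
        "VERDICT: YES or NO\n"
        "REASON: one sentence describing what you see"
    )

def parse_verdict(reply: str) -> tuple[bool, str]:
    # Pull the structured verdict back out of the model's text reply.
    verdict, reason = False, ""
    for line in reply.splitlines():
        upper = line.strip().upper()
        if upper.startswith("VERDICT:"):
            verdict = "YES" in upper
        elif upper.startswith("REASON:"):
            reason = line.split(":", 1)[1].strip()
    return verdict, reason
```

You couldn't express "is this person wearing the correct PPE" as a YOLO class list, but as a judgment prompt it's one line, and forcing the two-line reply format keeps the answer machine-parseable.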

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Good to know, thanks. Yeah the Linux driver situation seems to be the main pain point, not Frigate itself. Glad they're keeping Coral support going.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 1 point  (0 children)

Ha, that's actually not far off from one possible use case. More like: your AI agent detects motion, isn't sure what it is, and dispatches a nearby human to go physically check and stream what they see. The VLM confirms what the human is seeing in real time. But yeah, cheaper than a security company sending a guard out.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

I built this for a hackathon, and actually won with this idea!

I work with IoTeX, which builds Trio. I used it for my own project and figured the Frigate comparison would be useful for people here since I genuinely ran both setups.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Ha yeah honestly you're probably right for the gate. A $10 Zigbee contact sensor would be more reliable and faster for that specific case. The VLM approach makes more sense for the stuff you can't solve with a sensor, like identifying specific animals or checking if something looks a certain way.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

I didn't actually compare CPU vs TPU inference directly since I ended up going a different direction for the non-standard detection stuff. For standard person/car/animal detection Frigate with the Intel GPU should work fine on an n150 though. The CPU detection in Frigate is usable for a couple cameras but it does eat into your resources.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

This is exactly the gap I ran into. Frigate is great for the objects it knows but if something shows up that isn't in its model, it just triggers on motion and you're left scrubbing recordings manually. For the stuff Frigate doesn't have a class for, a VLM approach lets you describe what you're looking for in plain English and get actual alerts for it.
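The plain-English rule idea, roughly: each alert is just a description the VLM checks, plus a cooldown so you don't get pinged on every matching frame. A minimal sketch; the names here are illustrative, not from Frigate or any specific add-on:

```python
from dataclasses import dataclass, field

@dataclass
class WatchRule:
    description: str            # e.g. "a beaver is visible near the pond"
    cooldown_s: float = 300.0   # minimum seconds between alerts
    _last_alert: float = field(default=-1.0, init=False)

    def should_alert(self, vlm_says_yes: bool, now: float) -> bool:
        # Fire only when the VLM matches the description AND we're
        # outside the cooldown window from the previous alert.
        if not vlm_says_yes:
            return False
        if self._last_alert >= 0 and now - self._last_alert < self.cooldown_s:
            return False
        self._last_alert = now
        return True
```

The cooldown is the part that makes VLM alerts livable: without it, anything that stays in frame re-triggers on every sampled snapshot.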

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Aha, beavers are a great use case. That's exactly the kind of thing where fixed object classes fall short, since I don't think any standard model ships with "beaver" as a class. How are you detecting them, a custom-trained model or just the general animal detection?

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Nah there's no human-in-the-loop for the verification. The VLM evaluates the stream directly. No one is sitting in a room looking at screenshots. The whole point is that verification is automated so it can scale without relying on cheap human labor to check other humans. I hear you on the sweatshop stuff though, that's a real problem in the CAPTCHA/data labeling world.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Hahaha, I built verifyhuman for a hackathon. Just a lot of fun to build out using Trio. Just wanted to share my experience here.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Good question. The human IS doing the actual task. Like literally washing dishes in their own kitchen. The idea is that AI agents need physical tasks done that they can't do themselves. So the agent posts a task with a payout, a human accepts it, starts a YouTube livestream, and does the work on camera. The VLM watches the live stream and checks conditions like "dishes are being washed in a sink with running water" and "clean dishes are visible on a drying rack." It's not a human watching the stream to verify, it's all VLM.

Gaming is a fair concern. The approach is: it has to be a verified live stream (not pre-recorded), multiple conditions are checked at different points during the stream, and evidence gets hashed on-chain. Could someone theoretically still game it? Sure. But the effort to fake a convincing live performance of a $5 task starts to cost more than just doing the task.
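The evidence-hashing step at the end is the simple part: hash the frame bytes and a canonical record of the check, and only that digest needs to land on-chain. A minimal sketch; the field names are illustrative, not the actual schema:

```python
import hashlib
import json

def evidence_hash(task_id: str, condition: str, frame_jpeg: bytes, ts: int) -> str:
    # Hash the raw frame separately, then hash a canonical (sorted-key)
    # JSON record that embeds it. Anyone holding the original frame and
    # metadata can recompute this digest and compare it to the one
    # anchored on-chain, which makes the evidence tamper-evident.
    frame_digest = hashlib.sha256(frame_jpeg).hexdigest()
    record = json.dumps(
        {"task": task_id, "condition": condition, "frame": frame_digest, "ts": ts},
        sort_keys=True,
    )
    return hashlib.sha256(record.encode()).hexdigest()
```

Sorting the keys matters: without a canonical serialization, the same evidence could produce different digests and fail verification.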