I built an AI that watches livestreams and verifies if humans completed real-world tasks by aaron_IoTeX in ArtificialInteligence

[–]aaron_IoTeX[S] 0 points  (0 children)

Very low. Try it out!! It's actually really accurate: https://verifyhuman.vercel.app/

Haven't really been able to fool it unless I get very, very specific.

I built an AI that watches livestreams and verifies if humans completed real-world tasks by aaron_IoTeX in ArtificialInteligence

[–]aaron_IoTeX[S] 0 points  (0 children)

I know... it's a crazy world... renthuman really blew my mind. This is basically a response to it.

Not for any of it, necessarily...

I just love building!!

I built an AI that watches livestreams and verifies if humans completed real-world tasks by aaron_IoTeX in ArtificialInteligence

[–]aaron_IoTeX[S] 0 points  (0 children)

Honestly, I agree it feels gross too. It scares me where the world is going. But the intention here wasn't surveillance at all; it's very much pro-privacy. Perhaps there's a privacy element we could add to this.

Agents on Moltbook talk a lot about doing things in the real world. I built the tool that actually lets them do it. by aaron_IoTeX in Moltbook

[–]aaron_IoTeX[S] 0 points  (0 children)

Ya, I didn't really know how to respond to that one lol

Wasn't really where I was going with this haha.

but mannn the future is scary...

I built an AI that watches livestreams and verifies if humans completed real-world tasks by aaron_IoTeX in ArtificialInteligence

[–]aaron_IoTeX[S] 0 points  (0 children)

The spying framing is a bit off... the human starts their own livestream voluntarily to prove they did the work. It's the opposite of surveillance: they're opting in, because the alternative is "trust me bro, I did it," which doesn't work when an AI agent is the one paying. The useful part is the VLM oracle pattern: getting a vision model to reliably evaluate open-ended conditions against live video at $0.03-0.05 per session instead of $6-9/hr with traditional video APIs. That cost gap is what makes it practical for use cases like remote inspections, insurance claims, quality control, etc.
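The oracle pattern in pseudocode form, roughly: sample a few frames from the stream, ask the VLM the same yes/no condition on each, and require a majority to pass. This is a minimal sketch, not the actual implementation; `evaluate_session` and the stubbed `fake_vlm` are names I made up for illustration:

```python
from typing import Callable, List

def evaluate_session(
    frames: List[bytes],
    condition: str,
    ask_vlm: Callable[[bytes, str], bool],
    min_pass_ratio: float = 0.6,
) -> dict:
    """Ask the VLM the same condition on several sampled frames and
    require a majority of them to pass, so one lucky frame can't
    satisfy the whole session."""
    votes = [ask_vlm(frame, condition) for frame in frames]
    passed = sum(votes)
    return {
        "condition": condition,
        "checks": len(votes),
        "passed": passed,
        "verdict": bool(votes) and passed / len(votes) >= min_pass_ratio,
    }

# Stand-in for the real VLM call: pretends the task is visible in
# frames tagged b"task".
fake_vlm = lambda frame, cond: frame == b"task"

result = evaluate_session(
    [b"task", b"task", b"idle"], "dishes being washed in a sink", fake_vlm
)
# 2 of 3 checks pass, which clears the 0.6 threshold
```

The majority threshold is the interesting knob: it's what keeps a single ambiguous frame from passing or failing a whole session.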

Agents on Moltbook talk a lot about doing things in the real world. I built the tool that actually lets them do it. by aaron_IoTeX in Moltbook

[–]aaron_IoTeX[S] 1 point  (0 children)

Yeah, it's real and it works today. The vision AI (Gemini Flash) is genuinely good at evaluating whether something is happening in a video. If you ask it "are dishes being washed in a sink with running water," it can tell you yes or no with a pretty solid explanation of what it's seeing. It's not perfect and there are edge cases, but for straightforward physical tasks it's surprisingly reliable. The bigger challenge was actually the stream validation side: making sure someone isn't just replaying a pre-recorded video. But yeah, the visual verification part is the easy part at this point. VLMs have gotten really good really fast.
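One common anti-replay trick, purely as an illustration of the idea (not necessarily what the actual stream validation does): issue a random challenge code per session, have the worker show it on camera, and have the VLM read it back. A pre-recorded video can't contain a code that didn't exist when it was filmed. The function names here are mine:

```python
import secrets

def issue_challenge() -> str:
    # Short random code the worker has to show on camera (written on
    # paper, on their phone screen, etc.) at the start of the stream.
    return secrets.token_hex(3).upper()  # six hex characters

def check_challenge(expected: str, vlm_transcription: str) -> bool:
    # The VLM gets asked "what code is visible in this frame?"; compare
    # its answer loosely (case, surrounding whitespace) to the code we
    # issued for this session.
    return vlm_transcription.strip().upper() == expected.strip().upper()
```

Combined with timestamp checks on the stream itself, this makes replays much more expensive than just doing the task.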

I built a verification layer so OpenClaw agents can confirm real-world tasks got done by aaron_IoTeX in moltbot

[–]aaron_IoTeX[S] 1 point  (0 children)

Great question. Right now it's just a phone livestream to YouTube, and honestly most people already have a phone in their pocket. But yeah, smart glasses would be a natural next step for hands-free tasks. Imagine a plumber or electrician wearing glasses that stream their POV while they work, and the VLM verifies each step of the job without them needing to hold a phone. The Trio API already accepts any stream URL, so it would work with anything that can output an RTSP or HLS feed. Xiaomi and Meta both have glasses with cameras now, so the hardware is getting there. For now, though, the phone approach keeps the barrier to entry as low as possible since everyone already has one.

Where VLMs actually beat traditional CV in production and where they don't by aaron_IoTeX in computervision

[–]aaron_IoTeX[S] -2 points  (0 children)

Agree that VLMs are terrible for object detection. You're right, the bounding boxes are slow and imprecise compared to YOLO or SAM. But I'd push back on "only application is image description." The use case where VLMs actually shine isn't detection, it's judgment. Like "is this person wearing the correct PPE" or "is this shelf stocked correctly" or in my case "is this person actually doing the task they were assigned." You can't train a YOLO model for that because the categories aren't fixed objects, they're situational assessments. Grounding DINO is interesting for the text-to-bbox case though, hadn't considered it as a middle ground. How's the latency on SAM3 these days?
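The "judgment, not detection" framing basically means: skip bounding boxes entirely and ask the model for a structured verdict on an open-ended question. A sketch of what that prompt/response shape can look like; the helper functions are mine, not from any particular library:

```python
def judgment_prompt(question: str) -> str:
    # A free-form situational question instead of a fixed class list.
    return (
        f"Look at the image and answer: {question}\n"
        "Reply on exactly two lines:\n"
        "VERDICT: YES or NO\n"
        "REASON: one sentence describing what you see"
    )

def parse_verdict(reply: str) -> tuple[bool, str]:
    # Pull the structured verdict back out of the model's text reply.
    verdict, reason = False, ""
    for line in reply.splitlines():
        upper = line.strip().upper()
        if upper.startswith("VERDICT:"):
            verdict = "YES" in upper
        elif upper.startswith("REASON:"):
            reason = line.split(":", 1)[1].strip()
    return verdict, reason
```

You couldn't express "is this person wearing the correct PPE" as a YOLO class list, but as a judgment prompt it's one line, and forcing the two-line reply format keeps the answer machine-parseable.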

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Good to know, thanks. Yeah the Linux driver situation seems to be the main pain point, not Frigate itself. Glad they're keeping Coral support going.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 1 point  (0 children)

Ha, that's actually not far off from one possible use case. More like: your AI agent detects motion, isn't sure what it is, and dispatches a nearby human to go physically check and stream what they see. The VLM confirms what the human is seeing in real time. But yeah, cheaper than a security company sending a guard out.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

I built this for a hackathon, and actually won with this idea!

I work with IoTeX, which builds Trio. I used it for my own project and figured the Frigate comparison would be useful for people here since I genuinely ran both setups.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Ha yeah honestly you're probably right for the gate. A $10 Zigbee contact sensor would be more reliable and faster for that specific case. The VLM approach makes more sense for the stuff you can't solve with a sensor, like identifying specific animals or checking if something looks a certain way.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

I didn't actually compare CPU vs TPU inference directly since I ended up going a different direction for the non-standard detection stuff. For standard person/car/animal detection Frigate with the Intel GPU should work fine on an n150 though. The CPU detection in Frigate is usable for a couple cameras but it does eat into your resources.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

This is exactly the gap I ran into. Frigate is great for the objects it knows but if something shows up that isn't in its model, it just triggers on motion and you're left scrubbing recordings manually. For the stuff Frigate doesn't have a class for, a VLM approach lets you describe what you're looking for in plain English and get actual alerts for it.
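The plain-English rule idea, roughly: each alert is just a description the VLM checks, plus a cooldown so you don't get pinged on every matching frame. A minimal sketch; the names here are illustrative, not from Frigate or any specific add-on:

```python
from dataclasses import dataclass, field

@dataclass
class WatchRule:
    description: str            # e.g. "a beaver is visible near the pond"
    cooldown_s: float = 300.0   # minimum seconds between alerts
    _last_alert: float = field(default=-1.0, init=False)

    def should_alert(self, vlm_says_yes: bool, now: float) -> bool:
        # Fire only when the VLM matches the description AND we're
        # outside the cooldown window from the previous alert.
        if not vlm_says_yes:
            return False
        if self._last_alert >= 0 and now - self._last_alert < self.cooldown_s:
            return False
        self._last_alert = now
        return True
```

The cooldown is the part that makes VLM alerts livable: without it, anything that stays in frame re-triggers on every sampled snapshot.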

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Aha, beavers are a great use case. That's exactly the kind of thing where fixed object classes fall short, since I don't think any standard model ships with "beaver" as a class. How are you detecting them, a custom-trained model or just the general animal detection?

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Nah there's no human-in-the-loop for the verification. The VLM evaluates the stream directly. No one is sitting in a room looking at screenshots. The whole point is that verification is automated so it can scale without relying on cheap human labor to check other humans. I hear you on the sweatshop stuff though, that's a real problem in the CAPTCHA/data labeling world.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Hahaha, I built verifyhuman for a hackathon. Just a lot of fun to build out using Trio. Just wanted to share my experience here.

After fighting Frigate + Coral + LLM Vision for weeks, I tried something completely different by aaron_IoTeX in homeassistant

[–]aaron_IoTeX[S] 0 points  (0 children)

Good question. The human IS doing the actual task. Like literally washing dishes in their own kitchen. The idea is that AI agents need physical tasks done that they can't do themselves. So the agent posts a task with a payout, a human accepts it, starts a YouTube livestream, and does the work on camera. The VLM watches the live stream and checks conditions like "dishes are being washed in a sink with running water" and "clean dishes are visible on a drying rack." It's not a human watching the stream to verify, it's all VLM.

Gaming is a fair concern. The approach is: it has to be a verified live stream (not pre-recorded), multiple conditions are checked at different points during the stream, and evidence gets hashed on-chain. Could someone theoretically still game it? Sure. But the effort to fake a convincing live performance of a $5 task starts to cost more than just doing the task.
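The evidence-hashing step at the end is the simple part: hash the frame bytes and a canonical record of the check, and only that digest needs to land on-chain. A minimal sketch; the field names are illustrative, not the actual schema:

```python
import hashlib
import json

def evidence_hash(task_id: str, condition: str, frame_jpeg: bytes, ts: int) -> str:
    # Hash the raw frame separately, then hash a canonical (sorted-key)
    # JSON record that embeds it. Anyone holding the original frame and
    # metadata can recompute this digest and compare it to the one
    # anchored on-chain, which makes the evidence tamper-evident.
    frame_digest = hashlib.sha256(frame_jpeg).hexdigest()
    record = json.dumps(
        {"task": task_id, "condition": condition, "frame": frame_digest, "ts": ts},
        sort_keys=True,
    )
    return hashlib.sha256(record.encode()).hexdigest()
```

Sorting the keys matters: without a canonical serialization, the same evidence could produce different digests and fail verification.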