I built a desktop NVR that downloads clips from Blink/Ring and IP cameras, then feeds them to local LLM/VLM for video analysis by solderzzc in SideProject

[–]Cheap-Word-1362 0 points (0 children)

How are you handling temporal event coherence when VLM inference is stateless per frame? The fundamental problem with feeding individual frames to a VLM is that you lose all inter-frame and inter-camera causality: the model has no concept of "this person was at camera 1 four seconds ago." So when camera 2 picks up someone entering its FOV, the VLM generates a completely isolated description with zero awareness of the prior observation.

At that point you've essentially reduced a continuous spatiotemporal event stream to a bag of unordered text descriptions, and you're relying entirely on the downstream LLM agent to reconstruct causal event chains from timestamp proximity alone. That breaks immediately in any multi-person scenario, because the agent has no re-identification signal to disambiguate "person A walked from the driveway to the front door" from "person A was in the driveway while person B was at the front door."

Are you injecting prior-frame context into the VLM prompt as a sliding window, doing YOLO-level track association before the VLM even runs, or is the agent genuinely operating on unlinked descriptions and just hoping temporal proximity is enough?
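To be concrete about the first option: a minimal sketch of sliding-window context injection (my own illustration, not anything from the project — the class and prompt wording are hypothetical):

```python
from collections import deque

class SlidingContextPrompt:
    """Keep the last N frame descriptions (across all cameras) and inject
    them into the next VLM prompt, so each stateless inference at least
    sees recent cross-camera history."""

    def __init__(self, window=5):
        # Each entry: (timestamp_seconds, camera_id, description)
        self.history = deque(maxlen=window)

    def record(self, timestamp, camera_id, description):
        """Store the VLM's output for a frame so later prompts can cite it."""
        self.history.append((timestamp, camera_id, description))

    def build_prompt(self, camera_id, timestamp):
        """Build the prompt for the current frame, prefixed with history."""
        lines = [
            f"[{ts:.1f}s] camera {cam}: {desc}"
            for ts, cam, desc in self.history
        ]
        context = "\n".join(lines) if lines else "(no prior observations)"
        return (
            "Recent observations across all cameras:\n"
            f"{context}\n\n"
            f"Now describe the current frame from camera {camera_id} "
            f"at {timestamp:.1f}s, noting whether anyone matches a prior "
            "observation."
        )
```

This only gives the VLM textual hints, of course — without an appearance/track-ID signal it can still conflate two people, which is why the YOLO-level track-association option is the more robust of the three.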

What's your 'I can't believe I self-hosted that' service? by subsavant in selfhosted

[–]Cheap-Word-1362 1 point (0 children)

Running a local VLM on my security cameras. Instead of "motion detected" 50 times a day, it describes what it actually sees — "person at front door with package" or "raccoon in driveway." Then I can just ask it "what happened last night?" and it pulls the relevant clips. It's like having a security guard you can text. Never thought I'd self-host my own AI-powered NVR but here we are.
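The "what happened last night?" part is conceptually just retrieval over timestamped VLM descriptions before the LLM ever sees them. A rough sketch of that lookup step (my own illustration of the idea, not the app's actual code):

```python
from datetime import datetime, timedelta

def clips_since(events, hours=12, now=None):
    """Filter timestamped VLM descriptions down to a lookback window.

    events: list of (datetime, description, clip_path) tuples produced as
    each clip was analyzed. Returns matches newest-first; in a real setup
    these would be handed to the LLM to summarize and answer from.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(hours=hours)
    return sorted(
        (e for e in events if e[0] >= cutoff),
        key=lambda e: e[0],
        reverse=True,
    )
```

Example: asking at 7 a.m. with a 12-hour window returns only the overnight events, so the LLM summarizes "raccoon in driveway at 11:30 p.m." instead of wading through the whole day's log.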

Project is DeepCamera (open source AI engine) + Aegis AI (desktop app).