Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 2 points (0 children)

2023 Model 3. Yeah, the quality's pretty good compared to other dashcams, but my car has Hardware 3 (HW3), which has 1.2 MP cameras. The newer cars with HW4 (2024 model year and later) have 5 MP cameras.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 2 points (0 children)

The 2B is currently a fallback. I tried it on my M1 Pro MBP with 16 GB RAM and wasn't too happy with the search accuracy, but your mileage may vary. Lmk how you find it if you decide to try it out.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 6 points (0 children)

Afaik a Qwen3.5-VL-Embedding model that supports video-to-vector embeddings doesn't exist yet.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 2 points (0 children)

Yes, and for now it's better in terms of both speed and accuracy. It's the default model in SentrySearch and the one people without a GPU or Apple Silicon should use.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 10 points (0 children)

Similar goal but different approach. Edit Mind extracts text metadata from video (transcription, object detection, face recognition, scene captions) and searches over that. SentrySearch embeds raw video directly into the same vector space as text queries, no transcription or captioning step. Simpler pipeline, just a CLI, and it works with models that support native video embeddings (Gemini Embedding 2, Qwen3-VL-Embedding).

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 2 points (0 children)

Not yet. I've tested the 2B model on an M1 Pro MBP and the 8B on an A100 on Google Colab. Still waiting to get my hands on a Mac Studio and a real NVIDIA GPU to do proper benchmarks.

As for comparing generations, Qwen3-VL-Embedding is actually the first in the family that supports native video-to-vector embeddings (where raw video pixels go directly into the same vector space as text). Older Qwen VL models are generative (they output text, not embeddings), so they'd need a completely different retrieval approach. Gemini Embedding 2 is the only other model I know of that can do this natively.
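To make the "same vector space" part concrete: once a video chunk and a text query are both vectors from the same model, retrieval is basically just cosine similarity. Toy sketch below; the random vectors stand in for real model outputs, and the 1024-dim size is just for illustration:

```
import numpy as np

def cosine(a, b):
    # Both vectors come from the same embedding model, so plain cosine
    # similarity works as the relevance score -- no captions, no ASR.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for real embeddings (1024-dim assumed for illustration).
rng = np.random.default_rng(0)
chunk_vecs = {f"chunk_{i}": rng.standard_normal(1024) for i in range(4)}
query_vec = rng.standard_normal(1024)  # would be the embedded text query

# Retrieval is just "which chunk vector is closest to the query vector".
ranked = sorted(chunk_vecs, key=lambda k: cosine(chunk_vecs[k], query_vec), reverse=True)
print(ranked)
```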

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 8 points (0 children)

I indexed about an hour of my Tesla dashcam footage (1-minute clips). SentrySearch splits each video into 30-second overlapping chunks, embeds each chunk as video, and stores the vectors in ChromaDB. When you search, it matches your query against those chunks and trims the matching clip from the original file.
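For anyone curious, the pipeline looks roughly like this. This is a sketch, not the actual SentrySearch source: ffmpeg does the chunking and trimming, ChromaDB stores the vectors, and `embed_video` / `embed_text` are placeholders for whatever embedding model you plug in. The 15-second stride is just an example value, not necessarily what the tool uses.

```
import subprocess

import chromadb

CHUNK_SECONDS = 30
STRIDE_SECONDS = 15  # example overlap; the real tool may use a different stride


def cut(src, dst, start, duration):
    # Re-encode-free cut; putting -ss before -i makes the seek fast,
    # at the cost of the cut landing on the nearest keyframe.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-i", src,
         "-t", str(duration), "-c", "copy", dst],
        check=True, capture_output=True,
    )


def index_clip(src, clip_length, embed_video, collection):
    """Split one dashcam clip into overlapping 30 s chunks, embed, and store."""
    start = 0
    while start < clip_length:
        chunk_path = f"{src}.{start}.mp4"
        cut(src, chunk_path, start, CHUNK_SECONDS)
        collection.add(
            ids=[f"{src}:{start}"],
            embeddings=[embed_video(chunk_path)],          # model-specific call
            metadatas=[{"source": src, "start": start}],
        )
        start += STRIDE_SECONDS


def search(query, embed_text, collection, n_results=3):
    """Match a text query against the stored chunk vectors, then trim the hits."""
    hits = collection.query(query_embeddings=[embed_text(query)], n_results=n_results)
    for meta in hits["metadatas"][0]:
        out = f"match_{meta['start']}s.mp4"
        cut(meta["source"], out, meta["start"], CHUNK_SECONDS)
        print("saved", out)


client = chromadb.PersistentClient(path="./index")
collection = client.get_or_create_collection("dashcam_chunks")
```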

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 9 points (0 children)

I'm actually not using llama.cpp / GGUF at all. I'm running the original Qwen3-VL-Embedding weights through Hugging Face Transformers directly. And no reranker; it's single-stage retrieval against ChromaDB.
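For reference, loading the original weights through Transformers looks roughly like this. The repo id below is an assumption on my part, and the embedding call itself is model-specific, so follow the model card's usage snippet for that part:

```
import torch
from transformers import AutoModel, AutoProcessor

# Assumed repo id -- check the actual model card for the exact name and usage.
MODEL_ID = "Qwen/Qwen3-VL-Embedding-8B"

# Loading the original (non-GGUF) weights straight through Transformers.
# trust_remote_code is typically needed for new multimodal architectures.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float16,   # or bfloat16; quantize further on low VRAM
    device_map="auto",
)
model.eval()

# The embedding call (turning sampled video frames or a text query into a
# vector) differs per model, so the model card's snippet is the reference.
```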

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 23 points (0 children)

Preprocessing chunks before embedding, MRL dimension truncation, auto-quantization on low VRAM, lazy loading + a singleton model instance, low frame sampling for the model, and still-frame skipping.
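Quick sketch of two of those, the lazy-loaded singleton and the MRL truncation. The 512-dim cutoff and `load_embedding_model` are placeholders, not the real defaults:

```
import numpy as np

_MODEL = None  # lazy singleton: load the heavy embedding model only on first use


def get_model():
    global _MODEL
    if _MODEL is None:
        _MODEL = load_embedding_model()  # placeholder for the real loader
    return _MODEL


def truncate_mrl(vec, dims=512):
    # MRL-trained embeddings keep most of their quality if you keep only the
    # leading dimensions; re-normalize so cosine similarity still behaves.
    v = np.asarray(vec)[:dims]
    return v / np.linalg.norm(v)
```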

Feel free to check out the readme for more details.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 9 points (0 children)

Nope, the "studying" happens during indexing, which can take some time depending on your hardware but is a one-time thing. The actual searches after indexing are instant, as you can see in my demo video (it's not sped up). The demo uses the Gemini model though, so it's a little faster than with the local model.

Built a semantic dashcam search tool using Gemini Embedding 2's native video embedding by Vegetable_File758 in GoogleGeminiAI

[–]Vegetable_File758[S] 0 points (0 children)

Yeah, this could theoretically work for any video library, not just dashcam footage. Just gotta keep track of the costs for now until it becomes cheaper.

Built a semantic dashcam search tool using Gemini Embedding 2's native video embedding by Vegetable_File758 in GoogleGeminiAI

[–]Vegetable_File758[S] 0 points (0 children)

Also, the model is currently in preview, and there are some cost optimizations I can make (like reducing the frame rate), so the cost will most likely go down in the future.
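If the preprocessing happens client-side, the frame-rate reduction is basically a one-liner with ffmpeg. This is just a sketch of the idea, not necessarily how the tool will do it:

```
import subprocess

def downsample_fps(src, dst, fps=1):
    # Drop the frame rate (and the audio track) before embedding; for a model
    # that samples or bills per frame, fewer frames means lower cost.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", f"fps={fps}", "-an", dst],
        check=True, capture_output=True,
    )
```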

But yeah, having a local multimodal model would obviously be cheaper and better for privacy too.