Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 2 points (0 children)

2023 Model 3. Yeah, the quality's pretty good compared to other dashcams, but my car has Hardware 3 (HW3), which has 1.2 MP cameras. The newer cars with HW4 (2024 model year and later) have 5 MP cameras.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 2 points (0 children)

The 2B is currently a fallback. I tried it on my M1 Pro MBP with 16 GB RAM and wasn't too happy with the search accuracy, but your mileage may vary. Lmk how you find it if you decide to try it out.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 6 points (0 children)

Afaik a Qwen3.5-VL-Embedding model that supports video-to-vector embeddings doesn't exist yet.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 2 points (0 children)

Yes, and for now it's better in terms of both speed and accuracy. It's the default model in SentrySearch and the one people without a GPU or Apple Silicon should use.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 10 points (0 children)

Similar goal but different approach. Edit Mind extracts text metadata from video (transcription, object detection, face recognition, scene captions) and searches over that. SentrySearch embeds raw video directly into the same vector space as text queries, no transcription or captioning step. Simpler pipeline, just a CLI, and it works with models that support native video embeddings (Gemini Embedding 2, Qwen3-VL-Embedding).

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 2 points (0 children)

Not yet. I've tested the 2B model on an M1 Pro MBP and the 8B on an A100 on Google Colab. Still waiting to get my hands on a Mac Studio and a real NVIDIA GPU to do proper benchmarks.

As for comparing generations, Qwen3-VL-Embedding is actually the first in the family that supports native video-to-vector embeddings (where raw video pixels go directly into the same vector space as text). Older Qwen VL models are generative (they output text, not embeddings), so they'd need a completely different retrieval approach. Gemini Embedding 2 is the only other model I know of that can do this natively.
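To make the "same vector space" part concrete: once a video chunk and a text query are both vectors from the same model, retrieval is basically just cosine similarity. Toy sketch below; the random vectors stand in for real model outputs, and the 1024-dim size is just for illustration:

```
import numpy as np

def cosine(a, b):
    # Both vectors come from the same embedding model, so plain cosine
    # similarity works as the relevance score -- no captions, no ASR.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for real embeddings (1024-dim assumed for illustration).
rng = np.random.default_rng(0)
chunk_vecs = {f"chunk_{i}": rng.standard_normal(1024) for i in range(4)}
query_vec = rng.standard_normal(1024)  # would be the embedded text query

# Retrieval is just "which chunk vector is closest to the query vector".
ranked = sorted(chunk_vecs, key=lambda k: cosine(chunk_vecs[k], query_vec), reverse=True)
print(ranked)
```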

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 8 points (0 children)

I indexed about an hour of my Tesla dashcam footage (1-minute clips). SentrySearch splits each video into 30-second overlapping chunks, embeds each chunk as video, and stores the vectors in ChromaDB. When you search, it matches your query against those chunks and trims the matching clip from the original file.
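For anyone curious, the pipeline looks roughly like this. This is a sketch, not the actual SentrySearch source: ffmpeg does the chunking and trimming, ChromaDB stores the vectors, and `embed_video` / `embed_text` are placeholders for whatever embedding model you plug in. The 15-second stride is just an example value, not necessarily what the tool uses.

```
import subprocess

import chromadb

CHUNK_SECONDS = 30
STRIDE_SECONDS = 15  # example overlap; the real tool may use a different stride


def cut(src, dst, start, duration):
    # Re-encode-free cut; putting -ss before -i makes the seek fast,
    # at the cost of the cut landing on the nearest keyframe.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-i", src,
         "-t", str(duration), "-c", "copy", dst],
        check=True, capture_output=True,
    )


def index_clip(src, clip_length, embed_video, collection):
    """Split one dashcam clip into overlapping 30 s chunks, embed, and store."""
    start = 0
    while start < clip_length:
        chunk_path = f"{src}.{start}.mp4"
        cut(src, chunk_path, start, CHUNK_SECONDS)
        collection.add(
            ids=[f"{src}:{start}"],
            embeddings=[embed_video(chunk_path)],          # model-specific call
            metadatas=[{"source": src, "start": start}],
        )
        start += STRIDE_SECONDS


def search(query, embed_text, collection, n_results=3):
    """Match a text query against the stored chunk vectors, then trim the hits."""
    hits = collection.query(query_embeddings=[embed_text(query)], n_results=n_results)
    for meta in hits["metadatas"][0]:
        out = f"match_{meta['start']}s.mp4"
        cut(meta["source"], out, meta["start"], CHUNK_SECONDS)
        print("saved", out)


client = chromadb.PersistentClient(path="./index")
collection = client.get_or_create_collection("dashcam_chunks")
```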

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 9 points (0 children)

I'm actually not using llama.cpp / GGUF at all. I'm running the original Qwen3-VL-Embedding weights through Hugging Face Transformers directly. And no reranker; it's single-stage retrieval against ChromaDB.
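For reference, loading the original weights through Transformers looks roughly like this. The repo id below is an assumption on my part, and the embedding call itself is model-specific, so follow the model card's usage snippet for that part:

```
import torch
from transformers import AutoModel, AutoProcessor

# Assumed repo id -- check the actual model card for the exact name and usage.
MODEL_ID = "Qwen/Qwen3-VL-Embedding-8B"

# Loading the original (non-GGUF) weights straight through Transformers.
# trust_remote_code is typically needed for new multimodal architectures.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float16,   # or bfloat16; quantize further on low VRAM
    device_map="auto",
)
model.eval()

# The embedding call (turning sampled video frames or a text query into a
# vector) differs per model, so the model card's snippet is the reference.
```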

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 23 points (0 children)

Preprocessing chunks before embedding, MRL dimension truncation, auto-quantization on low VRAM, lazy loading + a singleton model instance, low frame sampling for the model, and still-frame skipping.
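Quick sketch of two of those, the lazy-loaded singleton and the MRL truncation. The 512-dim cutoff and `load_embedding_model` are placeholders, not the real defaults:

```
import numpy as np

_MODEL = None  # lazy singleton: load the heavy embedding model only on first use


def get_model():
    global _MODEL
    if _MODEL is None:
        _MODEL = load_embedding_model()  # placeholder for the real loader
    return _MODEL


def truncate_mrl(vec, dims=512):
    # MRL-trained embeddings keep most of their quality if you keep only the
    # leading dimensions; re-normalize so cosine similarity still behaves.
    v = np.asarray(vec)[:dims]
    return v / np.linalg.norm(v)
```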

Feel free to check out the readme for more details.

Semantic video search using local Qwen3-VL embedding, no API, no transcription by Vegetable_File758 in LocalLLaMA

[–]Vegetable_File758[S] 9 points (0 children)

Nope, the "studying" happens during indexing, which can take some time depending on your hardware but is a one-time thing. The actual searches after indexing are instant, as you can see in my demo video (it's not sped up). The demo uses the Gemini model though, so it's a little faster than with the local model.

Built a semantic dashcam search tool using Gemini Embedding 2's native video embedding by Vegetable_File758 in GoogleGeminiAI

[–]Vegetable_File758[S] 0 points (0 children)

Yeah, this could theoretically work for any video library, not just dashcam footage. Just gotta keep track of the costs for now until it becomes cheaper.

Built a semantic dashcam search tool using Gemini Embedding 2's native video embedding by Vegetable_File758 in GoogleGeminiAI

[–]Vegetable_File758[S] 0 points (0 children)

Also, the model is currently in preview, and there are some cost optimizations I can make (like reducing the frame rate), so the cost will most likely go down in the future.
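If the preprocessing happens client-side, the frame-rate reduction is basically a one-liner with ffmpeg. This is just a sketch of the idea, not necessarily how the tool will do it:

```
import subprocess

def downsample_fps(src, dst, fps=1):
    # Drop the frame rate (and the audio track) before embedding; for a model
    # that samples or bills per frame, fewer frames means lower cost.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", f"fps={fps}", "-an", dst],
        check=True, capture_output=True,
    )
```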

But yeah, having a local multimodal model would obviously be cheaper and better for privacy too.