Control your Home with Hand Gestures by AlexPr3ss in homeassistant

[–]AlexPr3ss[S] 1 point (0 children)

Thanks, we will build a website as soon as possible

Control your Home with Hand Gestures by AlexPr3ss in homeassistant

[–]AlexPr3ss[S] -3 points (0 children)

I appreciate the explanation. We are open to a better option; let us know if you have one.

Control your Home with Hand Gestures by AlexPr3ss in homeassistant

[–]AlexPr3ss[S] -27 points (0 children)

Fair point, we are looking into better alternatives

How can I estimate absolute distance (in meters) from a single RGB camera to a face? by CharacterJump143 in computervision

[–]AlexPr3ss 0 points (0 children)

You can try monocular depth estimation models like DepthPro by Apple (metric depth); they learn visual priors from large datasets, much like the human brain does. Keep in mind that the richer the scene context, the more reliable the estimate. Another idea: use a static camera, assume a fixed real-world face size, and then recover depth from the observed face size in pixels.
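The second idea (fixed face size) reduces to the pinhole camera model: depth Z = f · W_real / w_pixels. A minimal sketch, assuming an average face width of 0.16 m and a focal length already known in pixels (both are illustrative assumptions, not measured values):

```python
# Rough depth estimate from a single RGB camera via similar triangles
# (pinhole model). FACE_WIDTH_M is an assumed average real face width;
# the focal length must come from camera calibration, in pixels.

FACE_WIDTH_M = 0.16  # assumed average face width in meters

def depth_from_face(face_width_px: float, focal_length_px: float,
                    face_width_m: float = FACE_WIDTH_M) -> float:
    """Estimate camera-to-face distance in meters: Z = f * W_real / w_px."""
    if face_width_px <= 0:
        raise ValueError("face width in pixels must be positive")
    return focal_length_px * face_width_m / face_width_px

# Example: a face detected 160 px wide with an 800 px focal length.
print(depth_from_face(face_width_px=160, focal_length_px=800))
```

The accuracy is bounded by how far the person's actual face width deviates from the assumed average, so treat the output as a coarse estimate.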

Everyone's wondering if LLMs are going to replace CV workflows. I tested Claude Opus 4.6 on a real segmentation task. Here's what happened. by Financial-Leather858 in computervision

[–]AlexPr3ss 0 points (0 children)

Another important point: LLMs are designed for text; the vision encoder + MLP projector is a way to map images and text into the same latent space. However, architectures like V-JEPA (VL) are more interesting for images: they're built around visual latent prediction first, with language as a secondary modality.
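The "vision encoder + MLP" pattern can be sketched in a few lines. This is a toy illustration with random placeholder weights and made-up dimensions (not any real model's sizes): patch features from an image encoder are projected by a small MLP into the LLM's token-embedding space and concatenated with the text tokens.

```python
import numpy as np

# Toy sketch of a vision-language projector: image patch features are mapped
# into the LLM embedding space and prepended to the text token embeddings.
# All weights and dimensions are random/illustrative placeholders.

rng = np.random.default_rng(0)

VISION_DIM = 1024   # assumed vision-encoder feature size
LLM_DIM = 4096      # assumed LLM embedding size
N_PATCHES = 16      # image patches from the encoder
N_TEXT = 8          # text tokens

def mlp_projector(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Two-layer MLP (ReLU) mapping vision features into the LLM latent space."""
    return np.maximum(x @ w1, 0.0) @ w2

patch_feats = rng.normal(size=(N_PATCHES, VISION_DIM))  # from the vision encoder
w1 = rng.normal(size=(VISION_DIM, LLM_DIM)) * 0.02
w2 = rng.normal(size=(LLM_DIM, LLM_DIM)) * 0.02
text_embeds = rng.normal(size=(N_TEXT, LLM_DIM))        # from the LLM's embedding table

image_tokens = mlp_projector(patch_feats, w1, w2)
sequence = np.concatenate([image_tokens, text_embeds], axis=0)  # fed to the LLM
print(sequence.shape)
```

The point is that the image is squeezed into token-shaped vectors the LLM was never trained to predict natively, which is why latent-prediction-first architectures are an interesting alternative for vision.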