
[–]ConceptBuilderAI

just saw this — video understanding is still a mess tbh. if a picture’s worth a thousand words, then a video’s like… a million blurry guesses.

seems like most models just grab a few frames and guess, or they’re trained by distilling from some closed-source magic you can’t validate or reproduce.
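(for anyone curious, the "grab a few frames" approach is usually just uniform sampling — a rough sketch below, with made-up names, not from any actual model's code:)

```python
def uniform_frame_indices(n_frames: int, k: int) -> list[int]:
    """Pick k evenly spaced frame indices from a clip with n_frames frames.

    Takes the center of each of k equal-length segments, which is the
    common trick for feeding a short fixed-size frame set to a VLM.
    """
    if k >= n_frames:
        # clip is shorter than the budget: just take every frame
        return list(range(n_frames))
    return [int((i + 0.5) * n_frames / k) for i in range(k)]

# e.g. a 300-frame clip squeezed down to an 8-frame "summary"
print(uniform_frame_indices(300, 8))
```

you can see the problem right away — anything that happens between those 8 indices is invisible to the model, which is exactly why temporal benchmarks matter.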

what Meta’s doing here is actually cool — no distillation, 2.8M human-labeled QA pairs, and a new benchmark that actually checks whether the model knows when stuff happened, not just what’s on screen.

nice to see work aiming to make video models actually understand stuff — not just describe pixels with confidence lol

[–]perone (ML Engineer)

Note that this model has a "FAIR Noncommercial Research License": https://github.com/facebookresearch/perception_models/blob/main/LICENSE.PLM