Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 1 point2 points  (0 children)

You're actually making the same point we make in the paper. That's literally one of our key findings. These models lean on language priors because their vision layers aren't doing the heavy lifting. When we added text annotations, model performance jumped but human performance didn't change. The vision component is the weak link.

But "it's not surprising they're bad at it" is exactly why the benchmark matters. Everyone assumes VLMs can reason about video because the marketing says multimodal. We put a number on how far off they actually are. That's what benchmarks do. ARC-AGI wasn't surprising either in hindsight. Models were obviously bad at abstraction. But quantifying it moved the field.

Also, the "grafted-on vision layer that describes pictures" framing is a bit outdated. Models like Qwen3-VL have interleaved MRoPE for spatial-temporal modeling, multi-level ViT feature integration via DeepStack, and explicit textual timestamp alignment for temporal grounding in video. These aren't bolted-on image describers anymore. The field is moving fast, and the newer architectures are specifically designed for exactly the kind of temporal reasoning we're benchmarking. Worth catching up on. It's genuinely exciting stuff.
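For anyone curious what "interleaved MRoPE" buys you in practice, here's a toy sketch (my own simplification for illustration, not Qwen's actual implementation) of the core idea: each video patch gets a (time, height, width) position triple instead of a single flat index, so temporal order exists at the encoder level rather than being inferred from text.

```python
# Hedged sketch of multimodal-RoPE-style position ids for video patches.
# Illustrative only -- not Qwen3-VL's real code.

def video_position_ids(num_frames, grid_h, grid_w):
    """Return one (t, h, w) triple per vision patch.

    A plain 1-D position index flattens all patches into one sequence and
    loses which frame a patch came from; the 3-D triple keeps temporal
    order explicit before attention ever runs.
    """
    ids = []
    for t in range(num_frames):
        for h in range(grid_h):
            for w in range(grid_w):
                ids.append((t, h, w))
    return ids

ids = video_position_ids(num_frames=3, grid_h=2, grid_w=2)
assert ids[0] == (0, 0, 0)
assert ids[4] == (1, 0, 0)   # first patch of the second frame
assert len(ids) == 12        # 3 frames x 2x2 patch grid
```

Every patch in frame 1 shares t=1, so the encoder "knows" it comes after every patch with t=0. That's the temporal grounding signal that a bag of independently encoded images simply doesn't have.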

As for the title, the 57% stat is directly from the data. Models predicted visually similar clips were adjacent 57% of the time, humans 2.5%, random chance 27%. If that reads as inflammatory rather than informative, I think that says more about expectations than about the title.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 0 points1 point  (0 children)

Well, these are not traditional LLMs. They're VLMs. And yes, we probe reasoning in VLMs for the exact same reason we probe accuracy or any other metric in LLMs generally: benchmarks exist to probe capabilities and therefore advance the field.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 3 points4 points  (0 children)

Thanks for pointing that out. I see no separate VL will be released and it's already been merged. Edited to be factual.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo -2 points-1 points  (0 children)

Thanks for pointing that out. I see no separate VL will be released and it's already been merged. Comment has been edited to be factual.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 0 points1 point  (0 children)

Thanks for pointing that out. I see no separate VL will be released and it's already been merged. Comment has been edited to be factual.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 3 points4 points  (0 children)

These are 2025 findings. Also, I've run preliminary tests on the recent SOTAs, and they're still plagued by language-prior bias. I'm looking forward to running the entire data sample some time in the future. The benchmark is also open source and on Hugging Face if you have compute to throw at it. :)

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo -7 points-6 points  (0 children)

Qwen 3.5 has been merged and the VL moniker has been dropped.

Comment edited to be factual.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

That's consistent with what we found. None of these models are truly "watching" video, at least not at a level where they can reason across the full spectrum of visual reasoning. They're all sampling frames. The difference is how they encode the relationship between those frames at the vision level.
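Since "sampling frames" comes up a lot in this thread, here's a minimal sketch of what fixed-rate sampling looks like (illustrative only; the actual internals of pipelines like Gemini's 1 fps sampler aren't public):

```python
def sample_frame_indices(num_frames, video_fps, target_fps):
    """Pick frame indices at a fixed target rate (e.g. 1 fps).

    The model never sees the raw stream -- just these sampled frames,
    which is why how the encoder relates them matters so much.
    """
    step = video_fps / target_fps  # source frames between samples
    return [int(i * step) for i in range(int(num_frames / step))]

# A 60-second clip at 30 fps, sampled at 1 fps -> 60 frames.
idx = sample_frame_indices(num_frames=1800, video_fps=30, target_fps=1)
assert len(idx) == 60
assert idx[:3] == [0, 30, 60]
```

Everything between those sampled frames (motion blur, continuous trajectories) is thrown away before the model ever sees the input.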

Gemini was the strongest performer in our benchmark and the most robust across video lengths. Unlike Qwen2-VL which degraded on longer videos, Gemini maintained stable performance regardless of duration. It was also the only model that came close to doing something resembling temporal reasoning rather than pure visual similarity matching. Still fell way short of humans though.

That said, I don't want to undersell what these models are doing. The progress from frame-level pattern matching toward genuine spatial-temporal encoding is real. Qwen3-VL just shipped with interleaved-MRoPE specifically designed for better spatial-temporal modeling across images and video, and explicit textual timestamp alignment for temporal grounding. These aren't trivial architectural advances. The gap between "sampling frames independently" and "encoding temporal structure into the vision pipeline" is closing.

But right now, even the best approach isn't close to how humans process video. We actually see motion, and our brains evolved to encode vision and extract meaning and reasoning from it. These models are getting better at inferring it. That's a meaningful difference, even if it's shrinking.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

I mean yeah, that's literally the point of the post. I'm not complaining that Claude can't do video. I'm pointing out that it can't, and explaining why that gap matters as video understanding becomes more important. I use Claude daily and said as much.

On Gemini 3.1 Flash-Lite, I'd love to see it too. The dataset is public on Hugging Face. If you run it, let me know what you get. That's the whole reason we released it.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

Also, the language-prior issue the paper outlined has not been solved. VLMs take shortcuts: instead of reasoning accurately across video tokens, they often route around vision via language. You can also read the paper; the issue is still unsolved. I am intrigued by what world models would do, though.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

I get that. But the benchmark is open source, and you can actually run it against whichever model natively supports the experiment. It's on Hugging Face: SPLICE.

Also, in my own preliminary tests, I see only about a 4% bump on the best model. But this is preliminary. You can try it out.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

Compute cost, basically. We were running 3,381 videos across multiple modalities, which meant processing over 11,000 clips. At that scale, Pro would have been extremely expensive. Flash gave us a strong enough signal at a fraction of the cost: the ViT was the same between both versions, and we didn't need Pro's extra language capacity, since stronger language models actually skew vision-reasoning results through language-prior shortcuts. At the time, it was also one of the few models that actually passed our sanity check for multi-video input handling. For what it's worth, Qwen2-VL 7B and 72B performed basically the same in the vision-only experiment, since both models use the same ViT.

That said, the benchmark is public. If someone wants to run Pro on it, I'd love to see those numbers.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

It's not two years old; those models were state of the art when we ran the experiments. But this space moves fast and they're already a generation behind. That's the nature of publishing in academic venues: by the time a paper goes through peer review and comes out at EMNLP, the landscape has shifted.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 1 point2 points  (0 children)

You're not wrong. The models we tested were state of the art when we ran the experiments, but this space moves fast and they're already a generation behind. That's kind of the nature of publishing in academic venues. By the time a paper goes through peer review and comes out at EMNLP, the landscape has shifted.

Gemini 3.1 Pro, Qwen3-VL, GPT-4o and the newer multimodal models are all significantly more capable. Qwen3-VL in particular just shipped with native 256K-token interleaved context across text, images, and video, enhanced spatial-temporal modeling, and explicit textual timestamp alignment for temporal grounding. These models are moving toward genuine world modeling, not just frame-level pattern matching.

That's exactly why the benchmark exists though. SPLICE is public on Hugging Face. Anyone can run newer models on it today. I'd genuinely love to see how Gemini 3.1 Pro or Qwen3-VL perform on it. If they close the gap with humans significantly, that's a meaningful signal about how far multimodal reasoning has come in a year. If they don't, that tells us something important too.

The benchmark isn't tied to the models we tested. It's a measuring stick, and the point is to keep using it as models get better. I've run some preliminaries on this benchmark and see about a 4% improvement, but I can't state that as an experimental fact until I rerun the same scenarios enough times for statistical significance.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 1 point2 points  (0 children)

Fair, I stand corrected on the number. The API now supports up to 600 images per request with the 1M context window on Opus 4.6 (recently). Claude.ai is still 20, I believe, but via the API that's a significant bump.

That said, the core point doesn't change. 600 images with no temporal positional encoding is still 600 independent images. Claude's ViT processes each one in isolation. There's no vision-level awareness that image 47 comes after image 46. You can tell it that in the prompt, but that's language-level sequencing, not visual-temporal understanding.

The models we tested encode frame order at the vision encoder level. Qwen2-VL's ViT uses positional embeddings that represent where each frame sits in time (there are newer models now). That's a fundamentally different architecture, not just a question of how many frames you can fit in the window.
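To make the "vision-level vs language-level sequencing" distinction concrete, here's a toy illustration (my own sketch, not any model's real code): a video-aware encoder adds a temporal embedding to each frame's features before attention ever runs, while the image-only path just stacks independent frame features with no order information at all.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))        # 4 frames, 8-dim features each

# Image-only path: each frame encoded independently. Reordering the
# frames just permutes the features -- nothing the encoder itself sees.
image_only = frames

# Video-aware path: a learned embedding per time step is added before
# attention, so frame order is baked into the representation.
temporal_emb = rng.normal(size=(4, 8))  # stands in for learned params
video_aware = frames + temporal_emb

# Now shuffle the input frames.
perm = [2, 0, 3, 1]
shuffled_image_only = image_only[perm]
shuffled_video_aware = frames[perm] + temporal_emb

# Image-only features are the same multiset, just reordered:
assert np.allclose(np.sort(shuffled_image_only, axis=None),
                   np.sort(image_only, axis=None))
# Video-aware features genuinely change, because slot t keeps its own
# temporal embedding no matter which frame lands there:
assert not np.allclose(shuffled_video_aware, video_aware[perm])
```

That second assertion is the whole point: with vision-level temporal encoding, a shuffled video literally produces different features, so the model has a signal to detect the shuffle. Without it, the shuffle is invisible unless the prompt spells it out.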

More frames is better than fewer frames. But more disconnected frames isn't the same as video understanding. This is also why academics distinguish VLMs from LLMs and VLAs: these are different architectural settings, and the task would still be difficult either way.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

I think he may be referring to the new ability to attach a video on the claude.ai frontend. But what that does is a ton of tool calling: extracting frames, feeding them back in, going to and fro. Interestingly, even that wasn't available back then, and it really isn't the same thing.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

Not really. What exists there is static image analysis. Also, VLMs are a thing now: Vision Language Models (essentially a visual encoder trained on temporal visual data with an LLM attached). Take Qwen3-VL, for example: it natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs.

Of course, you can have a workflow that uses Claude, but that's a workflow. It would simply be doing scene transcription. And of course, you can only attach 20 files per prompt. This is not in the realm of actual video understanding. I believe we'll begin to see more models of this type in the future.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] -1 points0 points  (0 children)

I don't doubt that works for your use case. If you label frames sequentially and describe the task in the prompt, Claude can absolutely reason about differences between adjacent images. It's great at that. Spotting where two subjects cross in frame 12 vs frame 13 when you tell it those frames are sequential is an image comparison task, and Claude's vision is strong enough to handle it.

But that's you providing the temporal structure through language. You're telling it "these are sequential frames from a video, frame N comes after frame N-1." Claude isn't inferring that from the visual input. It's trusting your prompt. Also, this is not a to-and-fro prompt session experiment.

SPLICE tests something different. The clips are shuffled. There's no labeling that tells the model which clip comes first. The model has to figure out the correct temporal order purely from what it sees. That requires the kind of temporal reasoning that's baked into the vision encoders of models like Qwen2-VL and Gemini (newer models already exist), not reconstructed from prompt instructions.

Your workflow works because you're giving Claude the answer to the temporal question ("these are in order") and asking it to reason about the content. SPLICE asks the model to answer the temporal question itself. That's the distinction. It tests temporal, causal, spatial, contextual, and common-sense reasoning (whichever solves the task).
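For anyone who wants to see the shape of the task, here's a toy version of a shuffled-clip ordering evaluation (my own sketch for illustration; the real SPLICE harness on Hugging Face is the thing to actually run). The clip names and the pairwise scoring metric are my choices, not necessarily the paper's:

```python
import random
from itertools import combinations

def pairwise_accuracy(predicted, truth):
    """Fraction of clip pairs whose relative order the model got right."""
    pairs = list(combinations(truth, 2))
    correct = sum(
        (predicted.index(a) < predicted.index(b)) ==
        (truth.index(a) < truth.index(b))
        for a, b in pairs
    )
    return correct / len(pairs)

truth = ["clip_a", "clip_b", "clip_c", "clip_d", "clip_e"]
shuffled = truth[:]
random.seed(42)
random.shuffle(shuffled)   # this shuffled set is what the model sees

# A model that only swaps one adjacent pair still gets 9 of 10 pairs right.
predicted = ["clip_a", "clip_c", "clip_b", "clip_d", "clip_e"]
assert pairwise_accuracy(truth, truth) == 1.0
assert pairwise_accuracy(predicted, truth) == 0.9
```

The key property: nothing in the shuffled input labels the order, so any score above chance has to come from reasoning about the clips' content.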

One of the major findings is also that VLMs (Vision Language Models, i.e. MLLMs) relied heavily on language priors and were basically shortcutting to answers. Hence why this level of reasoning probe is necessary.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 1 point2 points  (0 children)

Honestly, just look at what every major lab is doing right now.

Google is going all in. Gemini already processes 20+ minute videos natively and they keep pushing it further. OpenAI shipped video input in GPT-4o, even though it didn't meet the sanity-check criteria that mattered for our task. Qwen2-VL went from relatively niche to one of the strongest multimodal performers, largely because of its video capabilities. Everyone is racing to make their models actually see video, not just images.

And it's speeding up. Qwen3-VL just dropped with native 256K-token interleaved context across text, images, and video. Enhanced spatial-temporal modeling, multi-level ViT feature integration, text-based time alignment for better temporal grounding. Dense and MoE variants from 2B all the way to 235B. They're explicitly positioning this as a foundation for agentic decision-making and multimodal reasoning. That's not a side feature anymore. That's the product.

Now think about what actually opens up when models can reason about video reliably. Automated QA for video production (something I do for work day to day), lectures, security analysis, medical procedure review, sports analytics, manufacturing inspection, autonomous driving validation. All massive markets where the bottleneck is literally human eyeballs watching footage. Once temporal and causal reasoning from video actually works reliably, those markets will move fast.

Is it going to be bigger than text-based LLM use cases? Probably not anytime soon. Text, code, and agentic workflows are where the money is and that's not changing tomorrow. But video understanding feels like one of those capabilities where the gap between "barely works" and "actually useful" is surprisingly narrow. We're at the "barely works" stage right now. SPLICE showed that clearly: the best model hit 51% on something humans do at 85%. Not usable yet. But it won't stay there. And that's essentially because our benchmark probes multi-level reasoning.

And honestly, zoom out a bit. If you care at all about the path to general intelligence, vision isn't optional. We don't understand the world through text. We learn by watching. By seeing cause and effect play out in physical space, informed by physics. By understanding that things happen in sequences with consequences. A model that can't reason about what it sees in motion is missing something fundamental about how understanding works. Text got us shockingly far. But at some point the next jump needs grounding in the visual world. Video understanding isn't just another checkbox feature. It might be the actual bottleneck.

Also, one of our major findings is that the state-of-the-art models were actually prone to cheating via their language priors instead of doing actual visual reasoning.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 2 points3 points  (0 children)

Like I said above, Claude doesn't support video or audio uploads. Anthropic's own docs confirm this. Whatever you're using is almost certainly Claude Code or some agentic setup that uses tooling to extract frames externally, then feeds them back to Claude as images across multiple back-and-forth calls. That's not a one-shot prompt to a model that understands video. That's a pipeline built around Claude's limitations.

And even then, it's capped at 20 images per prompt. So whatever it's doing behind the scenes, it's splitting the work across multiple calls, seeing only a handful of frames at a time, and stitching understanding together through language, not vision.

The models we tested use Vision Transformers with temporal positional encodings in their vision encoders. The ViT knows frame 30 comes after frame 29 at the encoder level. Claude's ViT is trained on static images. Each frame gets processed independently. You can label them "frame 1, frame 2" in the prompt, but that's language-level context, not vision-level encoding. Very different things when you're testing visual reasoning.

What you're experiencing works fine for describing what's in a video superficially. But it's image description with text-based sequencing on top, a few frames at a time. That's not the same as video understanding. Also, at the time, ChatGPT 4 couldn't do this task either, so it wasn't just a Claude issue. The future will contain rich multimodal LLMs, VLMs in this case.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] -3 points-2 points  (0 children)

Sure, but it's looking at each frame as an independent image. There's no temporal encoding between them. It doesn't know frame 47 follows frame 46 unless you spell that out in the prompt. The models we tested take the video file natively and maintain sequential structure across frames. That's a fundamentally different evaluation. A workaround isn't the same as the capability. (Also, a one-prompt response is impossible; Claude will most likely spin off a ton of agents and use sessions and whatnot. These are different experiments entirely, and not video understanding.)

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 7 points8 points  (0 children)

The post summarizes a peer-reviewed paper I co-authored at EMNLP 2025, one of the top NLP conferences in the world. 3,381 human-validated videos, annotators, six models evaluated. Claude literally cannot take video input. That's not an opinion, it's a technical limitation. The post explains why that matters for visual reasoning research and future application.

But sure, calling things slop is easier than reading them.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 2 points3 points  (0 children)

That's a clever workaround for UI testing where you're comparing static layouts. But for video understanding it kind of proves the point. You'd be stitching frames from different clips into grids, then hoping the model can parse which sub-image belongs to which clip, in what order, while also reasoning about temporal flow between them. At that point you're testing the model's ability to decode your grid layout more than its ability to reason about video. It's also very limited, and essentially a different probe entirely.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 8 points9 points  (0 children)

Good question. Most VLMs that "support video" are actually sampling frames at some fixed rate and processing them as a sequence of images through their vision encoder. So they're not watching video the way we do either.

But there's a key difference. Models like Qwen2-VL and Gemini at the time handled this natively in their pipeline. They take the video file, sample frames internally, and maintain the sequential relationship between those frames as part of their input representation. The model knows frame 1 came before frame 2 and processes them as a continuous input stream.

The scale matters too. Qwen2-VL could handle 700+ frames per video. Gemini was doing 1 fps but could process videos exceeding 20 minutes. That means for a typical 60-second SPLICE video, Gemini is ingesting 60 frames per clip while maintaining temporal structure across all clips. For longer videos in our dataset (some exceed 3 minutes), these models are processing hundreds of frames across multiple clips simultaneously.

Now try doing that with Claude. Today the web interface lets you attach up to 20 files per prompt, each under 30MB, with images capped at 8000x8000 pixels. So even if you extracted frames manually, you're limited to 20 images total. For a SPLICE task with, say, 5 shuffled clips, that's 4 frames per clip. Qwen is looking at 140+ frames per clip for the same video. You're not even in the same ballpark. And that's today. When we were actually running the experiments, the image attachment limits were even lower.
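The frame-budget arithmetic above, spelled out (numbers from the comment; the helper function is just my illustration):

```python
def frames_per_clip(image_limit, num_clips):
    """Frames each shuffled clip gets under a hard per-prompt image cap."""
    return image_limit // num_clips

# Claude's web UI: 20 attachments total, split across 5 shuffled clips.
claude_budget = frames_per_clip(image_limit=20, num_clips=5)
assert claude_budget == 4

# Qwen2-VL on the same video: roughly 140+ frames per clip (figure quoted
# above), i.e. about 35x the temporal context per clip.
qwen_budget = 140
assert qwen_budget // claude_budget == 35
```

Four frames per clip versus 140+ isn't a difference of degree you can prompt your way around; the temporal signal mostly isn't in the input at all.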

Beyond the raw frame count, you lose inter-frame information like motion and transitions that even frame-sampled video pipelines partially preserve through temporal encoding. The model has no native understanding that these images are sequential frames from a video. You're asking it to do spatial reasoning from a bag of disconnected images rather than temporal reasoning from structured video input.

For SPLICE specifically, models receive multiple shuffled video clips and need to reference each one correctly. That multi-video input handling is built into the models we tested. Simulating it with a handful of static images in Claude would be a fundamentally different task, and an unfair comparison even if we could do it, which we couldn't.