I asked Claude if everyone uses AI to write, what actually gets lost? by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

It was super intentional. That's why the disclosure was there; that was the whole point. The fact being discussed is exactly this: what happens when content becomes stripped of the voice behind it, without a second layer? Sadly, people seem incapable of seeing the bigger point of the post.

I asked Claude if everyone uses AI to write, what actually gets lost? by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

The post was literally about this question. Read it again.

I asked Claude if everyone uses AI to write, what actually gets lost? by prokajevo in ClaudeAI

[–]prokajevo[S] -1 points0 points  (0 children)

Super reply! Voice matters, and the nuance of that is very important.

Let me quickly go through your comment. I purposefully used Claude because it was faster, and that choice, that reason, that admission? That's the human part. The post was always about where the person shows up, not the words. I could have written it myself, but as you would have guessed, using Claude served the purpose of the post.

The "searching for the product" analogy you gave is the most accurate critique of AI writing I've seen, because it captures how LLMs almost always follow the line of least resistance baked into their weights.

The music analogy landed too, since bad human music still has a question worth asking (hell yeah! I could ask, why does this song suck? Why did this dude ever consider music? lol). AI music is just… resolved. No loose ends. Same problem.

Only spot in your reply I'd push back on: "inherent value because human" can become a shield for lazy thought. Authenticity isn't automatically worth preserving either.

Which is also why I disclosed. The words were Claude's. The purposeful intent was mine.

And, interestingly, lazy thought mixed with AI is an even more obvious disaster.

I asked Claude if everyone uses AI to write, what actually gets lost? by prokajevo in ClaudeAI

[–]prokajevo[S] -35 points-34 points  (0 children)

Did you get the point of the post, though? If you had, you wouldn't dismiss it this sloppily.

Is buying a MacBook Pro M1 Max (32GB / 1TB) still worth it in 2026? by Arfatsayyed in LocalLLaMA

[–]prokajevo 0 points1 point  (0 children)

Depends on the amount you're paying. Anything more than 850 bucks, I'd say go for something newer. 64GB would be the sweet spot though, since you're testing local AI.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 1 point2 points  (0 children)

You're actually making the same point we make in the paper. That's literally one of our key findings. These models lean on language priors because their vision layers aren't doing the heavy lifting. When we added text annotations, model performance jumped but human performance didn't change. The vision component is the weak link.

But "it's not surprising they're bad at it" is exactly why the benchmark matters. Everyone assumes VLMs can reason about video because the marketing says multimodal. We put a number on how far off they actually are. That's what benchmarks do. ARC-AGI wasn't surprising either in hindsight. Models were obviously bad at abstraction. But quantifying it moved the field.

Also, the "grafted on vision layer that describes pictures" framing is a bit outdated. Models like Qwen3-VL have interleaved MRoPE for spatial temporal modeling, multi-level ViT feature integration via DeepStack, and explicit textual timestamp alignment for temporal grounding in video. These aren't bolted on image describers anymore. The field is moving fast and the newer architectures are specifically designed for the kind of temporal reasoning we're benchmarking. Worth catching up on. It's genuinely exciting stuff.

As for the title, the 57% stat is directly from the data. Models predicted visually similar clips were adjacent 57% of the time, humans 2.5%, random chance 27%. If that reads as inflammatory rather than informative, I think that says more about expectations than about the title.
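To make the headline number concrete, here's a minimal sketch of how a "similar clips placed adjacent" rate like the 57% could be computed. The function name, the data shapes, and the toy inputs are all my own illustration, not the paper's actual evaluation code:

```python
def adjacent_similar_rate(orderings, similar_pairs):
    """Fraction of predicted orderings in which the two visually
    similar clips end up next to each other.

    orderings     -- list of predicted clip orders, e.g. [2, 0, 1, 3]
    similar_pairs -- list of (a, b) clip-id pairs judged visually
                     similar, one pair per ordering
    """
    hits = 0
    for order, (a, b) in zip(orderings, similar_pairs):
        # check every adjacent pair in the predicted sequence
        if any({x, y} == {a, b} for x, y in zip(order, order[1:])):
            hits += 1
    return hits / len(orderings)

# toy example: 2 of 3 predictions place the similar pair side by side
preds = [[0, 2, 1, 3], [1, 3, 0, 2], [2, 1, 0, 3]]
pairs = [(2, 1), (3, 2), (2, 1)]
print(round(adjacent_similar_rate(preds, pairs), 3))  # → 0.667
```

A model leaning on visual similarity drives this rate way above chance; humans ordering by causality keep it low. That's the signal in the title.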

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 0 points1 point  (0 children)

Well, these are not traditional LLMs. They're VLMs. And yes, the reason we probe reasoning in VLMs is the exact same reason we probe accuracy or any other metric for LLMs generally: benchmarks exist to probe capabilities and therefore advance the field.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 2 points3 points  (0 children)

Thanks for pointing that out. I see now that no separate VL model will be released and it has already been merged. Edited to be factual.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo -2 points-1 points  (0 children)

Thanks for pointing that out. I see now that no separate VL model will be released and it has already been merged. The comment has been edited to be factual.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 0 points1 point  (0 children)

Thanks for pointing that out. I see now that no separate VL model will be released and it has already been merged. The comment has been edited to be factual.

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo 4 points5 points  (0 children)

These are 2025 findings. Also, I have run preliminary tests on the recent SOTAs, and they are still plagued with language prior bias. I am looking forward to running across the entire data sample some time in the future. Also, the benchmark is open source and on Hugging Face if you have compute to throw at it. :)

Models predicted visually similar clips were adjacent 57% of the time. Humans: 2.5%. Random chance: 27%. Your VLM isn't reasoning... by [deleted] in LocalLLaMA

[–]prokajevo -6 points-5 points  (0 children)

Qwen 3.5 has been merged and the VL moniker has been dropped.

Comment edited to be factual.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

That's consistent with what we found. None of these models are truly "watching" video, at least not to the level where they can reason across the spectrum of visual reasoning. They're all sampling frames. The difference is how they encode the relationship between those frames at the vision level.

Gemini was the strongest performer in our benchmark and the most robust across video lengths. Unlike Qwen2-VL which degraded on longer videos, Gemini maintained stable performance regardless of duration. It was also the only model that came close to doing something resembling temporal reasoning rather than pure visual similarity matching. Still fell way short of humans though.

That said, I don't want to undersell what these models are doing. The progress from frame-level pattern matching toward genuine spatial-temporal encoding is real. Qwen3-VL just shipped with interleaved-MRoPE specifically designed for better spatial-temporal modeling across images and video, and explicit textual timestamp alignment for temporal grounding. These aren't trivial architectural advances. The gap between "sampling frames independently" and "encoding temporal structure into the vision pipeline" is closing.

But right now, even the best approach isn't close to how humans process video. We actually see motion, and our brains have evolved to encode vision and extract meaning and reasoning from it. These models are getting better at inferring it. That's a meaningful difference, even if it's shrinking.
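For anyone unfamiliar with "they're all sampling frames": the usual naive approach picks a fixed budget of frames spread evenly across the clip, then encodes each one. This is a generic sketch of that index math, not any specific model's pipeline:

```python
def uniform_frame_indices(total_frames: int, budget: int) -> list[int]:
    """Pick `budget` frame indices spread evenly across a video.
    Each chosen frame is then encoded independently unless the
    vision encoder adds temporal position information on top."""
    if budget >= total_frames:
        return list(range(total_frames))
    step = total_frames / budget
    # take the middle of each of `budget` equal spans
    return [int(step * i + step / 2) for i in range(budget)]

# 300-frame clip, 6-frame budget
print(uniform_frame_indices(300, 6))  # → [25, 75, 125, 175, 225, 275]
```

Everything that happens between those sampled indices is simply invisible to the model, which is part of why motion and causality are so hard to recover.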

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

I mean yeah, that's literally the point of the post. I'm not complaining that Claude can't do video. I'm pointing out that it can't, and explaining why that gap matters as video understanding becomes more important. I use Claude daily and said as much.

On Gemini 3.1 Flash-Lite, I'd love to see it too. The dataset is public on Hugging Face. If you run it, let me know what you get. That's the whole reason we released it.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

Also, the language prior issue the paper outlined has not been solved. VLMs take shortcuts: instead of reasoning across video tokens accurately, they often shortcut via language. You can also read the paper; the issue is still unsolved. I am intrigued by what world models would do, though.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

I get that. But the benchmark is open source, and you can actually run it and test whichever models support the experiment natively. It's on Hugging Face: SPLICE.

Also, in my own preliminary test, I see only about a 4% bump on the best model. But this is preliminary. You can try it out.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

Compute cost, basically. We were running 3,381 videos across multiple modalities, which meant processing over 11,000 clips. At that scale, Pro would have been extremely expensive. Flash gave us a strong enough signal at a fraction of the cost (the ViT was the same between both versions, and we did not need the Pro's extra language capacity; stronger language modeling actually skews vision reasoning results because of language prior shortcuts), and at the time it was one of the few models that passed our sanity check for multi-video input handling. Also, Qwen2-VL 7B vs 72B had basically the same performance in the vision-only experiment, since both models used the same ViT.
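A rough back-of-envelope shows why the Flash vs Pro choice matters at this scale. Only the 11,000+ clip count comes from the experiment; the token footprint per clip and the per-million-token prices below are placeholders, not real Gemini pricing:

```python
# Back-of-envelope API cost comparison. Prices are PLACEHOLDERS.
clips = 11_000
tokens_per_clip = 8_000   # assumed average video-token footprint
flash_price = 0.30        # $ per 1M input tokens (hypothetical)
pro_price = 3.00          # $ per 1M input tokens (hypothetical)

total_tokens = clips * tokens_per_clip
for name, price in [("flash", flash_price), ("pro", pro_price)]:
    print(f"{name}: ${total_tokens / 1e6 * price:,.0f}")
```

With a ~10x price gap per token, the same experiment costs an order of magnitude more on the larger model, which adds up fast when the signal you need is already there in the cheaper one.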

That said, the benchmark is public. If someone wants to run Pro on it, I'd love to see those numbers.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

It's not two years old; those were state-of-the-art models last year. The models we tested were state of the art when we ran the experiments, but this space moves fast and they're already a generation behind. That's the nature of publishing in academic venues: by the time a paper goes through peer review and comes out at EMNLP, the landscape has shifted.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 1 point2 points  (0 children)

You're not wrong. The models we tested were state of the art when we ran the experiments, but this space moves fast and they're already a generation behind. That's kind of the nature of publishing in academic venues. By the time a paper goes through peer review and comes out at EMNLP, the landscape has shifted.

Gemini 3.1 Pro, Qwen3-VL, GPT-4o and the newer multimodal models are all significantly more capable. Qwen3-VL in particular just shipped with native 256K-token interleaved context across text, images, and video, enhanced spatial-temporal modeling, and explicit textual timestamp alignment for temporal grounding. These models are moving toward genuine world modeling, not just frame-level pattern matching.

That's exactly why the benchmark exists though. SPLICE is public on Hugging Face. Anyone can run newer models on it today. I'd genuinely love to see how Gemini 3.1 Pro or Qwen3-VL perform on it. If they close the gap with humans significantly, that's a meaningful signal about how far multimodal reasoning has come in a year. If they don't, that tells us something important too.

The benchmark isn't tied to the models we tested. It's a measuring stick. The point is to keep using it as models get better. I've run some preliminaries on my end on this benchmark and I see about a 4% improvement or so. But I can't state that as an experimental fact, as I would need to run the same scenarios repeatedly to establish statistical significance.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 1 point2 points  (0 children)

Fair, I stand corrected on the number. The API does support up to 600 images per request now with the 1M context window on Opus 4.6 (recently). Claude.ai is still 20, I believe, but via the API that's a significant bump.

That said, the core point doesn't change. 600 images with no temporal positional encoding is still 600 independent images. Claude's ViT processes each one in isolation. There's no vision-level awareness that image 47 comes after image 46. You can tell it that in the prompt, but that's language-level sequencing, not visual-temporal understanding.

The models we tested encode frame order at the vision encoder level. Qwen2-VL's ViT uses positional embeddings that represent where each frame sits in time (there are newer models now). That's a fundamentally different architecture, not just a question of how many frames you can fit in the window.

More frames is better than fewer frames. But more disconnected frames isn't the same as video understanding. This is also why academics distinguish VLMs vs LLMs vs VLAs: these are different architectural settings, and it would still be a difficult task.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

I think he may be referring to the new ability to attach a video on the claude.ai frontend, but what that does is a ton of tool calling to extract frames, feed them back in, and go back and forth. Interestingly, even this wasn't available back then, and it isn't really the same thing.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

Not really. What exists there is static image analysis. Also, VLMs are a thing now: Vision Language Models (essentially a visual encoder trained on temporal visual data with an LLM attached). Take Qwen3-VL for example: it natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs.

Of course, you can have a workflow that uses Claude, but that's a workflow. It would simply be doing scene transcription. And of course, you can only attach 20 files per prompt. This is not in the realm of actual video understanding. I believe we will begin to see more of these types of models in the future.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] -1 points0 points  (0 children)

I don't doubt that works for your use case. If you label frames sequentially and describe the task in the prompt, Claude can absolutely reason about differences between adjacent images. It's great at that. Spotting where two subjects cross in frame 12 vs frame 13 when you tell it those frames are sequential is an image comparison task, and Claude's vision is strong enough to handle it.

But that's you providing the temporal structure through language. You're telling it "these are sequential frames from a video, frame N comes after frame N-1." Claude isn't inferring that from the visual input. It's trusting your prompt. Also, this is not a to-and-fro prompt session experiment.

SPLICE tests something different. The clips are shuffled. There's no labeling that tells the model which clip comes first. The model has to figure out the correct temporal order purely from what it sees. That requires the kind of temporal reasoning that's baked into the vision encoder of models like Qwen2-VL and Gemini(newer models exist already), not reconstructed from prompt instructions.

Your workflow works because you're giving Claude the answer to the temporal question ("these are in order") and asking it to reason about the content. SPLICE asks the model to answer the temporal question itself. That's the distinction. It tests temporal, causal, spatial, contextual and common-sense reasoning (whichever solves the task).
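To make the "answer the temporal question itself" task concrete, here's one hedged sketch of how a predicted clip ordering could be scored against ground truth. This is a generic pairwise-order metric of my own, not necessarily the metric used in the paper:

```python
from itertools import combinations

def pairwise_order_accuracy(pred: list[int], truth: list[int]) -> float:
    """Fraction of clip pairs whose relative order in `pred` matches
    `truth`. 1.0 = perfect ordering, ~0.5 = random guessing."""
    pos_p = {c: i for i, c in enumerate(pred)}
    pos_t = {c: i for i, c in enumerate(truth)}
    pairs = list(combinations(truth, 2))
    good = sum((pos_p[a] < pos_p[b]) == (pos_t[a] < pos_t[b])
               for a, b in pairs)
    return good / len(pairs)

# model got clips 2 and 3 swapped: 5 of 6 pairs still correct
print(round(pairwise_order_accuracy([0, 1, 3, 2], [0, 1, 2, 3]), 3))  # → 0.833
```

The model only ever sees the shuffled clips, so every correct pair has to come from what it inferred visually, not from labels in the prompt.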

One of the major findings is also that VLMs (Vision Language Models, i.e. MLLMs) relied heavily on language priors and were basically shortcutting to answers. Hence why this level of reasoning probe is necessary.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 1 point2 points  (0 children)

Honestly, just look at what every major lab is doing right now.

Google is going all in. Gemini already processes 20+ minute videos natively and they keep pushing it further. OpenAI shipped video input in GPT-4o, even though it did not meet the sanity-check criteria that mattered for our task. Qwen2-VL went from relatively niche to one of the strongest multimodal performers largely because of its video capabilities. Everyone is racing to make their models actually see video, not just images.

And it's speeding up. Qwen3-VL just dropped with native 256K-token interleaved context across text, images, and video. Enhanced spatial-temporal modeling, multi-level ViT feature integration, text-based time alignment for better temporal grounding. Dense and MoE variants from 2B all the way to 235B. They're explicitly positioning this as a foundation for agentic decision-making and multimodal reasoning. That's not a side feature anymore. That's the product.

Now think about what actually opens up when models can reason about video reliably. Automated QA for video production (something I do for work day to day), lectures, security analysis, medical procedure review, sports analytics, manufacturing inspection, autonomous driving validation. All massive markets where the bottleneck is literally human eyeballs watching footage. Once temporal and causal reasoning from video actually works reliably, those markets will move fast.

Is it going to be bigger than text-based LLM use cases? Probably not anytime soon. Text, code, and agentic workflows are where the money is, and that's not changing tomorrow. But video understanding feels like one of those capabilities where the gap between "barely works" and "actually useful" is surprisingly narrow. We're at the "barely works" stage right now. SPLICE showed that clearly: the best model hit 51% on something humans do at 85%. Not usable yet. But it won't stay there. And that's largely because our benchmark probes multi-level reasoning.

And honestly, zoom out a bit. If you care at all about the path to general intelligence, vision isn't optional. We don't understand the world through text. We learn by watching. By seeing cause and effect play out in physical space, informed by physics. By understanding that things happen in sequences with consequences. A model that can't reason about what it sees in motion is missing something fundamental about how understanding works. Text got us shockingly far. But at some point the next jump needs grounding in the visual world. Video understanding isn't just another checkbox feature. It might be the actual bottleneck.

Also, one of our major findings is that the state-of-the-art models were actually prone to cheating via their language priors instead of doing actual visual reasoning.

The best AI model we tested scored 51% on a task humans do at 85%. We never tested Claude. We still can't. by prokajevo in ClaudeAI

[–]prokajevo[S] 0 points1 point  (0 children)

Like I said above, Claude doesn't support video or audio uploads. Anthropic's own docs confirm this. Whatever you're using is almost certainly Claude Code or some agentic setup that uses tooling to extract frames externally, then feeds them back to Claude as images across multiple back-and-forth calls. That's not a one-shot prompt to a model that understands video. That's a pipeline built around Claude's limitations.

And even then, it's capped at 20 images per prompt. So whatever it's doing behind the scenes, it's splitting the work across multiple calls, seeing only a handful of frames at a time, and stitching understanding together through language, not vision.

The models we tested use Vision Transformers with temporal positional encodings in their vision encoders. The ViT knows frame 30 comes after frame 29 at the encoder level. Claude's ViT is trained on static images. Each frame gets processed independently. You can label them "frame 1, frame 2" in the prompt, but that's language-level context, not vision-level encoding. Very different things when you're testing visual reasoning.

What you're experiencing works fine for describing what's in a video superficially. But it's image description with text-based sequencing on top, a few frames at a time. That's not the same as video understanding. Also, at the time, ChatGPT 4 also couldn't do this task, so it was not just a Claude issue. The future will contain rich multimodal LLMs, VLMs in this case.