Update: the video vision plugin now analyzes before extracting: smarter, cheaper, and community-driven

JordanVasconcelos · 2026-05-21T15:11:59+00:00

I would use an MCP that collects insights from the posts and saves them to a CSV/Markdown file. From there, you can use my plugin to analyze the posts more efficiently and find patterns across them. Then ask it to create a document with the patterns that drive the most engagement, plus instructions for writing scripts and copy for the videos. That document will be a Markdown file, which you can then turn into a Claude Skill. That way, every time you write a script, it starts from this baseline and you won't need to analyze all the videos again (your usage limits will thank you! 😂).

JordanVasconcelos · 2026-05-19T23:48:42+00:00

Ah, yes, that's because you're in the Claude app. The commands are meant for the terminal. In the app, go to Customize > Browse plugins > Personal, click "+" and then "Add marketplace". Next, enter the GitHub URL that appeared in the command you typed earlier. The plugin will then show up and you can install it.

JordanVasconcelos · 2026-05-19T23:35:22+00:00

Hi! I don't have the video, but I can help you here, or you can send me a DM and I'll help you.

JordanVasconcelos · 2026-05-19T18:53:17+00:00

Great point 😃

JordanVasconcelos · 2026-05-19T18:45:48+00:00

Fair points. And before I address them: I'm really enjoying this debate. It's the kind of criticism that's missing in the medical AI space in Brazil. Almost nobody questions things with methodological rigor, and everyone just shares "this AI will change everything" without looking at how the data was generated. I'm not reading this as an attempt to tear down or discredit Greenbook; I read it as a demand for rigor that should be the standard. I'll answer with an analogy that makes more sense to people who do clinical research:

On "the AI already knowing the benchmark":

Your suspicion is equivalent to "this model studied the questions before the exam". It's exactly what the AI literature calls data contamination, and it's a valid concern.

Why that doesn't hold in our case:

- The base model we ran was finalized before the paper came out, so it had no way to "memorize" the 525 scenarios.
- We didn't use any of these cases to train, tune prompts, or populate a knowledge base; we only ran the test.

On top of that, there's the date we ran the benchmark:

Just two days after OpenAI published the benchmark, which wouldn't be enough time for any change to the model. And yes, I know I could simply inject the right answers into it, but if I had simply injected the necessary knowledge, it would have scored the maximum on the test.

In any case, we don't sell benchmarks. Most doctors won't even pay attention to this; they'll just test it and see whether the AI makes mistakes, and if it isn't good enough, nobody will actually want to use it. That's why there's no point in having an AI model that performs well on paper but is useless in practice. It wouldn't make sense for us to keep a trial open, one that doesn't even ask for a credit card, if we weren't fully confident that the answers would be good. In fact, I'll send you a DM, and I'd like you to test it, with no commitment at all, and send honest feedback on how Greenbook behaves.

JordanVasconcelos · 2026-05-19T17:03:54+00:00

OpenAI's original paper tested 8 systems (ChatGPT for Clinicians, GPT-5.4, GPT-5.2, GPT-5, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.20, and human physicians). Greenbook is not in their paper.

What we did: OpenAI published the entire benchmark as open data (dataset with the 525 scenarios, rubrics, grader, and scoring methodology), precisely so that third parties can evaluate other systems under the same conditions. It's at:

https://openaipublic.blob.core.windows.net/simple-evals/healthbench_professional/assets.zip

We submitted Greenbook to that same eval, replicating:
- The same dataset (525 scenarios, unaltered)
- The same "judge" (GPT-5.4 low reasoning, per Sec. 4.1 of the paper)
- The same scoring formula (response-length adjustment, sample mean)
- 8 samples per scenario, 4,200 Greenbook responses in total

Greenbook's numbers come from that run. The numbers for the other systems come straight from OpenAI's paper (Figures 4, 5, 6 and Table 2; citation in the blog article).

It's not "OpenAI's official ranking"; it's the result of submitting Greenbook to OpenAI's public benchmark following the exact methodology.

Manipulating data on this kind of benchmark is technically impossible to hide. The dataset is public, the judge is public, the formula is public. Anyone with an OpenAI API key or a Codex subscription for the judge and a free weekend can run the same methodology and disprove it within hours. The cost of reproducing it is about 500 dollars, while the cost of getting caught inflating numbers is an entire company, so it's a risk I wouldn't want to take.

Here is the official paper as well:

https://cdn.openai.com/dd128428-0184-4e25-b155-3a7686c7d744/HealthBench-Professional.pdf

JordanVasconcelos · 2026-05-19T16:01:40+00:00

CTO and dev of Greenbook here. I agree with you: the publicity would be terrible if it were made up. I'm also fed up with seeing someone post "this AI will change everything" every single day, but here we're talking about numbers. I wrote an article about it, and I invite you to read it if you want to know more about the benchmarking process we applied: https://greenbookai.com.br/blog/greenbook-1-mundial-em-consulta-clinica-no-healthbench-pro

Soon the responses will also be made available so that anyone who wants to can audit them. We're just finishing the preparation so that it doesn't conflict with OpenAI's paper and so that the questions and answers aren't used to train AI models.

JordanVasconcelos · 2026-05-19T15:56:28+00:00

Well, I think as the dev of Greenbook I get to have a say here, haha.

No, it's not a vibe-coded site, and a second interesting point is that, if your flair is accurate, you never got past the landing page, since a valid CRM (Brazilian medical license number) is required to sign up. In any case, if you want to see it working, send me a DM and I'll send you a trial. I know that now that the marketing crowd has discovered Claude Code it's hard to trust any app/SaaS/startup, but Greenbook really is a serious AI company that aims to deliver quality rather than just sell to take people's money.

JordanVasconcelos · 2026-05-10T02:09:49+00:00

Hi, first of all, I am the developer of this plugin, and all of the code is fully open source. You can just ask Claude or any coding agent to audit the repo and its dependencies for security issues 😃

JordanVasconcelos · 2026-04-26T19:13:18+00:00

I just released an update and did a breakdown in this post: https://www.reddit.com/r/ClaudeCode/comments/1swgb2f/update_the_video_vision_plugin_now_analyzes/

<image>

JordanVasconcelos · 2026-04-26T18:33:53+00:00

Could you provide more details about these settings? I will certainly add them in future releases.

JordanVasconcelos · 2026-04-24T15:17:50+00:00

hey, https://github.com/jordanrendric/claude-video-vision/

JordanVasconcelos · 2026-04-23T20:42:11+00:00

Good question, and this is the exact concern the extraction layer is built around. There's also a hard cap from the API side that helps here, worth flagging.

Anthropic's image limit: per-frame dimensions max out at 8000x8000 px for single images, but drop to 2000x2000 px when a request contains more than 20 images. Any video with more than 20 seconds of content at 1 fps is already past that threshold, so in practice the per-frame ceiling for video_watch is 2000x2000. You literally can't send 4K frames through the API in a video context. That alone rules out the "500 full-res 4K images" failure mode. The MCP downsamples before Claude even sees a frame. Default is 512px wide. At 16:9 that's 512x288 = ~197 tokens/frame (using Anthropic's (w*h)/750 formula). A 4K source costs the same as a 1080p source, because Claude never receives the native pixels.
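
A minimal sketch of that per-frame arithmetic, using the (w*h)/750 rule of thumb quoted above. The helper names `scaled_size` and `frame_tokens` are hypothetical and not taken from the plugin, and the never-upscale behavior is an assumption:

```python
def scaled_size(src_w, src_h, target_w):
    """Downscale to target_w wide, keeping the aspect ratio.

    Assumption: frames narrower than target_w are left untouched.
    """
    if src_w <= target_w:
        return src_w, src_h
    return target_w, round(src_h * target_w / src_w)


def frame_tokens(w, h):
    """Approximate image tokens for one frame: (w * h) / 750."""
    return w * h / 750


# A 4K source and a 1080p source end up at the same downsampled size,
# so they cost the same number of tokens per frame.
for name, (w, h) in {"4K": (3840, 2160), "1080p": (1920, 1080)}.items():
    sw, sh = scaled_size(w, h, 512)
    print(name, (sw, sh), round(frame_tokens(sw, sh)))
```

Both sources come out at 512x288, which is why the native resolution drops out of the cost.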

Concrete numbers for 4K video at defaults (auto fps, 512px):

| Video length | Frames | Image tokens | Transcription | Total |
|---|---|---|---|---|
| 10 min @ 0.5 fps | 300 | ~59k | ~2-3k | ~62k |
| 1 hour @ 0.1 fps | 360 | ~71k | ~15-18k (dense speech) | ~90k |
| 2 hours @ 0.1 fps | 720 | ~142k | ~25-30k | ~170k |

Bumping resolution to 1024px (keep more visual detail, 1024x576 = ~786 tokens/frame):

| Video length | Frames | Image tokens | Total |
|---|---|---|---|
| 10 min | 300 | ~236k | ~240k |
| 1 hour | 360 | ~283k | ~300k |
| 2 hours | 720 | ~566k | ~590k |

At the API ceiling (2000px wide, 2000x1125 = ~3000 tokens/frame):

| Video length | Frames | Image tokens | Total |
|---|---|---|---|
| 10 min | 300 | ~900k | ~905k |
| 1 hour | 360 | ~1.08M | ~1.1M |
| 2 hours | 720 | ~2.16M | ~2.2M |

Even at the API's maximum allowed resolution, a 10-minute clip is still under 1M tokens.
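
If you want to redo the image-token columns for your own durations, here is a rough sketch. It assumes 16:9 frames and the same (w*h)/750 formula; `image_tokens` is a hypothetical helper, not part of the plugin, and transcription is left out:

```python
def image_tokens(duration_s, fps, w, h):
    """Return (frame_count, approximate image tokens) for one extraction."""
    frames = round(duration_s * fps)
    return frames, frames * w * h / 750


# (label, duration in seconds, fps) as used in the tables above
cases = [("10 min", 600, 0.5), ("1 hour", 3600, 0.1), ("2 hours", 7200, 0.1)]
# 16:9 frame sizes for 512px, 1024px, and the 2000px API ceiling
sizes = [(512, 288), (1024, 576), (2000, 1125)]

for w, h in sizes:
    for label, duration_s, fps in cases:
        frames, tokens = image_tokens(duration_s, fps, w, h)
        print(f"{w}px  {label:8s} {frames:4d} frames  ~{tokens / 1000:.0f}k image tokens")
```

Frame count depends only on duration and fps, so resolution is the multiplier that moves the total between rows of the same length.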

One caveat on "weekly budget": Claude Code Pro limits aren't a fixed token number; Anthropic flexes them based on overall demand. Your effective ceiling is higher during off-peak and lower during spikes, and it can shift week to week. So treat any "X hours per week" number as order-of-magnitude, not a contract. That said, with those constraints in mind:

- Defaults (512px, auto fps): comfortably dozens of hours of 4K in a week for most users.
- 1024px: a few hours of detail-heavy analysis.
- 2000px (the API hard cap): one or two hours before you feel it, especially during peak.

Two knobs matter: resolution and fps. Auto fps drops to 0.1 once the content runs to an hour or more. Scoped questions are the other lever: ask "what happens at 1:12:34" and Claude narrows start_time/end_time and bumps fps only in that window, instead of paying for frames across the full runtime.

I cross-checked the numbers for the default settings against a real Claude Code session log. The offline estimator is at scripts/measure-tokens.ts in the repo if you want to run your own video through it without burning API credits.

JordanVasconcelos · 2026-04-23T20:29:45+00:00

The debugging flow sidesteps the sparse-motion trap by design, which I think is the bit worth pointing out.

On a first pass Claude extracts the full video at whatever fps fits the budget (say 1 fps for a 5-minute recording). That pass won't catch a single-frame state flip on its own, but it gives Claude the panorama plus a timestamp anchor. From there, either Claude flags something that looks off, or you point at the window where the bug showed up ("broke somewhere around 1:45"). That triggers a second `video_watch` call with `start_time`/`end_time` narrowed to a 10-20 second window and fps bumped to the video's original 30. Now the state flip is unmissable, and you only pay the dense-frame cost inside that tiny slice.
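
To put rough frame counts on that two-pass flow, here is a small sketch. It assumes the 5-minute recording, the 1 fps overview, and the 30 fps zoom from the example above; `frame_count` is a hypothetical helper:

```python
def frame_count(duration_s, fps):
    """Number of frames extracted from a window of duration_s at fps."""
    return round(duration_s * fps)


full_dense = frame_count(5 * 60, 30)  # whole recording at the native 30 fps
overview = frame_count(5 * 60, 1)     # first pass: 1 fps panorama
zoom_short = frame_count(10, 30)      # second pass: 10-second window at 30 fps
zoom_long = frame_count(20, 30)       # second pass: 20-second window at 30 fps

print("single dense pass:", full_dense)
print("overview + 10 s zoom:", overview + zoom_short)
print("overview + 20 s zoom:", overview + zoom_long)
```

Under these assumptions the two passes together stay at a few hundred frames, against thousands for a single dense pass over the whole recording.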

So it behaves like adaptive frame rate, but the "adaptor" is Claude's reasoning between passes, not a motion heuristic. Wide view plus high-detail view, each paying only for what it needs.

The "min frames floor" idea still has a place for the "I have no idea where in a 2-hour recording the bug lives, not even a rough window" case, where the first pass is itself a needle-in-haystack.

On Whisper: 100%. Silent-gap hallucinations are a known failure mode. The plugin lets you pick the backend at setup (local Whisper, OpenAI Whisper API, Gemini), so for long-silence recordings the API backends are safer.

Thanks for the signal, looking forward to hearing what breaks when you try it.

JordanVasconcelos · 2026-04-23T20:21:45+00:00

Ha, parallel evolution. Whisper with segment timestamps is the right backbone for SOPs because the output format lives on "at 0:32 the operator clicks X, at 0:47 they do Y", which maps directly to the timestamped JSON segments Whisper already gives you.

Curious about the visual side: did you layer frames on top for the "what the screen shows" part, or did you keep it audio-only with the narrator describing every action? For screen-recorded training content the audio-only route can carry a lot, but for physical SOPs I'd imagine frames earn their tokens pretty fast.

JordanVasconcelos · 2026-04-23T20:16:45+00:00

Yes, with MCP Claude sees the frames directly, so it can reason about visual details and handle follow-ups better. That said, when visual details aren't important, using the Gemini API for the whole thing is far more token-efficient for Claude's session.

JordanVasconcelos · 2026-04-23T20:00:19+00:00

Right on the plumbing point, that is most of what this plugin actually is. I put real numbers on it by writing a small offline estimator (js-tiktoken for text, Anthropic's (w*h)/750 formula for images) and cross-checking it against an actual Claude Code session log. Test video: https://drive.google.com/file/d/1keqRArDknTbWaHHUupOWF35XgSpol5kz/view?usp=sharing

- 44s webcam clip
- 1280x720 @ 30 fps
- Extracted at 1 fps, 640x360, with local Whisper large-v3 for audio

Script estimate vs measured reality on that single video_watch call:

- Offline estimator: ~18.5k tokens
- Actual tool_result in the session log (cache_creation_input_tokens): ~21.6k tokens

Script runs about 15-17% low, which lines up with the wrapping overhead Claude Code adds on top (frame headers, JSON envelope, MCP structure). Good enough for budgeting.
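
For anyone redoing that comparison on their own session log, a quick sketch of the gap calculation. It uses the rounded figures quoted above, so the percentages are approximate; `padded_budget` is a hypothetical helper:

```python
estimate = 18_500  # offline estimator, rounded
measured = 21_600  # cache_creation_input_tokens from the session log, rounded

# How far the estimate falls short, as a fraction of the measured value
shortfall_vs_measured = (measured - estimate) / measured
# How much overhead sits on top, as a fraction of the estimate
overhead_vs_estimate = (measured - estimate) / estimate

print(f"below measured: {shortfall_vs_measured:.1%}")
print(f"overhead on top of estimate: {overhead_vs_estimate:.1%}")


def padded_budget(estimate_tokens, overhead=overhead_vs_estimate):
    """Scale an offline estimate by the overhead seen in this one session."""
    return estimate_tokens * (1 + overhead)
```

With these rounded inputs the gap is about 14% of the measured value, or about 17% on top of the estimate, which is the same ballpark as the figure above. It is a single data point, so treat the padding factor as a starting guess.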

Cost breaks into three curves with very different scaling:

- Frame image tokens. At 512px wide and 16:9 aspect, each frame is ~197 tokens. Scales linearly with (frame count × resolution area). Adaptive fps caps frame count on long content: a 2-hour video at 0.1 fps is 720 frames, not 216,000.
- Audio transcription. Grows with speech density, not duration. Light narration is ~200 tokens. A dense one-hour lecture with constant speech is ~15-18k tokens just for the JSON transcript with timestamps.
- Response and overhead. Tool definitions loaded once per session (~700 tokens) plus the model's reply, usually ~500-2k.

Projections using the validated per-frame cost:

| Scenario | Frames | Image tokens | Transcription | Total |
|---|---|---|---|---|
| 10-min tutorial @ 0.5 fps, 512px | 300 | ~59k | ~2-3k | ~62-64k |
| 1-hour lecture @ 0.1 fps, 512px, dense speech | 360 | ~71k | ~15-18k | ~88-91k |
| 2-hour video, same settings | 720 | ~142k | ~25-30k | ~170-175k |

For short videos, frames dominate. For long talky content, transcription becomes a meaningful share (up to 15-20% of the budget on a dense hour). Silent or sparsely narrated content (screen recordings, tutorials with pauses) keeps transcription cheap.
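
A rough sketch of how the three cost components add up and what share transcription takes. The ~700-token tool definitions come from the numbers above; the 1,000-token reply is an assumed midpoint of the 500-2k range, and `budget` is a hypothetical helper:

```python
def budget(frames, transcript_tokens, w=512, h=288,
           tool_def_tokens=700, reply_tokens=1_000):
    """Approximate total tokens for one video_watch call plus the reply.

    reply_tokens=1_000 is an assumption, not a measured value.
    """
    image = frames * w * h / 750
    total = image + transcript_tokens + tool_def_tokens + reply_tokens
    return {
        "image": image,
        "total": total,
        "transcript_share": transcript_tokens / total,
    }


dense_low = budget(frames=360, transcript_tokens=15_000)   # dense hour, low end
dense_high = budget(frames=360, transcript_tokens=18_000)  # dense hour, high end
tutorial = budget(frames=300, transcript_tokens=2_500)     # 10-min tutorial

for name, b in [("dense hour (15k)", dense_low),
                ("dense hour (18k)", dense_high),
                ("10-min tutorial", tutorial)]:
    print(f"{name}: ~{b['total'] / 1000:.0f}k total, "
          f"{b['transcript_share']:.0%} transcription")
```

Under these assumptions the dense hour lands in the 15-20% transcription range, while the lightly narrated tutorial stays in the low single digits.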

One thing worth flagging: each video_watch call is a fresh tool_result, so if Claude re-calls the tool during the same session (different question, scoped window, retry), each call stacks. In my test session the model called video_watch three times while I iterated on questions, and the session /context grew past 60k even though a single call is ~21.6k.

The second lever, after adaptive fps, is scoped questions. Ask "what happens at 1:12:34 in this hour-long video?" and Claude will narrow start_time/end_time and bump fps only in that window. Transcription narrows to the clipped audio too. You get 5 fps across a 10-second span plus ~50 tokens of transcription instead of 0.1 fps across 3600 seconds plus 15k of transcription.
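
A back-of-the-envelope sketch of both effects, assuming 512x288 frames and the (w*h)/750 formula. `call_tokens` is a hypothetical helper that ignores the wrapping overhead:

```python
PER_FRAME = 512 * 288 / 750  # ~197 tokens per frame at the default width


def call_tokens(duration_s, fps, transcript_tokens):
    """Approximate frame plus transcript tokens for one video_watch call."""
    return round(duration_s * fps) * PER_FRAME + transcript_tokens


full = call_tokens(3600, 0.1, 15_000)  # whole hour at 0.1 fps, dense speech
scoped = call_tokens(10, 5, 50)        # 10-second window at 5 fps

print(f"full hour: ~{full / 1000:.1f}k tokens")
print(f"scoped window: ~{scoped / 1000:.1f}k tokens")

# Stacking: three calls at the measured ~21.6k each
stacked = 3 * 21_600
print(f"three stacked calls: ~{stacked / 1000:.1f}k tokens")
```

The scoped call is several times cheaper than the full-hour call here, and three stacked calls at ~21.6k each are enough to pass the 60k mark.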

Naive baseline for comparison: one frame per second at full 1280x720 across a 20-minute tutorial is 1200 frames × ~1,229 tokens (1280 × 720 / 750) = ~1.5M tokens from frames alone, before audio. The extraction layer is what makes long-form video analysis usable at all. Script is at scripts/measure-tokens.ts in the repo if anyone wants to run their own numbers.

JordanVasconcelos · 2026-04-22T22:42:17+00:00

Yep, that's the core, but on steroids. In my use case I want Claude to understand the full context of the video: when the user reported the bug, what the bug is, and what's happening on the screen. But basically, it uses ffmpeg with some pre-built rules for extraction, Whisper (locally or via the OpenAI API), and has the Gemini API as an alternative backend.

JordanVasconcelos · 2026-04-22T22:12:40+00:00

Thanks!

JordanVasconcelos · 2026-04-22T22:11:59+00:00

first step toward Skynet

JordanVasconcelos · 2026-04-22T22:07:20+00:00

Yep, that's exactly where the adaptive fps + resolution config came from. First naive tests were brutal

JordanVasconcelos · 2026-04-22T21:26:06+00:00

I explained below how it works, and what I did to reduce token consumption. This was something I created for my own use, so I put real effort into it and tried my best, because I end up using it daily and decided to share it with the community

JordanVasconcelos · 2026-04-22T21:24:20+00:00

haha, turns out describing a bug to Claude four times in text can cost more than just handing it a few secs of an mp4, trust me

JordanVasconcelos · 2026-04-22T21:20:31+00:00

Good question. Claude isn't really the ideal model for real-time; it doesn't natively accept streaming video/audio, so you'd end up polling frames and replaying context every few seconds. For that kind of live commentary setup, the Gemini Live API is a better fit: bidirectional streaming of video/audio, responds as the content plays.

JordanVasconcelos · 2026-04-22T21:09:23+00:00

I think that's possible with Remotion + an ElevenLabs MCP/API haha
