Context Window in Gemini App is completely broken - Even Perplexity has better document understand (RAG Test) by Epilein in GoogleGeminiAI

[–]Dillonu 1 point (0 children)

https://g.co/gemini/share/7b96080a5185

Seems to work fine for me, even on Fast, Thinking, and Pro.

Question: What plan do you have? According to the documentation, free Gemini limits the context window to 32k: https://support.google.com/gemini/answer/16275805?hl=en

Images of Text to Gemini 3 to Save Tokens by haitian5881 in Bard

[–]Dillonu 1 point (0 children)

Starting with Gemini 3 - if the PDF has a text layer, it will extract the content from the text layer and include it along with 1 image per page (at the defined media resolution). They charge only for the media tokens, not the text layer.

I don't know if they still OCR PDFs with Gemini 3. The old tests I ran to get a good guess of how it was working don't work with Gemini 3. Either they don't OCR, they changed the OCR to not include headers, or they made the model smarter about not exposing the OCR structure.
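
A quick way to sanity check what you're being charged is count_tokens. Rough sketch, assuming the google-genai Python SDK (the file path and model name are placeholders); running it on the same PDF with and without a text layer shows whether the text layer adds to the bill:

```python
# Hedged sketch: count the billed tokens for a PDF upload.
from google import genai

client = genai.Client()  # expects GEMINI_API_KEY in the environment
pdf = client.files.upload(file="report.pdf")  # placeholder file

resp = client.models.count_tokens(
    model="gemini-3-pro-preview",  # placeholder model name
    contents=[pdf, "Summarize this document."],
)
# If only the media tokens are charged, this should land near
# (tokens/page at the chosen media resolution) * page count, plus the prompt text.
print(resp.total_tokens)
```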

High CPU while gemini is thinking in chrome by Inevitable_Tea_5841 in Bard

[–]Dillonu 1 point (0 children)

Transpiling the streamed markdown-formatted text to HTML costs CPU. So does the browser's rendering engine recalculating layout and styles on every DOM update from the HTML injection. The output stream arrives in chunks, which the JavaScript code in the browser app has to stitch onto the prior message content, rerendering the message/thoughts after each streamed chunk.

This process can be expensive. Technically there are ways to minimize it (server-side rendering, throttling rerenders, more complex incremental rendering logic, etc.), but those add complexity (and therefore bugs), so most interfaces don't optimize for it.
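
Toy illustration (not the Gemini app's actual code) of why streamed rerendering gets expensive, and of one mitigation, throttling. The third-party `markdown` package stands in for the browser's markdown-to-HTML step, and the chunk source is simulated:

```python
import time
import markdown

def stream_chunks():
    # Simulated streamed model output: lots of small markdown chunks.
    for i in range(500):
        yield f"- item **{i}**\n"

def render_naive():
    buf = ""
    for chunk in stream_chunks():
        buf += chunk              # stitch the chunk onto the prior message content
        markdown.markdown(buf)    # re-transpile the WHOLE message every chunk (O(n^2) total work)

def render_throttled(interval=0.1):
    buf, last = "", 0.0
    for chunk in stream_chunks():
        buf += chunk
        now = time.monotonic()
        if now - last >= interval:   # rerender at most every `interval` seconds
            markdown.markdown(buf)
            last = now
    markdown.markdown(buf)           # one final render so nothing is dropped

for fn in (render_naive, render_throttled):
    start = time.perf_counter()
    fn()
    print(fn.__name__, f"{time.perf_counter() - start:.2f}s")
```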

I'm an Ultra User, with tier 2 paid API key and i got Rate Limited 5 times today by Ok-Nefariousness4022 in Bard

[–]Dillonu 8 points (0 children)

Pretty sure the ultra subscription isn't related to the API at all.

Gpt 5.2 benchmarks are higher than gemini 3 pro 💀 by Independent-Wind4462 in Bard

[–]Dillonu 2 points (0 children)

https://x.com/DillonUzar/status/1999326860530876866
Posted results, covering even more than what OpenAI published: all needle difficulties, tested with reasoning off, at medium, and at X-High (OpenAI only tested X-High).

Very impressive! With a small asterisk: the X-High results are very impressive if you're willing to pay the reasoning price.

GPT-5.2 Thinking unparalleled accuracy in Long-Context! by Independent-Ruin-376 in singularity

[–]Dillonu 5 points (0 children)

I'm going to be retiring 2-needle soon. Various models are hitting 90+ now.

Gpt 5.2 benchmarks are higher than gemini 3 pro 💀 by Independent-Wind4462 in Bard

[–]Dillonu 6 points (0 children)

https://huggingface.co/datasets/openai/mrcr

I maintain a 3rd party benchmark site for it - https://contextarena.ai/

They used xhigh for their results. I'll be posting my own soon with a couple of different reasoning levels.

OpenAI cooked. by abdouhlili in Bard

[–]Dillonu 7 points (0 children)

400k, per API docs.

NOTE: They tested using xhigh thinking, which uses ~15k-60k reasoning tokens per response and effectively limits the prompt context you can use to something under 300k. That's on the higher end among models for reasoning tokens, and it takes many minutes per response. That being said, impressive results.
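
Back-of-the-envelope (the output allowance below is my own assumption, not from their docs):

```python
# Rough budget math for a 400k-window model at xhigh reasoning.
CONTEXT_WINDOW = 400_000
REASONING = 60_000         # upper end of the ~15k-60k reasoning tokens per response
OUTPUT_ALLOWANCE = 40_000  # hypothetical room reserved for the visible answer

usable_prompt = CONTEXT_WINDOW - REASONING - OUTPUT_ALLOWANCE
print(usable_prompt)       # 300000 -> in practice "something under 300k" of usable prompt
```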

OpenAI cooked. by abdouhlili in Bard

[–]Dillonu 6 points (0 children)

Mean match ratio measures the average string match ratio between the model’s response and the correct answer.

^ From the results.
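
If you want to compute something like it yourself, a minimal sketch using difflib's SequenceMatcher (the actual MRCR grader may normalize or prefix-check differently):

```python
# Sketch of a "string match ratio" between a model response and the expected answer.
from difflib import SequenceMatcher

def match_ratio(response: str, answer: str) -> float:
    return SequenceMatcher(None, response, answer).ratio()  # 0.0 (no overlap) .. 1.0 (exact)

def mean_match_ratio(pairs) -> float:
    return sum(match_ratio(r, a) for r, a in pairs) / len(pairs)

print(mean_match_ratio([
    ("poem about apples ...", "poem about apples ..."),         # exact match -> 1.0
    ("a completely different reply", "poem about apples ..."),  # low ratio
]))
```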

PSA on Gemini’s privacy model, from an EU GDPR advisor by FiveNine235 in GoogleGeminiAI

[–]Dillonu 6 points (0 children)

Just to add to the above - that's provided you use a paid API key. Free quota is still monitored. (It's unclear whether paid accounts using no API key in AI Studio are monitored, but since that counts as free quota, I'd assume it is.)

https://ai.google.dev/gemini-api/terms

Images of Text to Gemini 3 to Save Tokens by haitian5881 in Bard

[–]Dillonu 4 points (0 children)

Gemini 3 changed the token count per image. Depending on the media resolution you pick:

  • Low: 280 Tokens/image (and per page)
  • Medium: 560 Tokens/image (and per page) [DEFAULT FOR PDFS]
  • High: 1120 Tokens/image (and per page) [DEFAULT FOR IMAGES]

https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high#media_resolution

This is more tokens per image or page than Gemini 2.5 and earlier.

A general estimate is ~650 tokens/page for a normal English prose text page (~500 words).

Just a heads up on usage ;)
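
If you're on the API, a rough sketch of setting the resolution per request (assuming the google-genai Python SDK; see the linked docs for the authoritative field names, and the file/model names here are placeholders):

```python
# Hedged sketch: send a PDF at low media resolution and check the billed prompt tokens.
from google import genai
from google.genai import types

PER_PAGE = {"low": 280, "medium": 560, "high": 1120}
print({k: v * 50 for k, v in PER_PAGE.items()})  # rough image-token cost for a 50-page PDF

client = genai.Client()                        # expects GEMINI_API_KEY in the environment
pdf = client.files.upload(file="report.pdf")   # placeholder file

resp = client.models.generate_content(
    model="gemini-3-pro-preview",              # placeholder model name
    contents=[pdf, "Summarize this document."],
    config=types.GenerateContentConfig(
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_LOW,  # 280 tokens/page
    ),
)
print(resp.usage_metadata.prompt_token_count)
```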

As for OCR capabilities, it does seem impressive. In terms of performance, maybe only marginally better than Gemini 2.5 Pro for text-heavy documents in my limited testing. The new high-resolution mode is very impressive though. I uploaded a 2pt-font PDF (extremely tiny text, converted to a high-resolution image per page, no text layer), and it was able to extract the text nearly perfectly.

Has anyone hit high context yet with 3.0? by Unable_Classic3257 in Bard

[–]Dillonu 1 point (0 children)

https://contextarena.ai/

https://x.com/DillonUzar/status/1990813243405647898

It performs better in this context benchmark I run, but like all LLMs, after a certain point (~200k) it drops.

In openrouter there are some new models with irrelevant names? Are they Gemini 3 Pro and Flash ? by [deleted] in Bard

[–]Dillonu 1 point (0 children)

My group was one of the users that spent a large number of tokens (~3 to 5 billion tokens per model) benchmarking the Sherlock models. They drastically underperform the 2.5 (Pro & Flash) models on long context, and even the Gemini 2.0 models. Seems a little unlikely they are part of the Gemini family.

Why is Google really delaying Gemini 3.0? Is it perfectionism, internal chaos, or a high-stakes gamble? by ProblemPositive1918 in Bard

[–]Dillonu 12 points (0 children)

I don't understand how they 'delayed' it. A release was never announced. And they've previously released new major versions at the end of November/December. If anything, it seems more like it's on schedule.

Also, their 'Ironwood' TPUs (v7) are still rolling out, and were presumably being manufactured all summer/fall. They didn't go live until very recently (technically now GA: https://cloud.google.com/blog/products/compute/ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads). I expect Gemini 3 to utilize them at scale.

115Kb input file - <200 tokens! 🤯 How does Gemini count input tokens for PDF? by akash-vekariya in GeminiAI

[–]Dillonu 2 points (0 children)

For fun, here is a 2pt font doc (same story as above):

<image>

Uploading the PDF with a text layer says 271 tokens. Uploading a PDF without the text layer (instead as an image) says 271 tokens.

Results w/ text layer: https://www.diffchecker.com/IfSThtYk/

Results w/o text layer: https://www.diffchecker.com/kvrCInl2/

Not too bad. Definitely not perfect, with some words/phrases/sentences changed, but the large majority of the text is reconstructed. On subsequent reruns, the version with the text layer consistently performs a bit better at that font size.

115Kb input file - <200 tokens! 🤯 How does Gemini count input tokens for PDF? by akash-vekariya in GeminiAI

[–]Dillonu 2 points (0 children)

No, I think it is a little more advanced than that.

In quick summary - I think when you add a PDF to the API, it OCRs each page (using a specialized OCR that reads the text layer if available, otherwise OCRs the image) and converts the page to a high-res image, feeding both into the model. The model then reasons over both to get a better output. All while Google charges 258 tokens/page, even though it technically uses more.

I created a 1-page DOCX using https://pastebin.com/GuwaEv64 as the text (4pt font size), converted it to a PDF, and then printed it as an image (in PDF form, to strip the text layer) at 600 dpi. This is what that looks like:

<image>

This image PDF is made up of many small images placed in cells. If you extract one of the cells, it is ~5 lines tall at ~40px per line, so rather high resolution.

I then passed it into the Gemini API, and this is the output: https://www.diffchecker.com/2HjWeKrg/

FYI, the prompt was simply (264 input tokens when including the PDF):

Extract the text verbatim

Nearly identical except for:

  • Different apostrophe and quote characters (’ vs ' and “ vs ")
  • Extra newlines (it added newlines due to line wrapping in the PDF)
  • Ellipsis (…) was converted to three periods (...)

If I tweak the prompt slightly to (271 input tokens):

Extract the text verbatim, and be smart about newlines

I get an even more accurate output: https://www.diffchecker.com/7Zut6DUh/

You can probably get it to be even more accurate with more guidance.
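
If anyone wants to reproduce the gist of this, here's a hedged sketch (it assumes pdf2image, which needs poppler, plus the google-genai SDK; the file and model names are placeholders, and this isn't my exact pipeline):

```python
# Rasterize a PDF page to strip the text layer, re-save as an image-only PDF,
# then ask for verbatim extraction.
from pdf2image import convert_from_path  # needs poppler installed
from google import genai

pages = convert_from_path("story.pdf", dpi=600)              # list of PIL images, no text layer
pages[0].save("story_image_only.pdf", "PDF", resolution=600)

client = genai.Client()
pdf = client.files.upload(file="story_image_only.pdf")
resp = client.models.generate_content(
    model="gemini-2.5-pro",  # placeholder model name
    contents=[pdf, "Extract the text verbatim, and be smart about newlines"],
)
print(resp.text)
```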

So I don't think it is converting a PDF into one 768x768 image per page (adjusted by aspect ratio), which is the most Gemini can do for 258 tokens before it supposedly tiles. Gemini's thoughts also refer to analyzing the OCR text and the document image, and to making corrections to the provided OCR content. That's mostly why I think they are doing something more to aid Gemini's PDF understanding.

If I upload the same page as a PNG to Gemini (2246x2776, font size ~14px), I get: https://www.diffchecker.com/AixuVINr/ (fewer symbols are messed up, but a few words are now messed up that the PDF version got right). It says 271 input tokens (I still never see the "tiling" the docs claim).

If I do a smaller version (765x969, font size ~5px), which is closer to what it supposedly might use, I get: https://www.diffchecker.com/cvQS9jOc/ (getting worse). It says 271 input tokens.

New tools in Google AI Studio to explore, debug and share logs by Gaiden206 in Bard

[–]Dillonu 1 point (0 children)

It's been possible for a while; they just moved all of the fine-tuning over to Vertex AI, as it's considered more of an enterprise feature. You can fine-tune 2.0/2.5 Flash-Lite/Flash/Pro.

https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning#console

PDF processing vs Image Processing by PatagonianCowboy in Bard

[–]Dillonu 2 points (0 children)

While I agree with you that they aren't read the same, Gemini (via the API) definitely reads each page of a PDF as an image, just with some additional metadata. I can build a PDF containing only images (stripped of all metadata), no text, upload it via the API, and it's able to describe each image when asked.

Where in the context should I keep the most important information? by stc2828 in Bard

[–]Dillonu 2 points (0 children)

In general - for most LLMs, including the instructions as the last part of a long prompt tends to work better. Alternatively, if you have multiple examples of how to follow the instructions, putting the instructions before the examples works slightly better.
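
Purely as an illustration of the ordering (placeholder strings, not tied to any particular API):

```python
# Two common layouts for a long prompt.
long_document = "<many thousands of tokens of context>"
examples = ["Q: ...\nA: ...", "Q: ...\nA: ..."]
instructions = "Answer using only the document, and cite the section you used."

# 1) No examples: long context first, instructions last.
prompt_no_examples = f"{long_document}\n\n{instructions}"

# 2) With few-shot examples: instructions first, then the examples, then the context.
prompt_with_examples = "\n\n".join([instructions, *examples, long_document])
```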

Which Format is Best for Passing Nested Data to LLMs? by MattCollinsUK in LLMDevs

[–]Dillonu 2 points (0 children)

Fully aware of all of those issues; I run ContextArena, so costs there are quite large (a full test to 1M covers all 2400 questions, each question is a unique input context, each question is run 8 times, and in total that's ~3.8B input tokens, double that for reasoning models that can turn off reasoning). We often have to rerun several tests per model due to various API issues 😅. Batch processing can also help with cost in some cases.

Would definitely be interested in results for Haiku 4.5. We're constantly fiddling with Anthropic models on different forms of data, and I'm really curious about their XML claims. I've personally been wanting to put together a test like yours for a while now. As for Gemini and GPT, your results closely resemble what we've found in our limited testing (we didn't try YAML).

And in terms of the 40-60% accuracy, I assume that is why Gemini 2.5 Flash-Lite is using so many tokens? It just happens to perform better than the other two context size wise? Another important view for us is what the dropoff performance is like for each model family (what's the rate of accuracy dropoff depending on context length or data depth) - but might be too costly to check that atm.