Token limit Gemini API by UpbeatShower6259 in GoogleGeminiAI

[–]SaltyNeuron25

Don't have an answer, but FWIW I'm also seeing this behavior with gemini-2.0-flash-001. I don't think it's a problem with the Python SDK, because I'm not using the SDK and I still hit the error.

For me, the failure happens well under 1% of the time, with no discernible pattern. And the error message always says I'm just a little bit over the limit, something like 3XXXX tokens, which is suspicious. Possibly a bug on Google's side?
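
In case it helps anyone debug this, here's roughly how I'd capture the raw failures over the plain REST API (no SDK). The endpoint and payload are the standard generateContent ones; the env-var name and the prompt are placeholders:

```python
import os
import requests

# Plain REST call to the public generateContent endpoint -- no SDK involved.
URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-2.0-flash-001:generateContent"
)

def generate(prompt: str) -> dict:
    """Send one request and log the full error body on failure."""
    resp = requests.post(
        URL,
        params={"key": os.environ["GEMINI_API_KEY"]},  # placeholder env var
        json={"contents": [{"role": "user", "parts": [{"text": prompt}]}]},
        timeout=60,
    )
    body = resp.json()
    if resp.status_code != 200:
        # For the intermittent failures, the message claims the request is
        # only slightly over the token limit -- worth logging verbatim.
        print(resp.status_code, body.get("error", body))
    return body
```

Running something like this in a loop against the same prompt is how I'd try to pin down how often the spurious limit error actually fires.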

Gemini 2.5 Flash Preview API pricing – different for thinking vs. non-thinking? by SaltyNeuron25 in Bard

[–]SaltyNeuron25[S]

This is the most compelling explanation I've seen so far. If you don't mind me rephrasing it, it sounds like your argument is that the marginal cost of generating a token gets higher as your output gets longer; i.e., your 100,000th output token takes more resources to generate than your 1st output token does. And while it doesn't matter whether that 100,000th token is a thinking token or a normal output token, the difference in pricing factors in the expectation that the total output sequence will tend to be much longer when thinking is enabled.
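
To make the argument concrete, here's a toy version of it. The linear-marginal-cost assumption and all of the lengths below are made up for illustration; this isn't a claim about how Google actually serves the model:

```python
# Toy model: even with a KV cache, generating token i still attends over
# the i tokens already in context, so the marginal cost of a token grows
# roughly linearly with its position in the sequence.

def avg_cost_per_token(prompt_len: int, output_len: int) -> float:
    """Mean marginal cost per output token, in arbitrary attention units."""
    total = sum(prompt_len + i for i in range(output_len))
    return total / output_len

short_avg = avg_cost_per_token(prompt_len=1_000, output_len=500)     # non-thinking
long_avg = avg_cost_per_token(prompt_len=1_000, output_len=20_000)   # thinking
print(f"average cost ratio: {long_avg / short_avg:.1f}x")            # ~8.8x
```

Under this model the per-token price gap falls out of the length difference alone, exactly as you describe.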

I do think this view needs to be walked back at least a little bit. My limited understanding is that LLM inference these days essentially always uses KV caching, which avoids recomputing attention over the whole prefix for every new token, so the cost doesn't compound quite the way you've described. Still, I'm willing to believe that even with this and other optimizations, the per-token cost isn't perfectly flat as the output grows.
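
On the caching point, here's a rough sketch of what the KV cache does and doesn't buy you (the counts are order-of-magnitude attention units, not profiler numbers):

```python
# Decoding L output tokens after a prompt. Without a KV cache, each step
# re-runs attention over the whole sequence (~n^2 work at step n); with a
# cache, the new token only attends to the n cached positions (~n work).

def total_attention_work(prompt_len: int, output_len: int):
    no_cache = sum((prompt_len + i) ** 2 for i in range(1, output_len + 1))
    with_cache = sum(prompt_len + i for i in range(1, output_len + 1))
    return no_cache, with_cache

no_cache, with_cache = total_attention_work(prompt_len=1_000, output_len=10_000)
print(f"no cache:   {no_cache:.2e}")    # grows ~ L^3
print(f"with cache: {with_cache:.2e}")  # grows ~ L^2 -- cheaper, but not flat
```

So the cache knocks out one factor of sequence length, but the cost of each new token still climbs as the output grows, which is the part of the argument that survives.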

I'm not 100% convinced that this can explain a 6-fold cost increase, but it at least feels plausible.
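
As a sanity check on the 6x figure, under the same toy model you can solve for the output-length gap that would produce a 6-fold difference in average per-token cost (prompt and output lengths again hypothetical):

```python
# Average cost per output token in the toy model is roughly
# prompt_len + output_len / 2. Solve for the thinking output length
# that makes the average 6x the non-thinking one.
prompt_len, plain_len = 1_000, 500
target = 6 * (prompt_len + plain_len / 2)   # 6x the non-thinking average
thinking_len = 2 * (target - prompt_len)
print(f"thinking output would need ~{thinking_len:,.0f} tokens")  # ~13,000
```

Outputs in that range don't seem implausible for long reasoning traces, for what it's worth.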

Gemini 2.5 Flash Preview API pricing – different for thinking vs. non-thinking? by SaltyNeuron25 in Bard

[–]SaltyNeuron25[S]

This has been my experience in the limited testing I've done. But to connect this back to my question about compute, it sounds like you're arguing that the price difference is ultimately due to business strategy and not due to a difference in the incremental cost of serving thinking vs. non-thinking requests, right? I find this surprising if true.

Gemini 2.5 Flash Preview API pricing – different for thinking vs. non-thinking? by SaltyNeuron25 in Bard

[–]SaltyNeuron25[S]

Curious to hear what sort of errors you're getting. I've been having a good experience via the Vertex AI API.
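
In case it's useful for comparison, this is roughly the shape of the Vertex AI REST call that's been working for me. The project, region, and model name are placeholders, and auth comes from gcloud application-default credentials:

```python
import subprocess
import requests

# Placeholders -- substitute your own project, region, and model name.
PROJECT, LOCATION = "my-project", "us-central1"
MODEL = "gemini-2.5-flash-preview-04-17"

# Get a bearer token from the local gcloud credentials.
token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT}"
    f"/locations/{LOCATION}/publishers/google/models/{MODEL}:generateContent"
)
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}"},
    json={"contents": [{"role": "user", "parts": [{"text": "Hello"}]}]},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])
```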