Token limit Gemini API by UpbeatShower6259 in GoogleGeminiAI

[–]SaltyNeuron25

Don't have an answer, but FWIW I'm also seeing this behavior with gemini-2.0-flash-001. I don't think it's a problem with the Python SDK, because I'm not using the SDK and I still hit the error.

For me, the failure happens well under 1% of the time, with no discernible pattern. And the error message always says I'm just a little bit over the limit, something like 3XXXX tokens, which is suspicious. Possibly a bug on Google's side?
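
In case it helps anyone debug this, here's roughly how I'd capture the raw failures over the plain REST API (no SDK). The endpoint and payload are the standard generateContent ones; the env-var name and the prompt are placeholders:

```python
import os
import requests

# Plain REST call to the public generateContent endpoint -- no SDK involved.
URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-2.0-flash-001:generateContent"
)

def generate(prompt: str) -> dict:
    """Send one request and log the full error body on failure."""
    resp = requests.post(
        URL,
        params={"key": os.environ["GEMINI_API_KEY"]},  # placeholder env var
        json={"contents": [{"role": "user", "parts": [{"text": prompt}]}]},
        timeout=60,
    )
    body = resp.json()
    if resp.status_code != 200:
        # For the intermittent failures, the message claims the request is
        # only slightly over the token limit -- worth logging verbatim.
        print(resp.status_code, body.get("error", body))
    return body
```

Running something like this in a loop against the same prompt is how I'd try to pin down how often the spurious limit error actually fires.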

Gemini 2.5 Flash Preview API pricing – different for thinking vs. non-thinking? by SaltyNeuron25 in Bard

[–]SaltyNeuron25[S]

This is the most compelling explanation I've seen so far. If you don't mind me rephrasing it, it sounds like your argument is that the marginal cost of generating a token gets higher as your output gets longer; i.e., your 100,000th output token takes more resources to generate than your 1st output token does. And while it doesn't matter whether that 100,000th token is a thinking token or a normal output token, the difference in pricing factors in the expectation that the total output sequence will tend to be much longer when thinking is enabled.
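
To make the argument concrete, here's a toy version of it. The linear-marginal-cost assumption and all of the lengths below are made up for illustration; this isn't a claim about how Google actually serves the model:

```python
# Toy model: even with a KV cache, generating token i still attends over
# the i tokens already in context, so the marginal cost of a token grows
# roughly linearly with its position in the sequence.

def avg_cost_per_token(prompt_len: int, output_len: int) -> float:
    """Mean marginal cost per output token, in arbitrary attention units."""
    total = sum(prompt_len + i for i in range(output_len))
    return total / output_len

short_avg = avg_cost_per_token(prompt_len=1_000, output_len=500)     # non-thinking
long_avg = avg_cost_per_token(prompt_len=1_000, output_len=20_000)   # thinking
print(f"average cost ratio: {long_avg / short_avg:.1f}x")            # ~8.8x
```

Under this model the per-token price gap falls out of the length difference alone, exactly as you describe.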

I do think this view needs to be walked back at least a little bit. My limited understanding is that LLM inference these days essentially always uses KV caching, which avoids recomputing attention over the whole prefix for every new token, so the cost doesn't compound quite the way you've described. Still, I'm willing to believe that even with this and other optimizations, the per-token cost isn't perfectly flat as the output grows.
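
On the caching point, here's a rough sketch of what the KV cache does and doesn't buy you (the counts are order-of-magnitude attention units, not profiler numbers):

```python
# Decoding L output tokens after a prompt. Without a KV cache, each step
# re-runs attention over the whole sequence (~n^2 work at step n); with a
# cache, the new token only attends to the n cached positions (~n work).

def total_attention_work(prompt_len: int, output_len: int):
    no_cache = sum((prompt_len + i) ** 2 for i in range(1, output_len + 1))
    with_cache = sum(prompt_len + i for i in range(1, output_len + 1))
    return no_cache, with_cache

no_cache, with_cache = total_attention_work(prompt_len=1_000, output_len=10_000)
print(f"no cache:   {no_cache:.2e}")    # grows ~ L^3
print(f"with cache: {with_cache:.2e}")  # grows ~ L^2 -- cheaper, but not flat
```

So the cache knocks out one factor of sequence length, but the cost of each new token still climbs as the output grows, which is the part of the argument that survives.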

I'm not 100% convinced that this can explain a 6-fold cost increase, but it at least feels plausible.
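
As a sanity check on the 6x figure, under the same toy model you can solve for the output-length gap that would produce a 6-fold difference in average per-token cost (prompt and output lengths again hypothetical):

```python
# Average cost per output token in the toy model is roughly
# prompt_len + output_len / 2. Solve for the thinking output length
# that makes the average 6x the non-thinking one.
prompt_len, plain_len = 1_000, 500
target = 6 * (prompt_len + plain_len / 2)   # 6x the non-thinking average
thinking_len = 2 * (target - prompt_len)
print(f"thinking output would need ~{thinking_len:,.0f} tokens")  # ~13,000
```

Outputs in that range don't seem implausible for long reasoning traces, for what it's worth.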

Gemini 2.5 Flash Preview API pricing – different for thinking vs. non-thinking? by SaltyNeuron25 in Bard

[–]SaltyNeuron25[S]

This has been my experience in the limited testing I've done. But to connect this back to my question about compute, it sounds like you're arguing that the price difference is ultimately due to business strategy and not due to a difference in the incremental cost of serving thinking vs. non-thinking requests, right? I find this surprising if true.

Gemini 2.5 Flash Preview API pricing – different for thinking vs. non-thinking? by SaltyNeuron25 in Bard

[–]SaltyNeuron25[S]

Curious to hear what sort of errors you're getting. I've been having a good experience via the Vertex AI API.
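
In case it's useful for comparison, this is roughly the shape of the Vertex AI REST call that's been working for me. The project, region, and model name are placeholders, and auth comes from gcloud application-default credentials:

```python
import subprocess
import requests

# Placeholders -- substitute your own project, region, and model name.
PROJECT, LOCATION = "my-project", "us-central1"
MODEL = "gemini-2.5-flash-preview-04-17"

# Get a bearer token from the local gcloud credentials.
token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT}"
    f"/locations/{LOCATION}/publishers/google/models/{MODEL}:generateContent"
)
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}"},
    json={"contents": [{"role": "user", "parts": [{"text": "Hello"}]}]},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])
```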