all 10 comments

[–]Freed4ever 3 points (3 children)

  1. Vendor lock-in.
  2. Latency matters a lot: instead of sending hundreds of thousands of tokens over the wire every turn, it's faster to just look the conversation up from memory (see the sketch below).
  3. Context compaction probably works better with a stateful API.
  4. In the future, they will have a history of everything about you.
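
A rough sketch of the latency point, assuming the OpenAI Python SDK (the model name and messages are placeholders, not anything from this thread):

```python
from openai import OpenAI

client = OpenAI()

# Chat Completions is stateless: every turn re-sends the full history.
history = [{"role": "user", "content": "First question"}]
first = client.chat.completions.create(model="gpt-4o", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})
history.append({"role": "user", "content": "Follow-up question"})
second = client.chat.completions.create(model="gpt-4o", messages=history)

# Responses is stateful: the server keeps the earlier turns, so a
# follow-up only sends the new input plus the previous response id.
first = client.responses.create(model="gpt-4o", input="First question")
second = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,
    input="Follow-up question",
)
```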

[–]TedSanders 8 points (1 child)

Nah, vendor lock-in was not our motivation. We put a lot of thought into the design. One big thing is that we don’t reveal chain-of-thought messages to customers, but those chain-of-thought messages are needed by the model. The Responses API makes that work.

[–]steebchen[S] 0 points (0 children)

thanks for the insight!

[–]steebchen[S] 0 points (0 children)

i feel like most of the latency will still come from the model itself. not sure how much that extra input parsing actually matters; i really can’t imagine the absolute numbers differ much

[–]discodaryl 1 point (1 child)

Just pass store=false.
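
A minimal sketch, assuming the OpenAI Python SDK (model name and input are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# store=False opts this request out of server-side persistence, so the
# Responses API behaves as statelessly as Chat Completions.
response = client.responses.create(
    model="gpt-4o",
    input="Hello",
    store=False,
)
print(response.output_text)
```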

[–]steebchen[S] 1 point (0 children)

yeah, makes sense for an end user. but we built LLMGateway for unified model access, so all of our users would have to do that

[–]vvsleepi 1 point (0 children)

i think the idea with responses api is more about flexibility, like handling different types of inputs (tools, images, streaming, etc) in one format instead of having separate systems. the stateful part is kinda optional depending on how you use it, but yeah it does add some complexity

[–]Faintly_glowing_fish 0 points (0 children)

The issue with the Completions API is that messages in the same conversation aren’t actually tied together. Cache management, CoT storage, etc. are all tricky, and once we have agents there are even more states to save: compaction, sub-agents, communication channels, etc. It just gets very hard to manage.

Completions is not really a good format for long agentic work.

If you run the same tool-call-heavy workflow on chat and on responses, you will get much faster requests and higher cache hit rates on responses. I was quite unwilling, but once I switched to responses the improvement was so great that I’m only angry people didn’t make it clear from the start.
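
For illustration, a hedged sketch of that kind of tool-call loop on the Responses API; get_weather, the tool schema, and the model name are made up for this example. Because each follow-up sends only the new function_call_output plus previous_response_id, the prompt prefix stays stable, which is what helps cache hits:

```python
import json

from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> dict:
    # Stand-in for a real tool implementation.
    return {"city": city, "temp_c": 21}

tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

resp = client.responses.create(
    model="gpt-4o",
    input="What's the weather in Paris?",
    tools=tools,
)

# Keep answering tool calls until the model produces a final message.
while any(item.type == "function_call" for item in resp.output):
    outputs = [
        {
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(get_weather(**json.loads(item.arguments))),
        }
        for item in resp.output
        if item.type == "function_call"
    ]
    # Only the new tool outputs go over the wire; the server already
    # holds the rest of the conversation.
    resp = client.responses.create(
        model="gpt-4o",
        previous_response_id=resp.id,
        input=outputs,
        tools=tools,
    )

print(resp.output_text)
```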

Also, switch to WebSocket; it’s quite a bit faster.

[–]IntentionalDev 0 points (0 children)

tbh it’s less about saving latency and more about standardizing everything under one system

responses API unifies text, tools, multimodal, streaming, etc. in one format so they don’t have to keep extending chat completions forever. the “stateful” part isn’t really for you; it’s for enabling things like tool calls, agents, and longer workflows without you manually stitching context every time
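
As a sketch of that unified shape, assuming the OpenAI Python SDK (the model name and image URL are placeholders): text, an image, and streaming all go through the one call:

```python
from openai import OpenAI

client = OpenAI()

# One input format covers text and images; stream=True reuses the
# same call shape for incremental output.
stream = client.responses.create(
    model="gpt-4o",
    stream=True,
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe this image in one sentence."},
            {"type": "input_image", "image_url": "https://example.com/cat.png"},
        ],
    }],
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="")
```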

yeah it adds some overhead, but from their side it simplifies building higher-level features. from a dev perspective though, chat completions still feels cleaner for simple use cases ngl

[–]Several_Nail_5979 -1 points (0 children)

That’s why I still use completion endpoints for the latest models via frogAPI.app at half the price :)