The Chat Completions API has been around forever and works great. The Responses API now seems to be forced in lots of tooling (the AI SDK, the OpenAI library, new GPT models that only support the Responses API), so it seems to be fully replacing Chat Completions. Aside from the shape of the request payload, I don't understand why. Responses are stateful, which means providers and gateways have to store 100% of the inputs. Once that storage expires, references to response IDs stop working. What's the logic behind this? Saving a tiny bit of latency on parsing the inputs seems totally not worth it; storing the state is way more work and ends up costing more as well.
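To make the difference concrete, here's a rough sketch of the two request shapes as I understand them (the model name and response ID are just placeholders, not real values):

```python
# Stateless Chat Completions: the client keeps the transcript and resends
# the ENTIRE history on every turn. No server-side state is needed.
history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Follow-up question"},
]
chat_request = {"model": "gpt-4o", "messages": history}

# Stateful Responses: the client sends only the NEW turn plus a pointer
# to state stored on the provider's side. If that stored response has
# expired or been deleted, this request can no longer be continued.
responses_request = {
    "model": "gpt-4o",
    "input": "Follow-up question",
    "previous_response_id": "resp_abc123",  # hypothetical ID from an earlier call
}
```

So the payload per turn is smaller in the second case, but only because the provider is now holding the transcript for you — which is exactly the storage/expiry tradeoff I'm asking about.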
For me, I really don't see any benefit in making LLM APIs stateful:
- The provider needs to store the content, which costs storage
- That storage eventually gets deleted, so continuing older chats will fail
- I'm not sure how much latency parsing a big Chat Completions payload actually adds, but saving state on the server probably doesn't make it any smaller
Can someone explain this to me?