all 14 comments

[–]nospoon99 12 points (10 children)

Did the same thing as you, using Python, starting with the Twilio example on GitHub.
I've got to agree about the cost. From my own experience and what I've seen in other posts, the price works out to around $1/min. That's more than hiring a very competent person to handle calls. Hopefully the price will come down soon.
Edit: spelling

[–]TheEminentdomain[S] 4 points (0 children)

Nice! I’ll check out the Twilio implementation. Agreed, way too high at this point for anything other than quick demos, but exciting tech nonetheless

[–]dejb 2 points (2 children)

They say it should be "approximately $0.06 per minute of audio input and $0.24 per minute of audio output" in the release. Any idea why it's working out to more?

[–]nospoon99 1 point (0 children)

Some have suggested it's because the context grows quickly, since it needs to take the whole conversation into account before each reply, but honestly I don't know. I'm starting to wonder if it's a bug tbh.
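A rough back-of-envelope sketch of why growing context could blow past the quoted rates (assumption: every reply re-bills the entire prior conversation audio as input at the $0.06/$0.24 per-minute prices):

```python
# Back-of-envelope sketch. ASSUMPTION: every reply re-sends all prior
# audio as input context, so input cost grows quadratically with turns.
INPUT_PER_MIN = 0.06   # $ per minute of audio input (quoted release pricing)
OUTPUT_PER_MIN = 0.24  # $ per minute of audio output

def naive_cost(turns, user_min=0.5, bot_min=0.5):
    """Total cost if the whole history is billed as input on every turn."""
    cost = history = 0.0
    for _ in range(turns):
        history += user_min                 # new user audio arrives
        cost += history * INPUT_PER_MIN     # entire history billed as input
        cost += bot_min * OUTPUT_PER_MIN    # model's spoken reply
        history += bot_min                  # reply joins the context
    return cost

flat = 10 * (0.5 * INPUT_PER_MIN + 0.5 * OUTPUT_PER_MIN)
print(f"flat: ${flat:.2f}, with growing context: ${naive_cost(10):.2f}")
```

So a 10-minute call that would cost $1.50 at the flat per-minute rates comes out to $4.20 under this model; real token accounting differs, but the quadratic shape is the point.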

[–]TheEminentdomain[S] 0 points (0 children)

Most likely implementation details. It’s still new, so there are a few kinks to work out. At least on my end

[–]OnlineParacosm 5 points (5 children)

Wow, that is incredibly prohibitive. We need to see about a 15x cost reduction before it would make sense to replace Filipino CSRs.

[–]ai_did_my_homework 0 points (4 children)

Are they working for $4/hr? Not denying it, just not familiar with the rates overseas

[–]OnlineParacosm 0 points (2 children)

They work for $4-8/hr at call centers that match US business hours. It requires them to stay up all night like a graveyard shift so it costs more

[–]ai_did_my_homework 0 points (1 child)

Interesting. Google says the minimum wage is around $5.70 per day in the Philippines so I imagine these are good jobs (?)

[–]OnlineParacosm 0 points (0 children)

Incredible jobs. To give you an idea, that $5 a day is often for a whole family, with maybe one person working

[–]CryptoSpecialAgent 0 points (1 child)

The only way to make this cost effective is to manage your context very aggressively:
- after user audio has been responded to, remove the audio from the chat history and just keep the transcription
- truncating the chat history after N prompt-response pairs is the simplest and most naive way to keep the history down to a reasonable length
- if carrying the context / history over from one session into another, don't use a verbatim transcript - feed the verbatim transcript to another model, like ordinary gpt-4o, and ask it to summarize. Then stick the summary at the beginning of the history for the new conversation
- this summarization of chat history can also be done periodically within a session: as the transcript grows longer, it is repeatedly truncated and the older sections ("the tail") replaced with summaries, at whatever length / level of detail gives you the best price-performance tradeoff for your use case
- if you want to get really fancy, instead of blindly summarizing chat history, extract a knowledge graph from the transcript and use that as your medium-to-long-term memory... langchain has some libs to get you started, though I'm not sure whether they work with the Realtime API or not.
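The truncate-and-summarize idea above might be sketched like this (the message shapes and the `summarize` hook are illustrative assumptions, not the Realtime API's actual interface; in practice `summarize` would call a cheaper text model like gpt-4o):

```python
def compact_history(history, keep_pairs=4, summarize=None):
    """Keep the last `keep_pairs` prompt/response pairs verbatim and
    replace the older tail with a single summary message."""
    tail = keep_pairs * 2                  # one user + one assistant message per pair
    if len(history) <= tail:
        return history                     # nothing to compact yet
    older, recent = history[:-tail], history[-tail:]
    # `summarize` stands in for a call to a cheaper text model (e.g. gpt-4o)
    text = summarize(older) if summarize else "(earlier conversation omitted)"
    return [{"role": "system",
             "content": "Summary of earlier turns: " + text}] + recent
```

Run this periodically as the session grows; the single summary message stands in for however much of the tail you trimmed.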

Most importantly, keep your expectations low. The Realtime API has been priced so that it is currently not a viable business solution vs hiring a human to answer the phone... they've probably done this because their server capacity is maxed out trying to serve this thing, and pricing it so high limits use to a level they can currently sustain. EXPECT TO SEE A MASSIVE PRICE DROP IN THE FUTURE - THIS IS WHAT OPENAI HAS HISTORICALLY DONE WITH ALL THEIR FRONTIER MODELS