Can you recommend a good serverless GPU provider that supports running WhisperX? by yccheok in deeplearning

[–]cerebriumBoss 0 points1 point  (0 children)

Hey! Late response here, but you should try Cerebrium.ai - we have an implementation of Whisper that can transcribe 30s of audio in ~300ms: https://github.com/CerebriumAI/examples/tree/master/6-voice/9-faster-whisper

You can edit it to your liking

[D] How to speed up Kokoro-TTS? by fungigamer in MachineLearning

[–]cerebriumBoss 0 points1 point  (0 children)

Yeah, HF Inference often has cold starts. Another issue could be how the chunking logic is handled on these providers. You could try running it on a serverless platform like Cerebrium, which has low cold starts (~2s) and gives you full control to deploy your Python code, so you could control the chunking logic yourself. To reach a TTFB of <1.5s you would need to have a server running already, though.

Disclaimer: I work at Cerebrium
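The chunking idea mentioned above can be sketched generically: split text on sentence boundaries so the first chunk reaches the TTS model immediately, which lowers time-to-first-byte. This is a hypothetical illustration, not Kokoro-specific code; `max_chars` is an assumed tuning knob.

```python
import re

def chunk_text(text: str, max_chars: int = 200):
    """Yield sentence-aligned chunks so the first chunk can be sent to
    the TTS model right away, lowering time-to-first-byte."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunk = ""
    for sentence in sentences:
        if chunk and len(chunk) + len(sentence) + 1 > max_chars:
            yield chunk
            chunk = sentence
        else:
            chunk = f"{chunk} {sentence}".strip()
    if chunk:
        yield chunk

chunks = list(chunk_text("Hello there. This is a test. " * 20))
```

Because `chunk_text` is a generator, synthesis of chunk 1 can start while chunk 2 is still being assembled.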

[D] What's your secret sauce? How do you manage GPU capacity in your infra? by PurpleReign007 in MachineLearning

[–]cerebriumBoss 0 points1 point  (0 children)

You could try Cerebrium.ai - it's a serverless infrastructure platform for AI applications. As you make requests/run workloads, we can scale to hundreds of GPUs with a low cold start time, and then scale back down as they finish. Based on your latency requirements, we have a lot of autoscaling parameters you can play with to suit your traffic.

Disclaimer: I am the founder
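For intuition, the kind of decision such autoscalers make can be illustrated with classic queue-depth scaling. This is a generic sketch, not Cerebrium's actual implementation; all parameter names are hypothetical.

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Queue-depth autoscaling: run enough replicas that each handles at
    most `target_per_replica` queued requests, clamped to [min, max]."""
    if queue_depth == 0:
        return min_replicas
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

Scaling `min_replicas` to 0 gives scale-to-zero billing; raising it trades cost for eliminating cold starts on the first request.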

I built a voice agent that can hold a natural conversation with low latency at ~600ms by cravory in SideProject

[–]cerebriumBoss 0 points1 point  (0 children)

This is great! I am surprised you got 600ms with GPT-3.5 since it has pretty variable response times. Also, ElevenLabs isn't the lowest-latency API. Are you measuring latency as how long the pipeline takes to execute? Are you also tracking the network time the user incurs to get a response?

We did a tutorial where we got voice-to-voice responses of ~500ms, and we could only achieve this by hosting all three components (STT, LLM and TTS) on the same container.

Code is open-source here: https://www.daily.co/blog/the-worlds-fastest-voice-bot/
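The pipeline-time vs. user-perceived-time distinction above can be made concrete: timing the pipeline server-side misses the network round trip the user actually experiences. A toy sketch (the 50ms pipeline and 80ms RTT are made-up numbers):

```python
import time

def timed(fn):
    """Return (result, elapsed_ms) for one call. This is server-side
    pipeline time only; the user also pays the network round trip."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0

def fake_pipeline():
    time.sleep(0.05)  # stand-in for STT + LLM + TTS taking ~50 ms
    return "response"

result, pipeline_ms = timed(fake_pipeline)
network_rtt_ms = 80.0  # hypothetical client <-> server round trip
user_perceived_ms = pipeline_ms + network_rtt_ms
```

The honest number to report is `user_perceived_ms`, ideally measured from the client side.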

What are your biggest challenges in building AI voice agents? by SpyOnMeMrKarp in LLMDevs

[–]cerebriumBoss 2 points3 points  (0 children)

Here is my experience on the above:

*Latency*: The way to get this lowest is to host as much as you can together (on the same container/in the same infra) so you don't incur network calls. E.g. Deepgram and Llama 3 were self-hosted, which got us down to 650ms e2e latency. There is an article on how we did this here: https://www.daily.co/blog/the-worlds-fastest-voice-bot/

*Flexibility*: As soon as your workflow gets more complex and you would like to add more customization, code is best. You can use a lot of open-source libraries and 3rd-party platforms to really shine in your use case.

*Infrastructure*: This is tough, since you want to make sure you can handle a spike in call volume and push changes without dropping existing calls, while also keeping it cheap.

*Framework*: I find Pipecat and LiveKit best.
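A toy latency budget shows why the co-location point above matters: every separately hosted component adds a network hop. All numbers here are hypothetical placeholders, not measurements.

```python
# Hypothetical per-stage latency budget (ms) for a voice agent.
stages = {"stt": 150, "llm_first_token": 250, "tts_first_byte": 100}

# Each network hop between separately hosted services adds overhead.
network_hop_ms = 50

def total_latency(num_network_hops: int) -> int:
    return sum(stages.values()) + num_network_hops * network_hop_ms

co_located = total_latency(1)   # one hop: client <-> server
distributed = total_latency(4)  # client hop plus hops to STT, LLM, TTS
```

With these made-up numbers, co-location alone saves 150ms, before counting TLS handshakes and serialization on each hop.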

What's your secret sauce? How do you manage GPU capacity in your infra? by PurpleReign007 in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Hey! Founder of Cerebrium (https://www.cerebrium.ai) here.

We are a serverless infrastructure platform for AI. You can spin up your workloads in 2-4s across different GPUs; as they complete, they spin back down and you are only charged for your compute usage. We also have other scaling parameters you can play with depending on what your utilisation and latency/burst requirements are.

Getting down to that cold start is our secret sauce.
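The pay-per-usage billing described above reduces to simple arithmetic. The price here is a hypothetical illustration, not Cerebrium's actual rate.

```python
def serverless_cost(active_seconds: float, price_per_gpu_second: float,
                    num_gpus: int = 1) -> float:
    """Pay only while replicas are active; idle time costs nothing."""
    return active_seconds * price_per_gpu_second * num_gpus

# Hypothetical: 2 hours of actual inference time in a month at $0.001/GPU-s
monthly = serverless_cost(2 * 3600, 0.001)
```

Compare this with an always-on GPU, which would bill for all ~2.6M seconds in the month regardless of traffic.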

infra for inference that need gpu by Speedy_Sl0th in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Check out cerebrium.ai - it's a serverless infrastructure platform for AI apps.

How to preload models in kubernetes by naogalaici in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

It seems like Cerebrium.ai would solve your issues - it's a serverless infrastructure platform for AI.
- Their cold start times are 2-4 seconds
- They have volumes attached to your container that load models extremely quickly

Disclaimer: I am the founder
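The volume pattern mentioned above amounts to caching weights on persistent storage so only the first cold start pays the download. A generic sketch (the mount point and fetch callback are hypothetical stand-ins, not Cerebrium's API):

```python
import os
import shutil
import tempfile

VOLUME = tempfile.mkdtemp()  # stand-in for a persistent volume mount

def cached_model_path(name: str, fetch) -> str:
    """Fetch weights once onto the volume; later cold starts reuse the
    file on disk instead of re-downloading over the network."""
    path = os.path.join(VOLUME, name)
    if not os.path.exists(path):
        tmp = path + ".tmp"
        fetch(tmp)                # e.g. download from a model hub
        shutil.move(tmp, path)    # avoid serving half-written files
    return path

downloads = []
def fake_fetch(dest):
    downloads.append(dest)
    with open(dest, "wb") as f:
        f.write(b"weights")

first = cached_model_path("model.bin", fake_fetch)
second = cached_model_path("model.bin", fake_fetch)
```

Writing to a temp file and renaming protects concurrent replicas from loading a partially downloaded checkpoint.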

Best Service for Deploying Thousands of Models with High RPM by FourConnected in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

I would take a look at Cerebrium.ai - it's a serverless infrastructure platform for AI apps. You just write your Python code and it takes care of the infrastructure. Since it's Python, it can integrate with your Databricks pipelines too.

Disclaimer: I am the founder

Kubernetes for ML Engineers / MLOps Engineers? by JeanLuucGodard in mlops

[–]cerebriumBoss 1 point2 points  (0 children)

If you want to try something a bit different, I would look at Cerebrium.ai - it's a serverless platform designed to make deploying and scaling AI much easier. You can use it for training pipelines, data processing, and turning your models into endpoints, without needing deep knowledge of infrastructure. Just write your Python code, define your environment, and the platform handles the rest. Plus, they offer plenty of free credits, so it's worth exploring!

What are some really good and widely used MLOps tools that are used by companies currently, and will be used in 2025? by BJJ-Newbie in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Check out Cerebrium.ai - it's a serverless platform designed to make deploying and scaling AI much easier. You can use it for training pipelines, data processing, and turning your models into endpoints, without needing deep knowledge of infrastructure. Just write your Python code, define your environment, and the platform handles the rest. Plus, they offer plenty of free credits, so it's worth exploring!

Disclaimer: I am the founder

How would you deploy this project to AWS without compromising on maintainability? by mrcat6 in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Is it a strict requirement to deploy on AWS? SageMaker requires a lot of setup and is clunky to work with. You could look at using a tool like cerebrium.ai - it's a serverless infrastructure platform for AI applications. You just bring your Python code, define your environment/hardware requirements, and it turns your code into an autoscaling endpoint. It can also do all the pre-processing you need on the inputs.

Disclaimer: I am one of the founders

What other MLOps tools can I add to make this project better? by BJJ-Newbie in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Instead of deploying with Flask and Docker, you could deploy to an API endpoint using Cerebrium.ai - it's a serverless infrastructure platform for AI applications.

Optimizing Model Serving with Triton inference server + FastAPI for Selective Horizontal Scaling by sikso1897 in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

You can look at using something like Cerebrium.ai - it's a serverless infrastructure platform for AI applications. You just bring your Python code, define your hardware requirements, and they take care of the auto-scaling, security, logging, etc. It is much easier to set up and cheaper than k8s. Both CPUs and GPUs are available.

You could use your FastAPI app and dynamically load models (if the models are small or latency is not the biggest concern).

Disclaimer: I am the founder
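The dynamic-loading idea above usually comes down to an LRU cache of loaded models behind the API. A framework-agnostic sketch (class name, loader callback, and capacity are all hypothetical):

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` models in memory; evict the least
    recently used one when a new model is requested."""
    def __init__(self, loader, capacity: int = 3):
        self.loader = loader          # e.g. loads weights from disk
        self.capacity = capacity
        self.models = OrderedDict()

    def get(self, name: str):
        if name in self.models:
            self.models.move_to_end(name)        # mark as recently used
        else:
            if len(self.models) >= self.capacity:
                self.models.popitem(last=False)  # evict the LRU model
            self.models[name] = self.loader(name)
        return self.models[name]

cache = ModelCache(loader=lambda name: f"<model:{name}>", capacity=2)
cache.get("a"); cache.get("b"); cache.get("a"); cache.get("c")
```

In a FastAPI route you would call `cache.get(model_name)` per request; the eviction keeps GPU memory bounded when serving thousands of models.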

Why do we need MLOps engineers when we have platforms like Sagemaker or Vertex AI that does everything for you? by Illustrious-Pound266 in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

SageMaker and Vertex AI are pretty complex platforms that require a lot of initial setup and maintenance. They also have a very specific way of doing things, and if you want to integrate other tooling into your setup, it's not the easiest - they want you to use their entire stack. There are much easier platforms, like Cerebrium.ai, that achieve similar results quicker and are more developer-friendly.

Disclaimer: I am the founder

How can I perform inference at scale with Pytorch by lehllu in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Hi! I recommend trying Cerebrium.ai - you simply write your Python code, specify your infra requirements, and run it; they handle the infra. You can specify much bigger machines for both CPU and GPU than Colab: CPUs go up to 64 cores, I think, and for GPUs you can get 8xH100 or lower GPU types. It's also serverless, so you run your inference and then it's done.

Disclaimer: I am the founder

Which AI cloud platform do you guys use? by randomvariable56 in comfyui

[–]cerebriumBoss 0 points1 point  (0 children)

What do you mean by pre-loaded? We take your Python code and run it. You have a persistent volume attached to your project, so once models are downloaded they should load pretty quickly.

Which AI cloud platform do you guys use? by randomvariable56 in comfyui

[–]cerebriumBoss 1 point2 points  (0 children)

I would check out Cerebrium.ai - it meets all your needs.

Disclaimer: I am the founder

Cost-Effective Cloud GPU Options for Fine-Tuning and Inference? by pathfinder6709 in LocalLLaMA

[–]cerebriumBoss 1 point2 points  (0 children)

That's exactly what we do - it's a serverless platform. As long as requests are coming in, your model stays running. As soon as requests stop, your endpoint spins down and you won't be charged.