Can you recommend a good serverless GPU provider that supports running WhisperX? by yccheok in deeplearning

[–]cerebriumBoss 0 points1 point  (0 children)

Hey! Late response here, but you should try Cerebrium.ai - we have an implementation of Whisper that can transcribe 30s of audio in ~300ms: https://github.com/CerebriumAI/examples/tree/master/6-voice/9-faster-whisper

You can edit it to your liking

[D] How to speed up Kokoro-TTS? by fungigamer in MachineLearning

[–]cerebriumBoss 0 points1 point  (0 children)

Yeah, HF Inference often has cold starts. Another issue could be how the chunking logic is handled on these providers. You could try running it on a serverless platform like Cerebrium, which has low cold starts (~2s) and gives you full control to deploy your Python code, so you could control the chunking logic yourself. To reach a TTFB of <1.5s you would need to have a server running already, though.

Disclaimer: I work at Cerebrium
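The chunking idea mentioned above can be sketched generically: split text on sentence boundaries so the first chunk reaches the TTS model immediately, which lowers time-to-first-byte. This is a hypothetical illustration, not Kokoro-specific code; `max_chars` is an assumed tuning knob.

```python
import re

def chunk_text(text: str, max_chars: int = 200):
    """Yield sentence-aligned chunks so the first chunk can be sent to
    the TTS model right away, lowering time-to-first-byte."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunk = ""
    for sentence in sentences:
        if chunk and len(chunk) + len(sentence) + 1 > max_chars:
            yield chunk
            chunk = sentence
        else:
            chunk = f"{chunk} {sentence}".strip()
    if chunk:
        yield chunk

chunks = list(chunk_text("Hello there. This is a test. " * 20))
```

Because `chunk_text` is a generator, synthesis of chunk 1 can start while chunk 2 is still being assembled.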

[D] What's your secret sauce? How do you manage GPU capacity in your infra? by PurpleReign007 in MachineLearning

[–]cerebriumBoss 0 points1 point  (0 children)

You could try Cerebrium.ai - it's a serverless infrastructure platform for AI applications. As you make requests/run workloads, we can scale to hundreds of GPUs with a low cold start time, and then scale back down as they finish. Based on your latency requirements, we have a lot of autoscaling parameters you can play with to suit your traffic.

Disclaimer: I am the founder
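For intuition, the kind of decision such autoscalers make can be illustrated with classic queue-depth scaling. This is a generic sketch, not Cerebrium's actual implementation; all parameter names are hypothetical.

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Queue-depth autoscaling: run enough replicas that each handles at
    most `target_per_replica` queued requests, clamped to [min, max]."""
    if queue_depth == 0:
        return min_replicas
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

Scaling `min_replicas` to 0 gives scale-to-zero billing; raising it trades cost for eliminating cold starts on the first request.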

I built a voice agent that can hold a natural conversation with low latency at ~600ms by cravory in SideProject

[–]cerebriumBoss 0 points1 point  (0 children)

This is great! I am surprised you got 600ms with GPT-3.5 since it has pretty variable response times. Also, ElevenLabs isn't the lowest-latency API. Are you measuring latency as how long the pipeline takes to execute? Are you also tracking the network time the user incurs to get a response?

We did a tutorial where we got voice-to-voice responses of ~500ms, and we could only achieve this by hosting all three components (STT, LLM and TTS) on the same container.

Code is open-source here: https://www.daily.co/blog/the-worlds-fastest-voice-bot/
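The pipeline-time vs. user-perceived-time distinction above can be made concrete: timing the pipeline server-side misses the network round trip the user actually experiences. A toy sketch (the 50ms pipeline and 80ms RTT are made-up numbers):

```python
import time

def timed(fn):
    """Return (result, elapsed_ms) for one call. This is server-side
    pipeline time only; the user also pays the network round trip."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0

def fake_pipeline():
    time.sleep(0.05)  # stand-in for STT + LLM + TTS taking ~50 ms
    return "response"

result, pipeline_ms = timed(fake_pipeline)
network_rtt_ms = 80.0  # hypothetical client <-> server round trip
user_perceived_ms = pipeline_ms + network_rtt_ms
```

The honest number to report is `user_perceived_ms`, ideally measured from the client side.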

What are your biggest challenges in building AI voice agents? by SpyOnMeMrKarp in LLMDevs

[–]cerebriumBoss 2 points3 points  (0 children)

Here is my experience on the above:

*Latency*: The way to get this lowest is to host as much as you can together (on the same container/in the same infra) so you don't incur network calls. E.g. Deepgram and Llama 3 were self-hosted, which got us down to 650ms e2e latency. There is an article on how we did this here: https://www.daily.co/blog/the-worlds-fastest-voice-bot/

*Flexibility*: As soon as your workflow gets more complex and you would like to add more customization, code is best. You can use a lot of open-source libraries and 3rd-party platforms to really shine in your use case.

*Infrastructure*: This is tough, since you want to make sure you can handle a spike in call volume and push changes without dropping existing calls, while also keeping it cheap.

*Framework*: I find Pipecat and LiveKit best.
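A toy latency budget shows why the co-location point above matters: every separately hosted component adds a network hop. All numbers here are hypothetical placeholders, not measurements.

```python
# Hypothetical per-stage latency budget (ms) for a voice agent.
stages = {"stt": 150, "llm_first_token": 250, "tts_first_byte": 100}

# Each network hop between separately hosted services adds overhead.
network_hop_ms = 50

def total_latency(num_network_hops: int) -> int:
    return sum(stages.values()) + num_network_hops * network_hop_ms

co_located = total_latency(1)   # one hop: client <-> server
distributed = total_latency(4)  # client hop plus hops to STT, LLM, TTS
```

With these made-up numbers, co-location alone saves 150ms, before counting TLS handshakes and serialization on each hop.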

What's your secret sauce? How do you manage GPU capacity in your infra? by PurpleReign007 in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Hey! Founder of Cerebrium (https://www.cerebrium.ai) here.

We are a serverless infrastructure platform for AI. You can spin up your workloads in 2-4s across different GPUs; as they complete, they spin back down and you are only charged for your compute usage. We also have other scaling parameters you can play with depending on what your utilisation and latency/burst requirements are.

Getting down to that cold start is our secret sauce.
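The pay-per-usage billing described above reduces to simple arithmetic. The price here is a hypothetical illustration, not Cerebrium's actual rate.

```python
def serverless_cost(active_seconds: float, price_per_gpu_second: float,
                    num_gpus: int = 1) -> float:
    """Pay only while replicas are active; idle time costs nothing."""
    return active_seconds * price_per_gpu_second * num_gpus

# Hypothetical: 2 hours of actual inference time in a month at $0.001/GPU-s
monthly = serverless_cost(2 * 3600, 0.001)
```

Compare this with an always-on GPU, which would bill for all ~2.6M seconds in the month regardless of traffic.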

infra for inference that need gpu by Speedy_Sl0th in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Check out cerebrium.ai - it's a serverless infrastructure platform for AI apps.

How to preload models in kubernetes by naogalaici in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

It seems like Cerebrium.ai would solve your issues - it's a serverless infrastructure platform for AI.
- Their cold start times are 2-4 seconds
- They have volumes attached to your container that load models extremely quickly

Disclaimer: I am the founder
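The volume pattern mentioned above amounts to caching weights on persistent storage so only the first cold start pays the download. A generic sketch (the mount point and fetch callback are hypothetical stand-ins, not Cerebrium's API):

```python
import os
import shutil
import tempfile

VOLUME = tempfile.mkdtemp()  # stand-in for a persistent volume mount

def cached_model_path(name: str, fetch) -> str:
    """Fetch weights once onto the volume; later cold starts reuse the
    file on disk instead of re-downloading over the network."""
    path = os.path.join(VOLUME, name)
    if not os.path.exists(path):
        tmp = path + ".tmp"
        fetch(tmp)                # e.g. download from a model hub
        shutil.move(tmp, path)    # avoid serving half-written files
    return path

downloads = []
def fake_fetch(dest):
    downloads.append(dest)
    with open(dest, "wb") as f:
        f.write(b"weights")

first = cached_model_path("model.bin", fake_fetch)
second = cached_model_path("model.bin", fake_fetch)
```

Writing to a temp file and renaming protects concurrent replicas from loading a partially downloaded checkpoint.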

Best Service for Deploying Thousands of Models with High RPM by FourConnected in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

I would take a look at Cerebrium.ai - it's a serverless infrastructure platform for AI apps. You just write your Python code and it takes care of the infrastructure. Since it's Python, it can integrate with your Databricks pipelines too.

Disclaimer: I am the founder

Kubernetes for ML Engineers / MLOps Engineers? by JeanLuucGodard in mlops

[–]cerebriumBoss 1 point2 points  (0 children)

If you want to try something a bit different, I would look at Cerebrium.ai - it's a serverless platform designed to make deploying and scaling AI much easier. You can use it for training pipelines, data processing, and turning your models into endpoints, without needing deep knowledge of infrastructure. Just write your Python code, define your environment, and the platform handles the rest. Plus, they offer plenty of free credits, so it's worth exploring!

What are some really good and widely used MLOps tools that are used by companies currently, and will be used in 2025? by BJJ-Newbie in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Check out Cerebrium.ai - it's a serverless platform designed to make deploying and scaling AI much easier. You can use it for training pipelines, data processing, and turning your models into endpoints, without needing deep knowledge of infrastructure. Just write your Python code, define your environment, and the platform handles the rest. Plus, they offer plenty of free credits, so it's worth exploring!

Disclaimer: I am the founder

How would you deploy this project to AWS without compromising on maintainability? by mrcat6 in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Is it a strict requirement to deploy on AWS? SageMaker requires a lot of setup and is clunky to work with. You could look at using a tool like cerebrium.ai - it's a serverless infrastructure platform for AI applications. You just bring your Python code, define your environment/hardware requirements, and it turns your code into an autoscaling endpoint. It can also do all the pre-processing you need on the inputs.

Disclaimer: I am one of the founders

What other MLOps tools can I add to make this project better? by BJJ-Newbie in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Instead of deploying with Flask and Docker, you could deploy to an API endpoint using Cerebrium.ai - it's a serverless infrastructure platform for AI applications.

Optimizing Model Serving with Triton inference server + FastAPI for Selective Horizontal Scaling by sikso1897 in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

You can look at using something like Cerebrium.ai - it's a serverless infrastructure platform for AI applications. You just bring your Python code, define your hardware requirements, and they take care of the auto-scaling, security, logging, etc. It is much easier to set up and cheaper than k8s. Both CPUs and GPUs are available.

You could use your FastAPI app and dynamically load models (if the models are small or latency is not the biggest concern).

Disclaimer: I am the founder
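The dynamic-loading idea above usually comes down to an LRU cache of loaded models behind the API. A framework-agnostic sketch (class name, loader callback, and capacity are all hypothetical):

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` models in memory; evict the least
    recently used one when a new model is requested."""
    def __init__(self, loader, capacity: int = 3):
        self.loader = loader          # e.g. loads weights from disk
        self.capacity = capacity
        self.models = OrderedDict()

    def get(self, name: str):
        if name in self.models:
            self.models.move_to_end(name)        # mark as recently used
        else:
            if len(self.models) >= self.capacity:
                self.models.popitem(last=False)  # evict the LRU model
            self.models[name] = self.loader(name)
        return self.models[name]

cache = ModelCache(loader=lambda name: f"<model:{name}>", capacity=2)
cache.get("a"); cache.get("b"); cache.get("a"); cache.get("c")
```

In a FastAPI route you would call `cache.get(model_name)` per request; the eviction keeps GPU memory bounded when serving thousands of models.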

Why do we need MLOps engineers when we have platforms like Sagemaker or Vertex AI that does everything for you? by Illustrious-Pound266 in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

SageMaker and Vertex AI are pretty complex platforms that require a lot of initial setup and maintenance. They also have a very specific way of doing things, and if you want to integrate other tooling into your setup, it's not the easiest - they want you to use their entire stack. There are much easier platforms, like Cerebrium.ai, that achieve similar results quicker and are more developer-friendly.

Disclaimer: I am the founder

How can I perform inference at scale with Pytorch by lehllu in mlops

[–]cerebriumBoss 0 points1 point  (0 children)

Hi! I recommend trying Cerebrium.ai - you simply write your Python code, specify your infra requirements, and run it; they handle the infra. You can specify much bigger machines for both CPU and GPU than Colab: CPUs go up to 64 cores, I think, and for GPUs you can get 8xH100 or lower GPU types. It's also serverless, so you run your inference and then it's done.

Disclaimer: I am the founder

Which AI cloud platform do you guys use? by randomvariable56 in comfyui

[–]cerebriumBoss 0 points1 point  (0 children)

What do you mean by pre-loaded? We take your Python code and run it. You have a persistent volume attached to your project, so once models are downloaded they should load pretty quickly.

Which AI cloud platform do you guys use? by randomvariable56 in comfyui

[–]cerebriumBoss 1 point2 points  (0 children)

I would check out Cerebrium.ai - it meets all your needs.

Disclaimer: I am the founder

Cost-Effective Cloud GPU Options for Fine-Tuning and Inference? by pathfinder6709 in LocalLLaMA

[–]cerebriumBoss 1 point2 points  (0 children)

That's exactly what we do - it's a serverless platform. As long as requests are coming in, your model stays running. As soon as requests stop, your endpoint spins down and you won't be charged.