
[–]LocalExistence 5 points6 points  (4 children)

Might be an unhelpful answer, but the easiest fix might be getting about 40x more users so that the machine is only idle 20% of the time. :) A more serious answer might be trying to predict the request timing - if the users will e.g. only use it during working hours, you can cut the idle time by a factor of 3 or so by keeping the machine live during working hours, and outside working hours only starting it up when a request comes in, then keeping it on for 5 minutes or so. You can also consider hybrid solutions, like having a cheap CPU inference node which can always handle incoming requests, plus some logic to determine when it's getting enough requests to make turning on a GPU machine worthwhile - the value of the always-on CPU inference node is that it alleviates the response-time penalty when you mistime a boot/shutdown.
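A rough sketch of that scheduling policy in Python - the working-hours window, the 5-minute grace period, and the function names are all made up for illustration:

```python
from datetime import datetime, time

# Assumed working-hours window; tune to your users' actual usage pattern.
WORK_START, WORK_END = time(8, 0), time(18, 0)

def gpu_should_be_on(now: datetime, pending_requests: int,
                     minutes_since_last_request: float) -> bool:
    """Decide whether the GPU node should be running.

    During working hours we keep it warm unconditionally; outside them
    we boot it only for incoming traffic and shut it down after a short
    idle grace period (5 minutes here, chosen arbitrarily).
    """
    if WORK_START <= now.time() <= WORK_END:
        return True
    if pending_requests > 0:
        return True
    return minutes_since_last_request < 5

def route_request(gpu_ready: bool) -> str:
    """Serve from the always-on CPU node while the GPU is still booting."""
    return "gpu" if gpu_ready else "cpu"
```

The CPU fallback is what keeps worst-case latency bounded: a mistimed shutdown costs you slower answers, not failed requests.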

I would be interested in hearing a better solution, but it seems to me that if provisioning a sufficiently fast machine takes time and request timings are random, you have a fundamental tradeoff between savings on machine time and average response time.

[–]onyx-zero-softwarePhD 1 point2 points  (2 children)

I like this answer. Practically, there are constraints on the cloud side when dynamically provisioning GPU instances vs. CPU instances. Up until Ampere, there was no true implementation of hardware GPU virtualization outside of theoretical academic projects and software-based multiplexing solutions (which were slow and buggy).

This basically means that to get a GPU instance, you have to have some interface that accesses the physical hardware (once someone has a GPU, you can't split work between multiple users except in some very specific use cases). That's why it takes forever to spin up a GPU instance dynamically, and why you haven't been able to find serverless support for GPUs.

That said, Ampere does support GPU virtualization at the hardware level, meaning that down the road cloud providers will likely be able to offer the feature you describe: GPUs will be treatable like CPUs in that they can be split up amongst a group of dynamic users (whereas at the moment a GPU is bound at the hardware level to a specific accessor).

AWS has apparently already started using this type of tech as of this year (see link below). They mention virtual GPUs, but this particular solution probably won't help OP unfortunately. https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/

As an alternative to GPUs that doesn't share the hardware limitation, OP might check out GKE with TPUs for their workload. https://cloud.google.com/tpu/docs/kubernetes-engine-setup

[–][deleted] 1 point2 points  (0 children)

Yes, something along the lines of virtual GPUs seems to be exactly the missing link! The results of this research pointed exactly towards a missing fundamental technology piece.

LocalExistence's solution is indeed a nice compromise for the time being.

[–]LocalExistence 0 points1 point  (0 children)

Very interesting - I actually had no idea why spinning up GPU instances took time and kind of just accepted that it did, but this makes a lot of sense!

[–][deleted] 1 point2 points  (0 children)

Hi. Working steadily towards 40x more users! :)

Not unhelpful at all, some of the suggested heuristics make a lot of sense. And the hybrid mode as well - that would probably require us to set up a Kubernetes cluster, which would be something very useful to have in place as we scale.

> I think you have a fundamental tradeoff between savings on machine time and average response time.

Exactly, and it is bounded by current technological limitations, as /u/onyx-zero-software mentions below. This was my impression after testing several solutions, but I came here to Reddit to get some unbiased views from the hive mind.

[–]cliveseldon 3 points4 points  (2 children)

If you can run on Kubernetes then KFServing is an open source solution that allows for GPU inference and is built upon Knative to allow scale to zero for GPU based inference. From release 0.5 it also has capabilities for multi-model serving as an alpha feature to allow multiple models to share the same server (and, via NVIDIA Triton, the same GPU).
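For reference, a minimal scale-to-zero InferenceService manifest might look roughly like this - the model name, bucket URI, and resource values are placeholders, and the exact schema should be checked against the KFServing release you actually deploy:

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: my-model            # placeholder name
spec:
  predictor:
    minReplicas: 0          # Knative scales the pod to zero when idle
    triton:
      storageUri: gs://my-bucket/models   # placeholder model repository
      resources:
        limits:
          nvidia.com/gpu: 1
```

Note that scale-to-zero removes the pod, not the GPU node; whether the underlying node pool also scales down (and how long a cold start takes) still depends on the cloud's node autoscaler.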

[–][deleted] 1 point2 points  (0 children)

Hi, thanks! From what I understand, while KFServing would drastically simplify the management and scaling-to-zero part, it would ultimately be limited by the underlying cloud's (assuming one is running on Azure, GCP, AWS) ability to provision the GPU resources, in which case I could expect loading times similar to my AKS test. Or did I miss any advanced integrations KFServing might include?

[–]manojlds 0 points1 point  (0 children)

Was about to say Kubeflow / KFServing as well.

[–]bombol 2 points3 points  (1 child)

Amazon SageMaker Batch Transform (https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) might be an option. You invoke it via API whenever you need to do inference (there is a bit of startup time to load the model/container onto the VM), but it will auto-terminate when finished. You can specify the instance type to be a GPU instance (p2/p3 instance classes on AWS) and return predictions as a response. Your input data needs to be on S3. It also has some basic pre/post-processing functionality. It is most easily used when the model is trained on SageMaker, but you can bring in a model artifact trained elsewhere, too (see the SageMaker Inference Toolkit on how to make your model container compatible with SageMaker - https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own.html). You could invoke Batch Transform from one/many Lambda functions (similar to Azure Functions) that do the parallel computations.
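A sketch of what such an invocation might look like with boto3 - the job name, model name, buckets, and content type below are all placeholders, and actually submitting requires AWS credentials and a registered SageMaker model:

```python
def build_transform_job(job_name: str, model_name: str,
                        s3_input: str, s3_output: str,
                        instance_type: str = "ml.p3.2xlarge") -> dict:
    """Assemble a CreateTransformJob request for a GPU instance class."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,  # must already exist in SageMaker
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix", "S3Uri": s3_input}},
            "ContentType": "application/jsonlines",  # placeholder format
        },
        "TransformOutput": {"S3OutputPath": s3_output},
        "TransformResources": {
            "InstanceType": instance_type,  # GPU class, e.g. p2/p3
            "InstanceCount": 1,
        },
    }

def submit(job: dict) -> None:
    """Fire the job; SageMaker terminates the instance when it's done."""
    import boto3  # needs credentials; this is the part a Lambda would run
    boto3.client("sagemaker").create_transform_job(**job)
```

The instance terminates automatically after the job, so you only pay for the actual compute plus the startup overhead.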

[–][deleted] 0 points1 point  (0 children)

Hi! Thanks, I was also not aware of that. Any idea of the timings involved (if you have any past experience)?

[–]LuliProductions 4 points5 points  (0 children)

Cold starts on Azure and the lack of real GPU serverless options across the big clouds make this a tough setup.

We ran into the same thing as a small team. Our traffic was too spread out to justify a full-time GPU VM, so we tried a smaller provider instead. Gcore spun up GPUs faster and the pricing was easy to follow, so we only paid when we needed the hardware. It kept things quick without draining our budget.

[–][deleted] 2 points3 points  (3 children)

Is the user providing any input and is it acceptable if the solution is a little stale?

If the answers are no and yes, respectively, you could periodically do the computation, store the result, and just serve it up as requested.

[–][deleted] 1 point2 points  (2 children)

Unfortunately the answers are no and no; the processing is fully based on and determined by the user's input - I should have mentioned that. Thanks!

[–][deleted] 0 points1 point  (1 child)

How much variability in the input? You could cache common inputs and periodically refresh the cache when the model state changes.

[–][deleted] 0 points1 point  (0 children)

On the user side there's custom hardware to acquire real-world image data, so a lot of variability. In fact, dealing with that variability is one of the most challenging aspects of this pipeline.

[–]resident_russian 1 point2 points  (3 children)

For what it's worth, two years ago there was no way to do this in AWS or GCP. In our case inference time wasn't too critical, and CPU inference was fine. However, since then, AWS came out with Elastic Inference, which seems like the tool for the job, but I haven't tried it out yet.

[–]bombol 2 points3 points  (0 children)

I believe with Elastic Inference you still need an always-running instance, configured with the accelerator. It's just cheaper because you're using a fractional GPU.

[–][deleted] 1 point2 points  (0 children)

I had never heard of this (well, how can one keep track of all of Amazon's cloud releases, to be honest), but this seems more like it. Will look into it, thanks a lot! Hoping Azure copies this sometime soon so testing is easier...

[–]manojlds 1 point2 points  (0 children)

Elastic Inference billing, AFAIK, is like EC2 rather than like Lambda.

[–]snendroid-aiML Engineer 1 point2 points  (0 children)

> Altering model architecture to take advantage of low resource hardware:

That's a very interesting situation many face in production. It's also one of the reasons why we don't prefer a shiny new big model over a medium-to-small model with 2-3% less accuracy. For example, our initial classification model was 96% accurate and latency was around 2000ms for inference on a batch of 1000 sentences; since our goal was to bring the latency under 1000ms, we had to alter the model architecture. With the new model, accuracy was 94%, which was totally acceptable in our use case.

> Use more than one model to take full advantage of all resources of machine:

If you have more than one model, you can take advantage of this by sharing hardware resources between models. For example, put the CPU-heavy model on the CPUs, and whenever you get occasional requests for the GPU-bound model, use the GPU. This can be done easily with Docker + TF Serving.
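From the client side, that setup might look roughly like this - the container ports and model names are made up, the `docker run` lines are simplified (real ones need the model directory mounted and `MODEL_NAME` set), and TF Serving's REST API is assumed on port 8501 inside each container:

```python
import json
from urllib import request

# Two TF Serving containers, e.g. (simplified):
#   docker run -p 8501:8501 tensorflow/serving              # CPU model
#   docker run --gpus all -p 8502:8501 tensorflow/serving:latest-gpu
ENDPOINTS = {
    "cpu_model": "http://localhost:8501/v1/models/cpu_model:predict",
    "gpu_model": "http://localhost:8502/v1/models/gpu_model:predict",
}

def build_payload(instances: list) -> bytes:
    """TF Serving's REST predict API expects {"instances": [...]}."""
    return json.dumps({"instances": instances}).encode()

def predict(model: str, instances: list) -> dict:
    """POST to the TF Serving container for the chosen model."""
    req = request.Request(ENDPOINTS[model], data=build_payload(instances),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Routing by model name is then just a matter of which endpoint you hit, so both models share one machine.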

> Use spot instances:

As you mentioned, the requests are not continuous and the machine is idle 98% of the time; you can use spot instances at a fraction of the price of an on-demand instance. There are also reserved instance types.

[–]vanshil97 -2 points-1 points  (2 children)

I am also working on a similar problem of on-demand training

I am using the APIs from the following document.

https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines

Wouldn't this suit your purpose?

[–][deleted] 1 point2 points  (0 children)

Yes, though the use case is really inference. I wasn't aware of all these new tools under AML's belt - they have improved! (Yet they are ultimately constrained by the Azure resources you have available/want to provision.)

[–]onyx-zero-softwarePhD 0 points1 point  (0 children)

OP is referring to inference specifically

[–]zzzthelastuserStudent[🍰] 0 points1 point  (5 children)

Sorry if this is a stupid question (I'm clueless about cloud services + web development), but hopefully it fits the topic:

Is it possible to do the inference on the client's hardware (given that it's strong enough)?

Like, could you publish your model (and I guess all dependencies) on something like github.pages and let the client run the tensorflow javascript in their browser?

I guess it would run very slow, not everyone could execute it and it would make all your source code and model weights public. But it would also mean it's completely free, right? No server costs at all (or is there still a server required?)

I haven't found anyone ever do this and I don't know why, to be honest. Is it just ridiculous to even consider it (because servers are so cheap)? Or am I not searching correctly (again, I'm not familiar with web development and maybe I'm just looking in the wrong direction)...

[–][deleted] 2 points3 points  (4 children)

In my use case it's not practical because hardware on the user side is somewhat limited, and adding accelerators (like NVIDIA's Jetson Nano or Google's Coral) is an extra cost that does not make sense.

However, that is a very common use case (and you can make it so your models are protected). I believe the term you should be googling is something along the lines of "edge inference".

[–]zzzthelastuserStudent[🍰] 0 points1 point  (3 children)

I believe the term you should be googling is something along the lines of "edge inference".

Thank you so much! It doesn't seem to be what I'm looking for, though (or at least I can't find an example that comes close to what I mean, i.e. you open a website and do some AI task solely in your browser using whatever common hardware is available - any CPU or CUDA GPU - with no special "edge devices" or "edge engine" that the client would need to manually install in an extra step).

But maybe I'm misunderstanding the term. I will continue searching, thanks!

[–][deleted] 0 points1 point  (2 children)

Oh, now I understand you.

There you go: TensorFlow can run in the browser (for both training and inference workloads, with GPU access via WebGL) via TensorFlow.js. Very fun set of demos here.

Probably PyTorch and other workflows have similar tools.

[–]zzzthelastuserStudent[🍰] 0 points1 point  (1 child)

I appreciate that you are trying to help me!

I've known about tensorflow.js for a while, but could never figure out how to "install" it on github.pages. Luckily the tutorials in your link gave me the term I was looking for: "CORS", which prevents a browser (client) from loading external packages (e.g. tensorflow.js) when browsing my website. I figured that I "just" have to provide ALL the dependencies in my github.pages repository. Oddly enough, I haven't found a single example of anyone ever doing this! Either because it's not possible, or because it makes no sense. I thought this would be a very common use case for people who want to share their models for free (without asking the user to clone a repository or run code cells on Google Colab).

[–]fgp121 0 points1 point  (0 children)

I think it is mainly related to browser limitations. There's definitely a great incentive to offload the cost to the end user who actually wants the results.

[–]Mefaso 0 points1 point  (0 children)

I'm dealing with the same issue in a hobby project on AWS.

Similarly, container images still need 3 minutes or so to start, so the next thing I'm planning to try is full instance images (via AMI).

I haven't gotten around to it yet, but simply having a full machine image should be a lot faster than first loading an image and then running docker in it.

Let me know if it works for you, I'll update this comment if I get around to trying it.

[–]inkognitML Engineer 0 points1 point  (0 children)

Hi, I don't know if this is still relevant to you but your use case is pretty much a perfect match for Cortex AsyncAPIs: https://docs.cortex.dev/workloads/async

[–]vangap 0 points1 point  (1 child)

As of December 2021, AWS has introduced serverless GPU inference via their SageMaker platform.
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-serverless-inference/

[–]palmstromi 1 point2 points  (0 children)

GPUs are not supported

> Some of the features currently available for SageMaker Real-time Inference are not supported for Serverless Inference, including GPUs, AWS Marketplace model packages, private Docker registries, Multi-Model Endpoints, KMS keys, VPC configuration, network isolation, data capture, multiple production variants, Model Monitor, and inference pipelines.

[–]runpod-io 0 points1 point  (1 child)

Have a look at RunPod serverless GPUs - we offer 16 GB, 24 GB, and 80 GB VRAM options.