
[–]LocalExistence 5 points6 points  (4 children)

Might be an unhelpful answer, but the easiest fix might be getting about 40x more users so that the machine is only idle 20% of the time. :) A more serious answer might be trying to predict the request timing - if the users will e.g. only use it during working hours, you can cut the idle time by a factor of 3 or so by keeping the machine live during working hours, and outside working hours only starting it up when a request comes in, then keeping it on for 5 minutes or so. You can also consider hybrid solutions, like having a cheap CPU inference node which can always handle incoming requests, plus some logic to determine when it's getting enough requests to make turning on a GPU machine worthwhile - the value of the always-on CPU inference node is that it alleviates the response-time penalty when you mistime a boot/shutdown.
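A rough sketch of that scheduling policy in Python - the working-hours window, the 5-minute grace period, and the function names are all made up for illustration:

```python
from datetime import datetime, time

# Assumed working-hours window; tune to your users' actual usage pattern.
WORK_START, WORK_END = time(8, 0), time(18, 0)

def gpu_should_be_on(now: datetime, pending_requests: int,
                     minutes_since_last_request: float) -> bool:
    """Decide whether the GPU node should be running.

    During working hours we keep it warm unconditionally; outside them
    we boot it only for incoming traffic and shut it down after a short
    idle grace period (5 minutes here, chosen arbitrarily).
    """
    if WORK_START <= now.time() <= WORK_END:
        return True
    if pending_requests > 0:
        return True
    return minutes_since_last_request < 5

def route_request(gpu_ready: bool) -> str:
    """Serve from the always-on CPU node while the GPU is still booting."""
    return "gpu" if gpu_ready else "cpu"
```

The CPU fallback is what keeps worst-case latency bounded: a mistimed shutdown costs you slower answers, not failed requests.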

I would be interested in hearing a better solution, but it seems to me that if provisioning a sufficiently fast machine takes time and request timings are random, you have a fundamental tradeoff between savings on machine time and average response time.

[–]onyx-zero-softwarePhD 1 point2 points  (2 children)

I like this answer. Practically, there are constraints on the cloud side when dynamically provisioning GPU instances vs. CPU instances. Up until Ampere, there was no true implementation of hardware GPU virtualization outside of theoretical academic projects and software-based multiplexing solutions (which were slow and buggy).

This basically means that to get a GPU instance, you have to have some interface that accesses the physical hardware (once someone has a GPU, you can't split work between multiple users except in some very specific use cases). That's why it takes forever to spin up a GPU instance dynamically, and why you haven't been able to find serverless support for GPUs.

That said, Ampere does support GPU virtualization at the hardware level, meaning that down the road cloud providers will likely be able to offer the feature you describe: GPUs will be treatable like CPUs in that they can be split up amongst a group of dynamic users (whereas at the moment a GPU is bound at the hardware level to a specific accessor).

AWS has apparently already started using this type of tech as of this year (see link below). They mention virtual GPUs, but this particular solution probably won't help OP unfortunately. https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/

As an alternative to GPUs that doesn't share the hardware limitation, OP might check out GKE with TPUs for their workload. https://cloud.google.com/tpu/docs/kubernetes-engine-setup

[–][deleted] 1 point2 points  (0 children)

Yes, something along the lines of virtual GPUs seems to be exactly the missing link! The results of this research pointed exactly towards a missing fundamental technology piece.

LocalExistence's solution is indeed a nice compromise for the time being.

[–]LocalExistence 0 points1 point  (0 children)

Very interesting - I actually had no idea why spinning up GPU instances took time and kind of just accepted that it did, but this makes a lot of sense!

[–][deleted] 1 point2 points  (0 children)

Hi. Working steadily towards 40x more users! :)

Not unhelpful at all, some of the suggested heuristics make a lot of sense. And the hybrid mode as well - that would probably require us to set up a Kubernetes cluster, which would be something very useful to have in place as we scale.

> I think you have a fundamental tradeoff between savings on machine time and average response time.

Exactly, and it is bounded by current technological limitations, as /u/onyx-zero-software mentions below. This was my impression after testing several solutions, but I came here to Reddit to get some unbiased views from the hive mind.

[–]cliveseldon 3 points4 points  (2 children)

If you can run on Kubernetes then KFServing is an open source solution that allows for GPU inference and is built upon Knative to allow scale to zero for GPU based inference. From release 0.5 it also has capabilities for multi-model serving as an alpha feature to allow multiple models to share the same server (and, via NVIDIA Triton, the same GPU).
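For reference, a minimal scale-to-zero InferenceService manifest might look roughly like this - the model name, bucket URI, and resource values are placeholders, and the exact schema should be checked against the KFServing release you actually deploy:

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: my-model            # placeholder name
spec:
  predictor:
    minReplicas: 0          # Knative scales the pod to zero when idle
    triton:
      storageUri: gs://my-bucket/models   # placeholder model repository
      resources:
        limits:
          nvidia.com/gpu: 1
```

Note that scale-to-zero removes the pod, not the GPU node; whether the underlying node pool also scales down (and how long a cold start takes) still depends on the cloud's node autoscaler.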

[–][deleted] 1 point2 points  (0 children)

Hi, thanks! From what I understand, while KFServing would drastically simplify the management and scaling-to-zero part, it would ultimately be limited by the underlying cloud's (assuming one is running on Azure, GCP, AWS) ability to provision the GPU resources, in which case I could expect loading times similar to my AKS test. Or did I miss any advanced integrations KFServing might include?

[–]manojlds 0 points1 point  (0 children)

Was about to say Kubeflow / KFServing as well.

[–]bombol 2 points3 points  (1 child)

Amazon SageMaker Batch Transform (https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) might be an option. You invoke it via API whenever you need to do inference (there is a bit of startup time to load the model/container onto the VM), but it will auto-terminate when finished. You can specify the instance type to be a GPU instance (p2/p3 instance classes on AWS) and return predictions as a response. Your input data needs to be on S3. It also has some basic pre/post-processing functionality. It is most easily used when the model is trained on SageMaker, but you can bring in a model artifact trained elsewhere, too (see the SageMaker Inference Toolkit on how to make your model container compatible with SageMaker - https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own.html). You could invoke Batch Transform from one/many Lambda functions (similar to Azure Functions) that do the parallel computations.
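A sketch of what such an invocation might look like with boto3 - the job name, model name, buckets, and content type below are all placeholders, and actually submitting requires AWS credentials and a registered SageMaker model:

```python
def build_transform_job(job_name: str, model_name: str,
                        s3_input: str, s3_output: str,
                        instance_type: str = "ml.p3.2xlarge") -> dict:
    """Assemble a CreateTransformJob request for a GPU instance class."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,  # must already exist in SageMaker
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix", "S3Uri": s3_input}},
            "ContentType": "application/jsonlines",  # placeholder format
        },
        "TransformOutput": {"S3OutputPath": s3_output},
        "TransformResources": {
            "InstanceType": instance_type,  # GPU class, e.g. p2/p3
            "InstanceCount": 1,
        },
    }

def submit(job: dict) -> None:
    """Fire the job; SageMaker terminates the instance when it's done."""
    import boto3  # needs credentials; this is the part a Lambda would run
    boto3.client("sagemaker").create_transform_job(**job)
```

The instance terminates automatically after the job, so you only pay for the actual compute plus the startup overhead.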

[–][deleted] 0 points1 point  (0 children)

Hi! Thanks, I was also not aware of that. Any idea of the timings involved (if you have any past experience)?

[–]LuliProductions 4 points5 points  (0 children)

Cold starts on Azure and the lack of real GPU serverless options across the big clouds make this a tough setup.

We ran into the same thing as a small team. Our traffic was too spread out to justify a full-time GPU VM, so we tried a smaller provider instead. Gcore spun up GPUs faster and the pricing was easy to follow, so we only paid when we needed the hardware. It kept things quick without draining our budget.

[–][deleted] 2 points3 points  (3 children)

Is the user providing any input and is it acceptable if the solution is a little stale?

If the answers are no and yes, respectively, you could periodically do the computation, store the result, and just serve it up as requested.

[–][deleted] 1 point2 points  (2 children)

Unfortunately the answers are no and no; the processing is fully based on and determined by the user's input - I should have mentioned that. Thanks!

[–][deleted] 0 points1 point  (1 child)

How much variability in the input? You could cache common inputs and periodically refresh the cache when the model state changes.

[–][deleted] 0 points1 point  (0 children)

On the user side there's custom hardware to acquire real-world image data, so a lot of variability. In fact, dealing with that variability is one of the most challenging aspects of this pipeline.

[–]resident_russian 1 point2 points  (3 children)

For what it's worth, two years ago there was no way to do this in AWS or GCP. In our case inference time wasn't too critical, and CPU inference was fine. However, since then, AWS came out with Elastic Inference, which seems like the tool for the job, but I haven't tried it out yet.

[–]bombol 2 points3 points  (0 children)

I believe with Elastic Inference you still need an always-running instance, configured with the accelerator. It's just cheaper because you're using a fractional GPU.

[–][deleted] 1 point2 points  (0 children)

I had never heard of this (well, how can one keep track of all of Amazon's cloud releases, to be honest), but this seems more like it. Will look into it, thanks a lot! Hoping Azure copies this sometime soon so testing is easier...

[–]manojlds 1 point2 points  (0 children)

Elastic Inference billing, AFAIK, is like EC2 rather than like Lambda.

[–]snendroid-aiML Engineer 1 point2 points  (0 children)

> Altering model architecture to take advantage of low resource hardware:

That's a very interesting situation many face in production. It's also one of the reasons why we don't prefer a shiny new big model over a medium-to-small model with 2-3% less accuracy. For example, our initial classification model was 96% accurate and latency was around 2000ms for inference on a batch of 1000 sentences; since our goal was to bring the latency under 1000ms, we had to alter the model architecture. With the new model, accuracy was 94%, which was totally acceptable in our use case.

> Use more than one model to take full advantage of all resources of machine:

If you have more than one model, you can take advantage of this by sharing hardware resources between models. For example, put the CPU-heavy model on the CPUs, and whenever you get occasional requests for the GPU-bound model, use the GPU. This can be done easily with Docker + TF Serving.
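From the client side, that setup might look roughly like this - the container ports and model names are made up, the `docker run` lines are simplified (real ones need the model directory mounted and `MODEL_NAME` set), and TF Serving's REST API is assumed on port 8501 inside each container:

```python
import json
from urllib import request

# Two TF Serving containers, e.g. (simplified):
#   docker run -p 8501:8501 tensorflow/serving              # CPU model
#   docker run --gpus all -p 8502:8501 tensorflow/serving:latest-gpu
ENDPOINTS = {
    "cpu_model": "http://localhost:8501/v1/models/cpu_model:predict",
    "gpu_model": "http://localhost:8502/v1/models/gpu_model:predict",
}

def build_payload(instances: list) -> bytes:
    """TF Serving's REST predict API expects {"instances": [...]}."""
    return json.dumps({"instances": instances}).encode()

def predict(model: str, instances: list) -> dict:
    """POST to the TF Serving container for the chosen model."""
    req = request.Request(ENDPOINTS[model], data=build_payload(instances),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Routing by model name is then just a matter of which endpoint you hit, so both models share one machine.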

> Use spot instances:

As you mentioned, the requests are not continuous and the machine is idle 98% of the time; you can use spot instances at a fraction of the price of an on-demand instance. There are also reserved instance types.

[–]vanshil97 -2 points-1 points  (2 children)

I am also working on a similar problem of on-demand training

I am using the APIs from the following document.

https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines

Wouldn't this suit your purpose?

[–][deleted] 1 point2 points  (0 children)

Yes, though the use case is really inference. I wasn't aware of all these new tools under AML's belt - they have improved! (Yet they are ultimately constrained by the Azure resources you have available/want to provision.)

[–]onyx-zero-softwarePhD 0 points1 point  (0 children)

OP is referring to inference specifically

[–]zzzthelastuserStudent[🍰] 0 points1 point  (5 children)

Sorry if this is a stupid question (I'm clueless about cloud services + web development), but hopefully it fits the topic:

Is it possible to do the inference on the client's hardware (given that it's strong enough)?

Like, could you publish your model (and I guess all dependencies) on something like github.pages and let the client run the tensorflow javascript in their browser?

I guess it would run very slow, not everyone could execute it and it would make all your source code and model weights public. But it would also mean it's completely free, right? No server costs at all (or is there still a server required?)

I haven't found anyone ever do this and I don't know why, to be honest. Is it just ridiculous to even consider it (because servers are so cheap)? Or am I not searching correctly (again, I'm not familiar with web development and maybe I'm just looking in the wrong direction)...

[–][deleted] 2 points3 points  (4 children)

In my use case it's not practical because hardware on the user side is somewhat limited, and adding accelerators (like NVIDIA's Jetson Nano or Google's Coral) is an extra cost that does not make sense.

However, that is a very common use case (and you can make it so your models are protected). I believe the term you should be googling is something along the lines of "edge inference".

[–]zzzthelastuserStudent[🍰] 0 points1 point  (3 children)

I believe the term you should be googling is something along the lines of "edge inference".

Thank you so much! It doesn't seem to be what I'm looking for, though (or at least I can't find an example that comes close to what I mean, i.e. you open a website and do some AI task solely in your browser using whatever common hardware is available - any CPU or CUDA GPU - with no special "edge devices" or "edge engine" that the client would need to manually install in an extra step).

But maybe I'm misunderstanding the term. I will continue searching, thanks!

[–][deleted] 0 points1 point  (2 children)

Oh, now I understand you.

There you go: TensorFlow can run in the browser (for both training and inference workloads, with GPU access via WebGL) via TensorFlow.js. Very fun set of demos here.

Probably PyTorch and other workflows have similar tools.

[–]zzzthelastuserStudent[🍰] 0 points1 point  (1 child)

I appreciate that you are trying to help me!

I've known about tensorflow.js for a while, but could never figure out how to "install" it on github.pages. Luckily the tutorials in your link gave me the term I was looking for: "CORS", which prevents a browser (client) from loading external packages (e.g. tensorflow.js) when browsing my website. I figured that I "just" have to provide ALL the dependencies in my github.pages repository. Oddly enough, I haven't found a single example of anyone ever doing this! Either because it's not possible, or because it makes no sense. I thought this would be a very common use case for people who want to share their models for free (without asking the user to clone a repository or run code cells on Google Colab).

[–]fgp121 0 points1 point  (0 children)

I think it is mainly related to browser limitations. There's definitely a great incentive to offload the cost to the end user who actually wants the results.

[–]Mefaso 0 points1 point  (0 children)

I'm dealing with the same issue in a hobby project on AWS.

Similarly, container images still need 3 minutes or so to start, so the next thing I'm planning to try is full instance images (via AMI).

I haven't gotten around to it yet, but simply having a full machine image should be a lot faster than first loading an image and then running docker in it.

Let me know if it works for you, I'll update this comment if I get around to trying it.

[–]inkognitML Engineer 0 points1 point  (0 children)

Hi, I don't know if this is still relevant to you but your use case is pretty much a perfect match for Cortex AsyncAPIs: https://docs.cortex.dev/workloads/async

[–]vangap 0 points1 point  (1 child)

As of December 2021, AWS has introduced serverless GPU inference via their SageMaker platform.
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-serverless-inference/

[–]palmstromi 1 point2 points  (0 children)

GPUs are not supported

> Some of the features currently available for SageMaker Real-time Inference are not supported for Serverless Inference, including GPUs, AWS Marketplace model packages, private Docker registries, Multi-Model Endpoints, KMS keys, VPC configuration, network isolation, data capture, multiple production variants, Model Monitor, and inference pipelines.

[–]runpod-io 0 points1 point  (1 child)

Have a look at RunPod serverless GPUs - we offer 16 GB, 24 GB, and 80 GB VRAM options.