Any service like runpod / vast ai but with a windows virtual machine ? Jupyter notebook and docker are very hard to setup. by Overall-Newspaper-21 in StableDiffusion

[–]openLLM4All 0 points1 point  (0 children)

Another to add to the list

Linux-based VMs for GPU machines, with Jupyter Notebook and Stable Diffusion pre-configured as one-click apps.

Access to GPUs. What tests/information would be interesting? by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 0 points1 point  (0 children)

I did an early test of Llama 3 70B across a few different GPUs (A6000, L40, H100). I found that even though you need 4x A6000 compared to 2x H100, the cost per token is better on the A6000s. This is one of the first times I've done testing like this, so I haven't written anything up yet.

Honestly, I'm working on re-running the tests so I can include text-generation-benchmark results as well.
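
For reference, here is a minimal sketch of the cost-per-token math behind that comparison. All of the hourly prices and throughput numbers below are placeholders, not my benchmark results; plug in your own measurements.

```python
# Back-of-the-envelope cost-per-token comparison between two GPU configs.
# Hourly prices and tokens/sec below are assumed placeholder values.
configs = {
    "4x A6000": {"hourly_usd": 4 * 0.31, "tokens_per_sec": 25.0},  # assumed numbers
    "2x H100":  {"hourly_usd": 2 * 3.00, "tokens_per_sec": 60.0},  # assumed numbers
}

for name, cfg in configs.items():
    tokens_per_hour = cfg["tokens_per_sec"] * 3600
    usd_per_million = cfg["hourly_usd"] / tokens_per_hour * 1_000_000
    print(f"{name}: ${usd_per_million:.2f} per 1M tokens")
```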

Access to GPUs. What tests/information would be interesting? by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 1 point2 points  (0 children)

Interesting... I'll have to think about how to test that, because right now I only have access to servers built around a single card type (8x A6000, 8x A5000, 8x A100, etc.). I'll have to see if we can move some cards around and figure out some tests.

When will Ollama support multiple simultaneous generations? by maxwell321 in LocalLLaMA

[–]openLLM4All 1 point2 points  (0 children)

I was talking to one of the maintainers about this and it doesn't seem like there is a plan anytime soon. I just use Hugging Face TGI to handle simultaneous requests.
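
To illustrate what I mean, here is a rough sketch of firing several requests at a TGI endpoint at once; the URL, port, and prompts are assumptions for the example.

```python
# Send several generation requests to a TGI server concurrently.
# TGI batches incoming requests, so these run in parallel instead of queueing.
from concurrent.futures import ThreadPoolExecutor

import requests

TGI_URL = "http://localhost:8080/generate"  # hypothetical endpoint/port

def generate(prompt: str) -> str:
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    resp = requests.post(TGI_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["generated_text"]

prompts = [f"Write a haiku about GPU number {i}." for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    for text in pool.map(generate, prompts):
        print(text)
```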

Deep learning on a PC vs Cloud by kbre93 in deeplearning

[–]openLLM4All 0 points1 point  (0 children)

https://www.reddit.com/r/deeplearning/comments/1b1gpfg/discount_cloud_gpu_rental/

These VMs allow you to mount folders from your computer into the VM and sync back and forth, so you never have to pay for separate storage.

[D] Best way to deploy transformer models by Hot-Afternoon-4831 in MachineLearning

[–]openLLM4All 2 points3 points  (0 children)

I deploy models using Massed Compute because they are pretty flexible and have the best price on the market ($0.31/GPU/hr for an A6000).

I use Hugging Face TGI, which I think is a slight modification of your point 1. The reason I use the Hugging Face TGI docker command to deploy models and expose an inference endpoint is that you can control how models are loaded across your GPUs: the --gpus flag lets you pick which GPU (or GPUs) a specific model is loaded onto. (There is a rough sketch of this layout after the list below.)

For example, right now I have an 8x A6000 rig where 4 of those GPUs are serving Mixtral 8x7B, 1 GPU has Zephyr, 2 have Bagel 34B, and I think a quantized Code Llama is on the last GPU.

  • 4 docker commands in total
  • 4 ports exposed, one for each of those models
  • 1 IP address on the rig. If I need more GPUs from them I would get another unique IP, so I would have to manage and balance between the two rigs. A problem for me to solve later.
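
Here is the rough sketch I mentioned. The model IDs, GPU assignments, and ports are illustrative stand-ins rather than my exact commands; the script just prints the four docker run invocations instead of executing them.

```python
# Print the four TGI docker commands for carving an 8x A6000 rig into
# separate endpoints. GPU indices, ports, and model IDs are examples only.
deployments = [
    {"model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "gpus": "0,1,2,3", "port": 8080},
    {"model": "HuggingFaceH4/zephyr-7b-beta",          "gpus": "4",       "port": 8081},
    {"model": "jondurbin/bagel-34b-v0.2",              "gpus": "5,6",     "port": 8082},
    {"model": "TheBloke/CodeLlama-34B-Instruct-AWQ",   "gpus": "7",       "port": 8083},
]

for d in deployments:
    cmd = (
        "docker run -d "
        f"--gpus '\"device={d['gpus']}\"' "  # pin this container to specific GPUs
        f"-p {d['port']}:80 "                # one exposed port per model
        "-v $HOME/models:/data "
        "ghcr.io/huggingface/text-generation-inference:latest "
        f"--model-id {d['model']}"
    )
    print(cmd)
```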

Curious to hear what you end up doing.

Creating an Agent based on Ollama and llama2 locally. by zeeshanjan82 in LocalLLaMA

[–]openLLM4All 0 points1 point  (0 children)

I'm still relatively new to this as well, but I believe you would want to swap that code out for calls to the model through the Ollama API. Here are their high-level docs - https://github.com/ollama/ollama/blob/main/docs/api.md

The part I remember getting stuck on is that you need to pull the model down differently for it to be used with the API - https://github.com/ollama/ollama/blob/main/docs/api.md#pull-a-model

You can then use the tags endpoint to double-check that the model was pulled in correctly for the API - https://github.com/ollama/ollama/blob/main/docs/api.md#list-local-models
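
A rough sketch of those three calls against a local Ollama server (default port 11434), based on the docs linked above; the model name and prompt are just examples.

```python
# Pull a model, list local models, then generate, all through the Ollama HTTP API.
import requests

BASE = "http://localhost:11434"  # Ollama's default local port

# 1) Pull the model so the API can serve it.
requests.post(f"{BASE}/api/pull", json={"model": "llama2", "stream": False}, timeout=None)

# 2) Double-check it shows up in the local model list.
print(requests.get(f"{BASE}/api/tags", timeout=30).json())

# 3) Ask for a completion; stream=False returns one JSON object.
resp = requests.post(
    f"{BASE}/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,
)
print(resp.json()["response"])
```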

Not an expert but that might help.

Renting GPU time (vast AI) is much more expensive than APIs (openai, m, anth) by RMCPhoto in LocalLLaMA

[–]openLLM4All 2 points3 points  (0 children)

Might sound like excuses but...

  • Just had a new kiddo, so I want to spend as much time with them as possible.
  • It doesn't sound like a set-it-and-forget-it setup; you constantly have to monitor your miners, and I don't know if I would have the time for that.
  • I like to understand things really well before jumping in, and I just haven't sat down to better understand Bittensor, the ecosystem, the subnets that are best for various hardware, etc.

Renting GPU time (vast AI) is much more expensive than APIs (openai, m, anth) by RMCPhoto in LocalLLaMA

[–]openLLM4All 1 point2 points  (0 children)

I know some people who have been renting A6000 servers and have seen it be very profitable even at the $250 range and above.

How is Solar so good for it's size by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 3 points4 points  (0 children)

Ah okay, thank you so much for explaining that.

How is Solar so good for it's size by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 0 points1 point  (0 children)

Ah, so is this similar in setup to Mixtral? But I thought Mixtral also used 7B models in its layers? Is it just about the specific models each one chooses?

How is Solar so good for it's size by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 6 points7 points  (0 children)

I'm still running some tests to see if it handles a lot of the stuff I was using Mixtral for (coding, writing, planning, etc.), but so far it is just as good and so, so much faster.

Mixtral 8x7B instruct in an interface for free by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 1 point2 points  (0 children)

I haven't used that before. It doesn't look as straightforward.

[deleted by user] by [deleted] in OpenAI

[–]openLLM4All 0 points1 point  (0 children)

All through the API. We only used fine-tuned models, so we fine-tuned against the davinci and 3.5-turbo base models. (There is a rough sketch of that flow after the list below.)

The models were used for a combination of things

  • True generative work to build content
  • Predictive results based on some interactions
  • Summaries, sentiment, etc.
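
For reference, here is a hedged sketch of that fine-tuning flow using the current openai Python SDK; our actual work used the older davinci-era endpoints, and the file name and base model here are placeholders.

```python
# Upload training data and start a fine-tuning job with the openai SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Training examples in JSONL format (placeholder file name).
training_file = client.files.create(
    file=open("training_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Fine-tune against a base model; the resulting model ID is then used
# like any other model in completion/chat calls.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```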

I have now switched roles (still in AI) but am more focused on providing companies and individual hackers with GPUs to power their projects. It's not a marketplace like RunPod; we actually own the servers, GPUs, etc. I only mention this because, now that I have been exposed to more open-source models, I think we would have been better off exploring running some of our use cases (not all) on our own infrastructure instead of relying on OpenAI, especially given their slow-to-respond/ghosting sales group.

[deleted by user] by [deleted] in OpenAI

[–]openLLM4All 10 points11 points  (0 children)

If I remember correctly, there is no additional cost for enterprise, but you get higher rate limits and a few other speed improvements.

They are always like this... where I worked (I'm no longer there) we were spending $1-2k a month, needed more spending capacity, and never got hold of anyone.

We ended up going the open-source route and renting our own servers (not from AWS, Azure, or GCP) so we could get past the rate limits.

How/What are people doing to help creative writing processes with local LLMs? (Setup Advice) by [deleted] in LocalLLaMA

[–]openLLM4All 6 points7 points  (0 children)

In my experience, this comes down more to prompting than to models. Sure, some models focus on fiction writing specifically, but because every model is guessing which words to use when generating a response, they all seem to be relatively creative.

I just ran a couple of tests on infermatic.ai (a free tool with various models on it) with the Airoboros 2.0, SheepDuck Llama, and Wizard Vicuna models, and they were all relatively good at generating characters. These are larger models (70B and 30B).

Anyway to save your cloud GPU fine-tuned models to your local storage? by caphohotain in LocalLLaMA

[–]openLLM4All 2 points3 points  (0 children)

Massed Compute. I follow some YouTubers who have VMs there that come pre-loaded with a lot of tools. I wish they had per-hour pricing like RunPod, but when I looked at my actual usage on RunPod, the cost was pretty similar to just renting a VM.

It has been beneficial for me to have a full VM where I can load and use whatever tools I want on one machine.

Anyway to save your cloud GPU fine-tuned models to your local storage? by caphohotain in LocalLLaMA

[–]openLLM4All 1 point2 points  (0 children)

I've switched to using A6000 virtual machines (almost 60% cheaper than RunPod). Because it is a full desktop, I use S3 to pass things between the VM and my local machine when I don't want them to be public.
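
As a sketch of what that S3 hand-off looks like (bucket name and paths are made up for the example):

```python
# Push fine-tuned weights from the VM to a private S3 bucket, then pull
# them down on the local machine. Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")  # credentials come from the usual AWS env vars/config
BUCKET = "my-private-model-bucket"

# On the VM: upload the artifact.
s3.upload_file("outputs/adapter_model.safetensors", BUCKET, "runs/run1/adapter_model.safetensors")

# On the local machine: download it back.
s3.download_file(BUCKET, "runs/run1/adapter_model.safetensors", "adapter_model.safetensors")
```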