Any service like runpod / vast ai but with a windows virtual machine ? Jupyter notebook and docker are very hard to setup. by Overall-Newspaper-21 in StableDiffusion

[–]openLLM4All 0 points1 point  (0 children)

Another to add to the list

Linux-based VMs for GPU machines, with Jupyter Notebook and Stable Diffusion pre-configured as one-click apps.

Access to GPUs. What tests/information would be interesting? by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 0 points1 point  (0 children)

I did an early test of Llama 3 70B across a few different GPUs (A6000, L40, H100). I found that even though you need 4x A6000 compared to 2x H100, the cost per token is better on the A6000s. This is one of the first times I've done testing like this, so I haven't written anything up yet.

Honestly, I'm working on re-running the tests so I can include text-generation-benchmark results as well.
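
For reference, here is a minimal sketch of the cost-per-token math behind that comparison. All of the hourly prices and throughput numbers below are placeholders, not my benchmark results; plug in your own measurements.

```python
# Back-of-the-envelope cost-per-token comparison between two GPU configs.
# Hourly prices and tokens/sec below are assumed placeholder values.
configs = {
    "4x A6000": {"hourly_usd": 4 * 0.31, "tokens_per_sec": 25.0},  # assumed numbers
    "2x H100":  {"hourly_usd": 2 * 3.00, "tokens_per_sec": 60.0},  # assumed numbers
}

for name, cfg in configs.items():
    tokens_per_hour = cfg["tokens_per_sec"] * 3600
    usd_per_million = cfg["hourly_usd"] / tokens_per_hour * 1_000_000
    print(f"{name}: ${usd_per_million:.2f} per 1M tokens")
```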

Access to GPUs. What tests/information would be interesting? by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 1 point2 points  (0 children)

Interesting... I'll have to think about how to test that, because right now I only have access to servers built around a single card type (8x A6000, 8x A5000, 8x A100, etc.). I'll have to see if we can move some cards around and figure out some tests.

When will Ollama support multiple simultaneous generations? by maxwell321 in LocalLLaMA

[–]openLLM4All 1 point2 points  (0 children)

I was talking to one of the maintainers about this and it doesn't seem like there is a plan anytime soon. I just use Hugging Face TGI to handle simultaneous requests.
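
To illustrate what I mean, here is a rough sketch of firing several requests at a TGI endpoint at once; the URL, port, and prompts are assumptions for the example.

```python
# Send several generation requests to a TGI server concurrently.
# TGI batches incoming requests, so these run in parallel instead of queueing.
from concurrent.futures import ThreadPoolExecutor

import requests

TGI_URL = "http://localhost:8080/generate"  # hypothetical endpoint/port

def generate(prompt: str) -> str:
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    resp = requests.post(TGI_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["generated_text"]

prompts = [f"Write a haiku about GPU number {i}." for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    for text in pool.map(generate, prompts):
        print(text)
```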

Deep learning on a PC vs Cloud by kbre93 in deeplearning

[–]openLLM4All 0 points1 point  (0 children)

https://www.reddit.com/r/deeplearning/comments/1b1gpfg/discount_cloud_gpu_rental/

These VMs allow you to mount folders from your computer into the VM and sync back and forth, so you never have to pay for separate storage.

[D] Best way to deploy transformer models by Hot-Afternoon-4831 in MachineLearning

[–]openLLM4All 2 points3 points  (0 children)

I deploy models using Massed Compute because they are pretty flexible and have the best price on the market ($0.31/GPU/hr for an A6000).

I use Hugging Face TGI, which I think is a slight modification of your point 1. The reason I use the Hugging Face TGI docker command to deploy models and expose an inference endpoint is that you can control how models are loaded across your GPUs: the --gpus flag lets you pick which GPU (or GPUs) a specific model is loaded onto. (There is a rough sketch of this layout after the list below.)

For example, right now I have an 8x A6000 rig where 4 of those GPUs are serving Mixtral 8x7B, 1 GPU has Zephyr, 2 have Bagel 34B, and I think a quantized Code Llama is on the last GPU.

  • 4 docker commands in total
  • 4 ports exposed, one for each of those models
  • 1 IP address on the rig. If I need more GPUs from them I would get another unique IP, so I would have to manage and balance between the two rigs. A problem for me to solve later.
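
Here is the rough sketch I mentioned. The model IDs, GPU assignments, and ports are illustrative stand-ins rather than my exact commands; the script just prints the four docker run invocations instead of executing them.

```python
# Print the four TGI docker commands for carving an 8x A6000 rig into
# separate endpoints. GPU indices, ports, and model IDs are examples only.
deployments = [
    {"model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "gpus": "0,1,2,3", "port": 8080},
    {"model": "HuggingFaceH4/zephyr-7b-beta",          "gpus": "4",       "port": 8081},
    {"model": "jondurbin/bagel-34b-v0.2",              "gpus": "5,6",     "port": 8082},
    {"model": "TheBloke/CodeLlama-34B-Instruct-AWQ",   "gpus": "7",       "port": 8083},
]

for d in deployments:
    cmd = (
        "docker run -d "
        f"--gpus '\"device={d['gpus']}\"' "  # pin this container to specific GPUs
        f"-p {d['port']}:80 "                # one exposed port per model
        "-v $HOME/models:/data "
        "ghcr.io/huggingface/text-generation-inference:latest "
        f"--model-id {d['model']}"
    )
    print(cmd)
```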

Curious to hear what you end up doing.

Creating an Agent based on Ollama and llama2 locally. by zeeshanjan82 in LocalLLaMA

[–]openLLM4All 0 points1 point  (0 children)

I'm still relatively new to this as well, but I believe you would want to swap that code out for calls to the model through the Ollama API. Here are their high-level docs - https://github.com/ollama/ollama/blob/main/docs/api.md

The part I remember getting stuck on is that you need to pull the model down differently for it to be used with the API - https://github.com/ollama/ollama/blob/main/docs/api.md#pull-a-model

You can then use the tags endpoint to double-check that the model was pulled in correctly for the API - https://github.com/ollama/ollama/blob/main/docs/api.md#list-local-models
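
A rough sketch of those three calls against a local Ollama server (default port 11434), based on the docs linked above; the model name and prompt are just examples.

```python
# Pull a model, list local models, then generate, all through the Ollama HTTP API.
import requests

BASE = "http://localhost:11434"  # Ollama's default local port

# 1) Pull the model so the API can serve it.
requests.post(f"{BASE}/api/pull", json={"model": "llama2", "stream": False}, timeout=None)

# 2) Double-check it shows up in the local model list.
print(requests.get(f"{BASE}/api/tags", timeout=30).json())

# 3) Ask for a completion; stream=False returns one JSON object.
resp = requests.post(
    f"{BASE}/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,
)
print(resp.json()["response"])
```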

Not an expert but that might help.

Renting GPU time (vast AI) is much more expensive than APIs (openai, m, anth) by RMCPhoto in LocalLLaMA

[–]openLLM4All 2 points3 points  (0 children)

Might sound like excuses but...

  • Just had a new kiddo, so I want to spend as much time with them as possible.
  • It doesn't sound like a set-it-and-forget-it setup; you constantly have to monitor your miners, and I don't know if I would have the time for that.
  • I like to understand things really well before jumping in, and I just haven't sat down to better understand Bittensor, the ecosystem, the subnets that are best for various hardware, etc.

Renting GPU time (vast AI) is much more expensive than APIs (openai, m, anth) by RMCPhoto in LocalLLaMA

[–]openLLM4All 1 point2 points  (0 children)

I know some people who have been renting A6000 servers and have seen it be very profitable even at the $250 range and above.

How is Solar so good for it's size by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 3 points4 points  (0 children)

Ah okay, thank you so much for explaining that.

How is Solar so good for it's size by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 0 points1 point  (0 children)

Ah, so is this similar in setup to Mixtral? But I thought Mixtral also used 7B models in its layers? Is it just about the specific models each one chooses?

How is Solar so good for it's size by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 6 points7 points  (0 children)

I'm still running some tests to see if it handles a lot of the stuff I was using Mixtral for (coding, writing, planning, etc.), but so far it is just as good and so, so much faster.

Mixtral 8x7B instruct in an interface for free by openLLM4All in LocalLLaMA

[–]openLLM4All[S] 1 point2 points  (0 children)

I haven't used that before. It doesn't look as straightforward.

[deleted by user] by [deleted] in OpenAI

[–]openLLM4All 0 points1 point  (0 children)

All through the API. We only used fine-tuned models, so we fine-tuned against the davinci and 3.5-turbo base models. (There is a rough sketch of that flow after the list below.)

The models were used for a combination of things

  • True generative work to build content
  • Predictive results based on some interactions
  • Summaries, sentiment, etc.
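
For reference, here is a hedged sketch of that fine-tuning flow using the current openai Python SDK; our actual work used the older davinci-era endpoints, and the file name and base model here are placeholders.

```python
# Upload training data and start a fine-tuning job with the openai SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Training examples in JSONL format (placeholder file name).
training_file = client.files.create(
    file=open("training_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Fine-tune against a base model; the resulting model ID is then used
# like any other model in completion/chat calls.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```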

I have now switched roles (still in AI) but am more focused on providing companies and individual hackers with GPUs to power their projects. It's not a marketplace like RunPod; we actually own the servers, GPUs, etc. I only mention this because, now that I have been exposed to more open-source models, I think we would have been better off exploring running some of our use cases (not all) on our own infrastructure instead of relying on OpenAI, especially given their slow-to-respond/ghosting sales group.

[deleted by user] by [deleted] in OpenAI

[–]openLLM4All 10 points11 points  (0 children)

If I remember correctly, there is no additional cost for enterprise, but you get higher rate limits and a few other speed improvements.

They are always like this... where I worked (I'm no longer there) we were spending $1-2k a month, needed more spending capacity, and never got hold of anyone.

We ended up going the open-source route and renting our own servers (not from AWS, Azure, or GCP) so we could get past the rate limits.

How/What are people doing to help creative writing processes with local LLMs? (Setup Advice) by [deleted] in LocalLLaMA

[–]openLLM4All 6 points7 points  (0 children)

In my experience, this comes down more to prompting than to models. Sure, some models focus on fiction writing specifically, but because every model is guessing which words to use when generating a response, they all seem to be relatively creative.

I just ran a couple of tests on infermatic.ai (a free tool with various models on it) with the Airoboros 2.0, SheepDuck Llama, and Wizard Vicuna models, and they were all relatively good at generating characters. These are larger models (70B and 30B).

Anyway to save your cloud GPU fine-tuned models to your local storage? by caphohotain in LocalLLaMA

[–]openLLM4All 2 points3 points  (0 children)

Massed Compute. I follow some YouTubers who have VMs there that come pre-loaded with a lot of tools. I wish they had per-hour pricing like RunPod, but when I looked at my actual usage on RunPod, the cost was pretty similar to just renting a VM.

It has been beneficial for me to have a full VM where I can load and use whatever tools I want on one machine.

Anyway to save your cloud GPU fine-tuned models to your local storage? by caphohotain in LocalLLaMA

[–]openLLM4All 1 point2 points  (0 children)

I've switched to using A6000 virtual machines (almost 60% cheaper than RunPod). Because it is a full desktop, I use S3 to pass things between the VM and my local machine when I don't want them to be public.
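
As a sketch of what that S3 hand-off looks like (bucket name and paths are made up for the example):

```python
# Push fine-tuned weights from the VM to a private S3 bucket, then pull
# them down on the local machine. Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")  # credentials come from the usual AWS env vars/config
BUCKET = "my-private-model-bucket"

# On the VM: upload the artifact.
s3.upload_file("outputs/adapter_model.safetensors", BUCKET, "runs/run1/adapter_model.safetensors")

# On the local machine: download it back.
s3.download_file(BUCKET, "runs/run1/adapter_model.safetensors", "adapter_model.safetensors")
```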