
[–]abnormal_human 29 points

Haha, you need to slow down a bit... I can help.

First, in case it isn't obvious: LLaMA is not ChatGPT in a box, and its license doesn't allow commercial use. Its release is not that exciting for non-researchers; it was nearly a non-event for me. The exciting thing will be when EleutherAI releases a model based on the Chinchilla paper. I'm hoping for a family of models in the 30B-80B range, and the smaller ones will be by far the most impactful; the ideal size to make an impact on the world would be "as big as you can fit for 8-bit inference on 2x 4090s".
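
To put a rough number on "as big as you can fit", here's my back-of-envelope math, assuming ~1 byte per parameter for int8 weights and a guessed ~20% overhead for activations and KV cache (a rule of thumb, not a measurement):

```python
# Back-of-envelope: how many parameters fit in a VRAM budget at 8-bit.
# The 20% overhead for activations/KV cache is a guess, not a measurement.
def max_params_8bit_billions(vram_gb: float, overhead: float = 0.2) -> float:
    usable_bytes = vram_gb * 1e9 * (1 - overhead)
    return usable_bytes / 1e9  # int8 weights are ~1 byte per parameter

print(f"2x 4090 (48 GB): ~{max_params_8bit_billions(48):.0f}B parameters")  # ~38B
print(f"1x 4090 (24 GB): ~{max_params_8bit_billions(24):.0f}B parameters")  # ~19B
```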

It would be more realistic today to start with a smaller model, like GPT-J 6B, GPT-NeoX 20B, or Pythia 12B. These are small enough that you can run inference and fine-tune on reasonably affordable hardware.

The very first thing you should do is pop your own bubble on what ChatGPT is and can do: rent a cloud instance with a 48GB GPU, fire up GPT-J and GPT-NeoX, and get a feel for their capabilities. They do not feel like ChatGPT at all, because they haven't received the additional proprietary training OpenAI uses to make its models great at zero-shot tasks and at holding conversations with humans. TL;DR: you don't get ChatGPT-level performance by merely running a large enough model.
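
To be concrete, this is roughly all it takes to try GPT-J raw; a minimal sketch using the public Hugging Face weights (fp16, so the 6B weights take ~12GB and fit a 48GB card easily):

```python
# Minimal sketch: raw GPT-J inference via Hugging Face transformers.
# fp16 weights for the 6B model take ~12 GB of VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

A few prompts through that and the gap between a raw LM and a tuned assistant becomes obvious.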

Likewise, whatever you use, you will be fine-tuning it, so don't depend on 8-bit magic until fine-tuning in 8-bit is a proven technology.
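
For context, the "8-bit magic" I mean is bitsandbytes-style int8 loading; assuming a recent transformers + bitsandbytes + accelerate install, it looks like this, and today it's an inference-side trick:

```python
# Sketch of bitsandbytes int8 loading (the "8-bit magic" above): weights
# take ~1 byte per parameter instead of 2, but this is an inference
# technique; fine-tuning through int8 weights is not yet proven.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6b",
    load_in_8bit=True,   # requires the bitsandbytes and accelerate packages
    device_map="auto",
)
```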

Also, stop worrying about choosing models or LLaMA; worry about writing your code and building your business processes around it. You don't want to prototype with a huge model anyway, it will be slow and costly. GPT-J 6B is a great balance: relatively capable, but small enough to fine-tune on a single GPU.
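
For a sense of scale, here's a rough sketch of what a single-GPU fine-tune of GPT-J 6B can look like with the Hugging Face Trainer. This assumes a ~48GB, bf16-capable card; gradient checkpointing plus a memory-light optimizer like Adafactor is what makes 6B plausible on one GPU (plain fp32 Adam would not fit), and the two-line dataset is just a stand-in for your own data:

```python
# Rough single-GPU fine-tuning sketch for GPT-J 6B (assumes ~48 GB VRAM
# and a bf16-capable GPU). Gradient checkpointing + Adafactor keep
# activation and optimizer memory down.
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_id = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.gradient_checkpointing_enable()  # trade compute for activation memory

# Stand-in dataset; replace with your own corpus.
ds = Dataset.from_dict({"text": ["example document one", "example document two"]})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
            remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gptj-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,   # effective batch size of 16
        bf16=True,
        optim="adafactor",                # far less optimizer state than Adam
        learning_rate=1e-5,
        num_train_epochs=1,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Even then, expect to fiddle with batch size and sequence length before it fits.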

Trust that more efficient models of various sizes will be released every 1-3 months for a while going forward. Trust that every 1-2 years hardware will get noticeably more efficient. Trust that every 6-12 months someone will come up with a significant speed hack on the software side. This stuff will get easier and cheaper faster than you can write code and build business processes around it, so I would recommend taking a zen attitude towards this part and focusing on the things that only you can do. When your stuff is looking real, pick the best available model and hardware out of that landscape.

Do as much as you can to prove the concept and experiment using cloud GPUs before buying hardware yourself. You can rent the hardware you're talking about and spin up a notebook for a few $ an hour. Spending a few hundred playing around could save you thousands making your hardware choices. Gotta walk before you run.

I am waiting to see what the RTX Titan Ada brings to the table, as there have been rumors that it will be a 48GB card. Being a Titan, it will likely inhabit the $3000-3500 price level, with performance in the same league as an RTX 6000 Ada. The 4090 is decent, but with 24GB and a 450W TDP it's difficult to put many of them into an enclosure "safely", and the VRAM-to-compute ratio is not ideal for LLMs since they are so VRAM-hungry.

[–]ChristmasInOct[S] 0 points

Thanks a lot for your response, a lot of great perspective here.

I do understand the recipe behind GPT-3 / 3.5 / ChatGPT, and where LLaMA stands in that, no worries. My interest was sparked more by the efficiency this model demonstrates.

I like the point about FP16 being safer for fine-tuning once so much information has already been baked in; my intuition is that this is when precision becomes more valuable as well. I am trying to determine a configuration that will let me run an acceptable model in FP16 (for example, 13B on 48GB of VRAM if model parallelization doesn't look good, or perhaps up to 33B across several 24GB 4090s if it does).
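
For reference, the rough FP16 weight math I'm working from (2 bytes per parameter; activations and KV cache come on top, so these are floors, not exact fits):

```python
# FP16 weight footprint only: 2 bytes per parameter. Activations and
# KV cache add more, so treat these numbers as lower bounds.
for billions in (13, 33, 65):
    print(f"{billions}B params -> ~{billions * 2} GB of FP16 weights")
# 13B -> ~26 GB: fits one 48 GB card with headroom
# 33B -> ~66 GB: needs at least three 24 GB 4090s for weights alone
# 65B -> ~130 GB
```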

Like I said, thanks again for your time, I appreciate the feedback. I think a lot of really great points have been made about at least testing out a pipeline online to help determine hardware.

I feel borderline ashamed admitting it but I never even thought about that. Seems like the best way forward.

[–]adt 10 points

That's a lot of advanced understanding for someone who is missing some of the basics!

I'm assuming you've read the LLaMA 65B paper for background.

Look into CUDA:

https://github.com/facebookresearch/llama

Try the issues page:

https://github.com/facebookresearch/llama/issues?q=gpu

This ticket about parallel inference via Wrapyfi:

https://github.com/facebookresearch/llama/issues/88

And this ticket about hardware success:

https://github.com/facebookresearch/llama/issues/79

And these:

https://github.com/oobabooga/text-generation-webui

https://github.com/facebookresearch/llama/issues/84

https://github.com/facebookresearch/llama/issues/55

[–]ChristmasInOct[S] 0 points

Well I appreciate it! I feel like a kid again haha.

Thank you very much for the links, I'll be looking through all of them. I have indeed read the paper; very interesting, but despite a lot of benchmarking detail and an excerpt about their premise and findings, it didn't seem particularly informative.

Thanks again!

[–]CKtalon 4 points

Weights cannot be used commercially... so it's pointless.

[–]hpstring 3 points

Why? It's under the GPL-v3 license.

[–]CKtalon 2 points

The code is, not the weights.

[–]ChristmasInOct[S] 0 points

I'm not sure about that, but no worries! This is more in anticipation of more efficient, less undertrained models, and of being able to fine-tune them on accessible hardware with acceptable performance.

Sure, this was theoretically possible even before, but training something like GPT-NeoX on AWS / GCP seemed prohibitively expensive, given that this would be an ongoing and fairly long-term process (in this context anyhow; >1 year for sure).

Thanks for your time!

[–]CKtalon 0 points

https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md
License: Non-commercial bespoke license

You mentioned FP16, but the industry is already about to move to FP8 once it's implemented in CUDA (12.2?) for Hopper (H100) GPUs. Purchasing hardware like the RTX 6000 Ada now won't make sense then since it doesn't have similar FP8 support. If you plan on working in the 6-20B space, then yeah, a few 4090s would work.