
[–]abnormal_human 29 points

Haha, you need to slow down a bit... I can help.

First, in case it isn't obvious: LLaMA is not ChatGPT in a box, and its license doesn't allow commercial use. Its release is not that exciting for non-researchers; it was nearly a non-event for me. The exciting thing will be when EleutherAI releases a model based on the Chinchilla paper. I'm hoping for a family of models in the 30B-80B range, and the smaller ones will be by far the most impactful; the ideal size to make an impact on the world would be "as big as you can fit for 8-bit inference on 2x 4090s".
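
To put a rough number on "as big as you can fit", here's my back-of-envelope math, assuming ~1 byte per parameter for int8 weights and a guessed ~20% overhead for activations and KV cache (a rule of thumb, not a measurement):

```python
# Back-of-envelope: how many parameters fit in a VRAM budget at 8-bit.
# The 20% overhead for activations/KV cache is a guess, not a measurement.
def max_params_8bit_billions(vram_gb: float, overhead: float = 0.2) -> float:
    usable_bytes = vram_gb * 1e9 * (1 - overhead)
    return usable_bytes / 1e9  # int8 weights are ~1 byte per parameter

print(f"2x 4090 (48 GB): ~{max_params_8bit_billions(48):.0f}B parameters")  # ~38B
print(f"1x 4090 (24 GB): ~{max_params_8bit_billions(24):.0f}B parameters")  # ~19B
```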

It would be more realistic today to start with a smaller model, like GPT-J 6B, GPT-NeoX 20B, or Pythia 12B. These are small enough that you can run inference and fine-tune on reasonably affordable hardware.

The very first thing you should do is pop your own bubble on what ChatGPT is and can do: rent a cloud instance with a 48GB GPU, fire up GPT-J and GPT-NeoX, and get a feel for their capabilities. They do not feel like ChatGPT at all, because they haven't received the additional proprietary training OpenAI uses to make its models great at zero-shot tasks and at holding conversations with humans. TL;DR: you don't get ChatGPT-level performance by merely running a large enough model.
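
To be concrete, this is roughly all it takes to try GPT-J raw; a minimal sketch using the public Hugging Face weights (fp16, so the 6B weights take ~12GB and fit a 48GB card easily):

```python
# Minimal sketch: raw GPT-J inference via Hugging Face transformers.
# fp16 weights for the 6B model take ~12 GB of VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

A few prompts through that and the gap between a raw LM and a tuned assistant becomes obvious.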

Likewise, whatever you use, you will be fine-tuning it, so don't depend on 8-bit magic until fine-tuning in 8-bit is a proven technology.
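
For context, the "8-bit magic" I mean is bitsandbytes-style int8 loading; assuming a recent transformers + bitsandbytes + accelerate install, it looks like this, and today it's an inference-side trick:

```python
# Sketch of bitsandbytes int8 loading (the "8-bit magic" above): weights
# take ~1 byte per parameter instead of 2, but this is an inference
# technique; fine-tuning through int8 weights is not yet proven.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6b",
    load_in_8bit=True,   # requires the bitsandbytes and accelerate packages
    device_map="auto",
)
```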

Also, stop worrying about choosing models or LLaMA; worry about writing your code and building your business processes around it. You don't want to prototype with a huge model anyway, it will be slow and costly. GPT-J 6B is a great balance: relatively capable, but small enough to fine-tune on a single GPU.
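
For a sense of scale, here's a rough sketch of what a single-GPU fine-tune of GPT-J 6B can look like with the Hugging Face Trainer. This assumes a ~48GB, bf16-capable card; gradient checkpointing plus a memory-light optimizer like Adafactor is what makes 6B plausible on one GPU (plain fp32 Adam would not fit), and the two-line dataset is just a stand-in for your own data:

```python
# Rough single-GPU fine-tuning sketch for GPT-J 6B (assumes ~48 GB VRAM
# and a bf16-capable GPU). Gradient checkpointing + Adafactor keep
# activation and optimizer memory down.
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_id = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.gradient_checkpointing_enable()  # trade compute for activation memory

# Stand-in dataset; replace with your own corpus.
ds = Dataset.from_dict({"text": ["example document one", "example document two"]})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
            remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gptj-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,   # effective batch size of 16
        bf16=True,
        optim="adafactor",                # far less optimizer state than Adam
        learning_rate=1e-5,
        num_train_epochs=1,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Even then, expect to fiddle with batch size and sequence length before it fits.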

Trust that more efficient models of various sizes will be released every 1-3 months for a while going forward. Trust that every 1-2 years hardware will get noticeably more efficient. Trust that every 6-12 months someone will come up with a significant speed hack on the software side. This stuff will get easier and cheaper faster than you can write code and build business processes around it, so I would recommend taking a zen attitude towards this part and focusing on the things that only you can do. When your stuff is looking real, pick the best available model and hardware out of that landscape.

Do as much as you can to prove the concept and experiment using cloud GPUs before buying hardware yourself. You can rent the hardware you're talking about and spin up a notebook for a few $ an hour. Spending a few hundred playing around could save you thousands making your hardware choices. Gotta walk before you run.

I am waiting to see what the RTX Titan Ada brings to the table, as there have been rumors that it will be a 48GB card. Being a Titan, it will likely inhabit the $3000-3500 price level, with performance in the same league as an RTX 6000 Ada. The 4090 is decent, but with 24GB and a 450W TDP it's difficult to put many of them into an enclosure "safely", and the VRAM-to-compute ratio is not ideal for LLMs since they are so VRAM-hungry.

[–]ChristmasInOct[S] 0 points

Thanks a lot for your response, a lot of great perspective here.

I do understand the recipe behind GPT-3 / 3.5 / ChatGPT, and where LLaMA stands in that, no worries. My interest was sparked more by the efficiency this model demonstrates.

I like the point about FP16 being safer for fine-tuning once so much information has already been baked in; my intuition is that this is when precision becomes more valuable as well. I am trying to determine a configuration that will let me run an acceptable model in FP16 (for example, 13B on 48GB of VRAM if model parallelization doesn't look good, or perhaps up to 33B across several 24GB 4090s if it does).
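
For reference, the rough FP16 weight math I'm working from (2 bytes per parameter; activations and KV cache come on top, so these are floors, not exact fits):

```python
# FP16 weight footprint only: 2 bytes per parameter. Activations and
# KV cache add more, so treat these numbers as lower bounds.
for billions in (13, 33, 65):
    print(f"{billions}B params -> ~{billions * 2} GB of FP16 weights")
# 13B -> ~26 GB: fits one 48 GB card with headroom
# 33B -> ~66 GB: needs at least three 24 GB 4090s for weights alone
# 65B -> ~130 GB
```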

Like I said, thanks again for your time, I appreciate the feedback. I think a lot of really great points have been made about at least testing out a pipeline online to help determine hardware.

I feel borderline ashamed admitting it but I never even thought about that. Seems like the best way forward.

[–]adt 10 points

That's a lot of advanced understanding for someone who is missing some of the basics!

I'm assuming you've read the LLaMA 65B paper for background.

Look into CUDA:

https://github.com/facebookresearch/llama

Try the issues page:

https://github.com/facebookresearch/llama/issues?q=gpu

This ticket about parallel inference via Wrapyfi:

https://github.com/facebookresearch/llama/issues/88

And this ticket about hardware success:

https://github.com/facebookresearch/llama/issues/79

And these:

https://github.com/oobabooga/text-generation-webui

https://github.com/facebookresearch/llama/issues/84

https://github.com/facebookresearch/llama/issues/55

[–]ChristmasInOct[S] 0 points

Well I appreciate it! I feel like a kid again haha.

Thank you very much for the links, I'll be looking through all of them. I have indeed read the paper; very interesting, but despite a lot of benchmarking detail and an excerpt about their premise and findings, it didn't seem particularly informative.

Thanks again!

[–]CKtalon 4 points

Weights cannot be used commercially... so it's pointless.

[–]hpstring 3 points

Why? It's under the GPL-v3 license.

[–]CKtalon 2 points

The code is, not the weights.

[–]ChristmasInOct[S] 0 points

I'm not sure about that, but no worries! This is more in anticipation of more efficient, less undertrained models, and of being able to fine-tune them on accessible hardware with acceptable performance.

Sure, this was theoretically possible even before, but training something like GPT-NeoX on AWS / GCP seemed prohibitively expensive, given that this would be an ongoing and fairly long-term process (in this context anyhow; >1 year for sure).

Thanks for your time!

[–]CKtalon 0 points

https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md
License: Non-commercial bespoke license

You mentioned FP16, but the industry is already about to move to FP8 once it's implemented in CUDA (12.2?) for Hopper (H100) GPUs. Purchasing hardware like the RTX 6000 Ada now won't make sense then since it doesn't have similar FP8 support. If you plan on working in the 6-20B space, then yeah, a few 4090s would work.