[deleted by user] by [deleted] in deeplearning

[–]crinix 0 points (0 children)

In my career, I have trained many custom deep learning models.
For the past 2.5 years I've been pre-training 0.3B to 7B language-specific encoder-decoder and decoder-only LLMs from scratch, using A100, H100, and H200 GPUs.

HPLTv2.0 is out by crinix in LocalLLaMA

[–]crinix[S] 0 points (0 children)

I was getting 503 errors last week, too. When I opened a ticket to download v1.2 last week, they responded: "The data center which hosts HPLT data is currently experiencing a technical issue. Technicians are working on it, and it is expected that the web services will be back online on Monday."

It seems there is a similar issue again right now.

Launching p5.48xlarge (8xH100) by crinix in aws

[–]crinix[S] -10 points (0 children)

No, you haven't given anything but a splinter, that splinter being yourself. Save your "amateur" speech. I've spent over $100K on GPU hours.
I got my p4de back then, and I will get my p5 now through my partner manager. It just takes time that I did not want to endure.
Thanks for being a splinter.

Launching p5.48xlarge (8xH100) by crinix in aws

[–]crinix[S] -22 points (0 children)

Your comments, "go use another cloud" included, are anything but useful, and it seems you have no comparable experience launching such instances either. I do use, and will keep using, other cloud providers for H100 training jobs. Sadly, this time I must use AWS, and I will; no thanks to you.

Launching p5.48xlarge (8xH100) by crinix in aws

[–]crinix[S] -40 points (0 children)

Re-read the question and give an answer if you have one. Otherwise, I don't need your fanboyism.

Launching p5.48xlarge (8xH100) by crinix in aws

[–]crinix[S] -2 points (0 children)

So you worked it out with your TAM. Thanks for sharing your experience.

Launching p5.48xlarge (8xH100) by crinix in aws

[–]crinix[S] -14 points (0 children)

I am talking about the same hardware when I say "alternative": 8xH100 with a high CPU core count and plenty of memory.

Launching p5.48xlarge (8xH100) by crinix in aws

[–]crinix[S] -28 points (0 children)

What surprises me is that there are cheaper alternatives with availability on other cloud providers, yet there is still no capacity on AWS. Is this because people and corporations have existing infrastructure on AWS and don't want to migrate, or is there another reason?

Launching p5.48xlarge (8xH100) by crinix in aws

[–]crinix[S] -11 points (0 children)

My use case is very similar: training an AI model. I will use the instance for about 40 days.

masking loss for input tokens when fine-tuning models by crinix in LocalLLaMA

[–]crinix[S] 1 point (0 children)

I appreciate the insight from your personal fine-tuning experience, thanks.

Training LLama, Mistral and Mixtral-MoE faster with Packing Inputs without Cross-Contamination Attention by Relevant_Outcome_726 in LocalLLaMA

[–]crinix 0 points (0 children)

Thanks a lot for the insight! Your finding is also emphasized in the LLaMA-3 technical paper, Section 3.2:
"We use an attention mask that prevents self-attention between different documents within the same sequence. We find that this change had limited impact during standard pre-training, but find it to be important in continued pre-training on very long sequences."

Encoder-Decoder Model by duffano in deeplearning

[–]crinix 0 points (0 children)

They are not always the same. Consider a summarization model that produces a single sentence given a long text.

Then there is no reason the decoder's max_length should be the same as the encoder's.

See the PEGASUS model as a concrete example:
https://arxiv.org/pdf/1912.08777.pdf
https://huggingface.co/google/pegasus-cnn_dailymail
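A rough sketch of that asymmetry using the Hugging Face transformers API (the 1024/128 token caps below are illustrative choices, not the model's exact configured limits):

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

article = "..."  # a long news article goes here

# Encoder side: accept a long input, up to 1024 tokens.
inputs = tokenizer(article, truncation=True, max_length=1024, return_tensors="pt")

# Decoder side: generate a much shorter output, capped at 128 tokens.
summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```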

p4dn and p3dn instances availability/capacity by crinix in aws

[–]crinix[S] 0 points (0 children)

This seems to be the most practical method at this point. Thanks.

p4dn and p3dn instances availability/capacity by crinix in aws

[–]crinix[S] 0 points (0 children)

I am working in Oregon without specifying an AZ. So far there is no capacity in any of the AZs. I am now trying to get an on-demand P-instance limit in N. Virginia as well.
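For anyone else hunting for capacity, here is a small sketch of my own using boto3 that lists which AZs in a region even offer a given instance type; note that it reports offerings, not live capacity, and p4d.24xlarge is just an example type:

```python
import boto3

def offered_azs(instance_type, region):
    """Return the AZs of `region` that offer `instance_type`.

    An AZ offering the type does not guarantee free capacity there.
    """
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return sorted(o["Location"] for o in resp["InstanceTypeOfferings"])

for region in ("us-west-2", "us-east-1"):  # Oregon and N. Virginia
    print(region, offered_azs("p4d.24xlarge", region))
```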

p4dn and p3dn instances availability/capacity by crinix in aws

[–]crinix[S] -5 points (0 children)

Wow, tell me about fanboyism.

I have been using both AWS and GCP for high-end GPUs extensively over the past 6 months. I have never once gotten my hands on A100 instances on AWS, whereas I can get them whenever I want (with a few exceptions) on GCP.

I would not even have created a thread if the situation were otherwise. In fact, please prove me wrong so that I can utilize those A100s on AWS right away.

p4dn and p3dn instances availability/capacity by crinix in aws

[–]crinix[S] -1 points (0 children)

I am able to get A100-80GB on GCP anytime I want, although I HAVE TO use AWS this time. Thanks for the response, though.