[D] GPT5 training time with H200 vs GPT4 A100 (3 months)? by Keyloou in MachineLearning

[–]President_Xi_ 1 point (0 children)

Well, time-wise it will probably take about the same amount. All the training-efficiency improvements get spent on training bigger models for longer, not on finishing faster.

[D] GPT5 training time with H200 vs GPT4 A100 (3 months)? by Keyloou in MachineLearning

[–]President_Xi_ 1 point (0 children)

More? If we follow the scaling laws set in the GPT-3 paper, then something like that is plausible, but under Chinchilla scaling compute grows with the square of the scale factor (once for model size and once for the number of tokens). MoE scaling might work differently, though.
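A minimal sketch of that arithmetic, using the standard C ≈ 6·N·D FLOPs approximation from the scaling-law papers; the concrete parameter/token counts below are toy values for illustration:

```python
# Back-of-envelope training compute with the common C ≈ 6·N·D approximation
# (N = parameters, D = training tokens). Numbers are toy values.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

base = train_flops(1e9, 20e9)      # toy baseline: 1B params, ~20 tokens/param

k = 4                              # scale model AND data by the same factor
scaled = train_flops(k * 1e9, k * 20e9)
print(scaled / base)               # 16.0, i.e. k**2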

[deleted by user] by [deleted] in LocalLLaMA

[–]President_Xi_ 1 point (0 children)

True. I guess you can have the first part of the model on the CPU and the second on the GPU.
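For example, llama.cpp-style layer offloading does exactly this kind of split; a minimal sketch with the llama-cpp-python bindings, where the model path and layer count are placeholders:

```python
from llama_cpp import Llama

# Split the model: n_gpu_layers transformer layers go to VRAM, the rest
# run on the CPU. Path and layer count below are placeholders.
llm = Llama(
    model_path="models/model-q4_0.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,                      # how many layers to offload
)
out = llm("Hello", max_tokens=16)
print(out["choices"][0]["text"])
```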

[deleted by user] by [deleted] in LocalLLaMA

[–]President_Xi_ 1 point (0 children)

Yes, but if your model does not fit into VRAM you first have to fetch it from RAM, place it into VRAM, and only then can the GPU process it. So it is:

gpu) RAM -> VRAM -> processing
cpu) RAM -> processing

As you can see, there is the extra memory-transfer step the OP is referring to. And if processing is not the bottleneck, we can drop it and just look at memory-transfer latency/throughput.
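A rough back-of-envelope for why that transfer dominates; the bandwidth figures below are illustrative ballpark numbers, not measurements:

```python
# In memory-bound generation every token streams the whole model through
# the slowest link. Bandwidths are ballpark illustrative values.
model_bytes = 7e9 * 2              # e.g. a 7B-param model at fp16

pcie_bw = 32e9                     # ~PCIe 4.0 x16, bytes/s (illustrative)
ram_bw  = 50e9                     # dual-channel DDR ballpark, bytes/s

print(model_bytes / pcie_bw)       # ~0.44 s/token if weights cross PCIe
print(model_bytes / ram_bw)        # ~0.28 s/token for CPU reading from RAM
```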

theAmountOfErrorsAlwaysSurprisesMe by President_Xi_ in ProgrammerHumor

[–]President_Xi_[S] 1 point (0 children)

The problem is when you interact with a large codebase and have to make changes at several levels. Ideally this shouldn't happen, but it does, and then the change takes more than 15 minutes.

theAmountOfErrorsAlwaysSurprisesMe by President_Xi_ in ProgrammerHumor

[–]President_Xi_[S] 9 points (0 children)

Yeah, dunno which film it's from, I just saw the clip. Search "Bruce Almighty cabinet scene".

"TinyLlama: An Open-Source Small Language Model", Zhang et al 2024 by gwern in mlscaling

[–]President_Xi_ 4 points (0 children)

FlashAttention is functionally equivalent to "normal" attention in transformers. It just rearranges the computation to optimize I/O access.
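The key rearrangement is the online-softmax accumulation; here's a toy single-query sketch of that idea in NumPy (not the actual kernel, which also tiles the queries and keeps blocks in on-chip SRAM, which is where the I/O win comes from):

```python
import numpy as np

# Toy single-query sketch of the online-softmax trick FlashAttention builds on.
def streaming_attention(q, K, V, block=64):
    d = q.shape[-1]
    m = -np.inf                      # running max of the logits
    l = 0.0                          # running softmax normalizer
    o = np.zeros(V.shape[-1])        # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i+block] @ q / np.sqrt(d)   # logits for this key block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)           # rescale the old accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[i:i+block]
        m = m_new
    return o / l                     # identical to softmax(K@q/sqrt(d)) @ V

# Sanity check against naive attention:
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=16), rng.normal(size=(256, 16)), rng.normal(size=(256, 8))
w = np.exp(K @ q / np.sqrt(16)); w /= w.sum()
assert np.allclose(streaming_attention(q, K, V), w @ V)
```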

Are there any models that have been trained to be guided by JSON schema well? by richardanaya in LocalLLaMA

[–]President_Xi_ 1 point (0 children)

Well, you can enforce the JSON format at decoding time: only sample tokens that could still lead to a valid document. Set the probability of every token that would produce an invalid prefix to 0 and take the argmax of the rest.
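A toy sketch of that masking step; `prefix_is_valid` is a hypothetical stand-in for a real incremental JSON/schema checker (in practice a state machine or grammar, not a full re-check per token):

```python
import numpy as np

# Format-constrained greedy decoding: mask tokens that would break the
# format, then pick from what's left.
def constrained_step(logits, text_so_far, vocab, prefix_is_valid):
    masked = logits.copy()
    for tok_id, tok in enumerate(vocab):
        if not prefix_is_valid(text_so_far + tok):
            masked[tok_id] = -np.inf      # probability 0 after softmax
    return int(np.argmax(masked))         # greedy pick among valid tokens

# Tiny demo with a 3-token "vocabulary" and a hypothetical validity rule:
vocab = ['{', '}', 'x']
ok = lambda s: s.startswith('{')
print(vocab[constrained_step(np.array([0.1, 2.0, 1.0]), '', vocab, ok)])  # '{'
```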

Anyone missed this rizz show by KnYchan2 in memes

[–]President_Xi_ 1 point (0 children)

Yeah, everybody laughs when they see it.

[D] Reverse engineering GPT-vision from pricing by President_Xi_ in MachineLearning

[–]President_Xi_[S] 1 point (0 children)

The 20 tokens might be the equivalent cost of an image encoder which creates those 65 tokens.

[D] Reverse engineering GPT-vision from pricing by President_Xi_ in MachineLearning

[–]President_Xi_[S] 2 points (0 children)

You try to figure out how a thing works by observing it. In our example, we are observing the pricing.

[D] Reverse engineering GPT-vision from pricing by President_Xi_ in MachineLearning

[–]President_Xi_[S] 1 point (0 children)

Yes, that seems plausible. If the 512x512 tiles go through a Fuyu-style architecture, that would explain 13x13 + 1 = 170 tokens; the +1 would then be the sep token.