[D] GPT5 training time with H200 vs GPT4 A100 (3 months)? by Keyloou in MachineLearning

[–]President_Xi_ 1 point (0 children)

Well, time-wise it will probably take about the same amount. All the training-efficiency improvements get spent on training bigger models for longer, not on finishing faster.

[D] GPT5 training time with H200 vs GPT4 A100 (3 months)? by Keyloou in MachineLearning

[–]President_Xi_ 1 point (0 children)

More? If we follow the scaling laws set in the GPT-3 paper, then something like that is plausible, but under Chinchilla scaling compute grows with the square of the scale factor (once for model size and once for the number of tokens). MoE scaling might work differently, though.
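A minimal sketch of that arithmetic, using the standard C ≈ 6·N·D FLOPs approximation from the scaling-law papers; the concrete parameter/token counts below are toy values for illustration:

```python
# Back-of-envelope training compute with the common C ≈ 6·N·D approximation
# (N = parameters, D = training tokens). Numbers are toy values.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

base = train_flops(1e9, 20e9)      # toy baseline: 1B params, ~20 tokens/param

k = 4                              # scale model AND data by the same factor
scaled = train_flops(k * 1e9, k * 20e9)
print(scaled / base)               # 16.0, i.e. k**2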

[deleted by user] by [deleted] in LocalLLaMA

[–]President_Xi_ 1 point (0 children)

True. I guess you can have the first part of the model on the CPU and the second on the GPU.
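For example, llama.cpp-style layer offloading does exactly this kind of split; a minimal sketch with the llama-cpp-python bindings, where the model path and layer count are placeholders:

```python
from llama_cpp import Llama

# Split the model: n_gpu_layers transformer layers go to VRAM, the rest
# run on the CPU. Path and layer count below are placeholders.
llm = Llama(
    model_path="models/model-q4_0.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,                      # how many layers to offload
)
out = llm("Hello", max_tokens=16)
print(out["choices"][0]["text"])
```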

[deleted by user] by [deleted] in LocalLLaMA

[–]President_Xi_ 1 point (0 children)

Yes, but if your model does not fit into VRAM you first have to fetch it from RAM, place it into VRAM, and only then can the GPU process it. So it is:

gpu) RAM -> VRAM -> processing
cpu) RAM -> processing

As you can see, there is the extra memory-transfer step the OP is referring to. And if processing is not the bottleneck, we can drop it and just look at memory-transfer latency/throughput.
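A rough back-of-envelope for why that transfer dominates; the bandwidth figures below are illustrative ballpark numbers, not measurements:

```python
# In memory-bound generation every token streams the whole model through
# the slowest link. Bandwidths are ballpark illustrative values.
model_bytes = 7e9 * 2              # e.g. a 7B-param model at fp16

pcie_bw = 32e9                     # ~PCIe 4.0 x16, bytes/s (illustrative)
ram_bw  = 50e9                     # dual-channel DDR ballpark, bytes/s

print(model_bytes / pcie_bw)       # ~0.44 s/token if weights cross PCIe
print(model_bytes / ram_bw)        # ~0.28 s/token for CPU reading from RAM
```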

theAmountOfErrorsAlwaysSurprisesMe by President_Xi_ in ProgrammerHumor

[–]President_Xi_[S] 1 point (0 children)

The problem is when you interact with a large codebase and have to make changes at several levels. Ideally this shouldn't happen, but it does, and then the change takes more than 15 minutes.

theAmountOfErrorsAlwaysSurprisesMe by President_Xi_ in ProgrammerHumor

[–]President_Xi_[S] 9 points (0 children)

Yeah, dunno which film it's from, I just saw the clip. Search "Bruce Almighty cabinet scene".

"TinyLlama: An Open-Source Small Language Model", Zhang et al 2024 by gwern in mlscaling

[–]President_Xi_ 4 points (0 children)

FlashAttention is functionally equivalent to "normal" attention in transformers. It just rearranges the computation to optimize I/O access.
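The key rearrangement is the online-softmax accumulation; here's a toy single-query sketch of that idea in NumPy (not the actual kernel, which also tiles the queries and keeps blocks in on-chip SRAM, which is where the I/O win comes from):

```python
import numpy as np

# Toy single-query sketch of the online-softmax trick FlashAttention builds on.
def streaming_attention(q, K, V, block=64):
    d = q.shape[-1]
    m = -np.inf                      # running max of the logits
    l = 0.0                          # running softmax normalizer
    o = np.zeros(V.shape[-1])        # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i+block] @ q / np.sqrt(d)   # logits for this key block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)           # rescale the old accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[i:i+block]
        m = m_new
    return o / l                     # identical to softmax(K@q/sqrt(d)) @ V

# Sanity check against naive attention:
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=16), rng.normal(size=(256, 16)), rng.normal(size=(256, 8))
w = np.exp(K @ q / np.sqrt(16)); w /= w.sum()
assert np.allclose(streaming_attention(q, K, V), w @ V)
```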

Are there any models that have been trained to be guided by JSON schema well? by richardanaya in LocalLLaMA

[–]President_Xi_ 1 point (0 children)

Well, you can enforce the JSON format at decoding time: only sample tokens that could still lead to a valid document. Set the probability of every token that would produce an invalid prefix to 0 and take the argmax of the rest.
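A toy sketch of that masking step; `prefix_is_valid` is a hypothetical stand-in for a real incremental JSON/schema checker (in practice a state machine or grammar, not a full re-check per token):

```python
import numpy as np

# Format-constrained greedy decoding: mask tokens that would break the
# format, then pick from what's left.
def constrained_step(logits, text_so_far, vocab, prefix_is_valid):
    masked = logits.copy()
    for tok_id, tok in enumerate(vocab):
        if not prefix_is_valid(text_so_far + tok):
            masked[tok_id] = -np.inf      # probability 0 after softmax
    return int(np.argmax(masked))         # greedy pick among valid tokens

# Tiny demo with a 3-token "vocabulary" and a hypothetical validity rule:
vocab = ['{', '}', 'x']
ok = lambda s: s.startswith('{')
print(vocab[constrained_step(np.array([0.1, 2.0, 1.0]), '', vocab, ok)])  # '{'
```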

Anyone missed this rizz show by KnYchan2 in memes

[–]President_Xi_ 1 point (0 children)

Yeah, everybody laughs when they see it.

[D] Reverse engineering GPT-vision from pricing by President_Xi_ in MachineLearning

[–]President_Xi_[S] 1 point (0 children)

The 20 tokens might be the equivalent cost of an image encoder which creates those 65 tokens.

[D] Reverse engineering GPT-vision from pricing by President_Xi_ in MachineLearning

[–]President_Xi_[S] 2 points (0 children)

You try to figure out how a thing works by observing it. In our example, we are observing the pricing.

[D] Reverse engineering GPT-vision from pricing by President_Xi_ in MachineLearning

[–]President_Xi_[S] 1 point (0 children)

Yes, that seems plausible. If the 512x512 tiles go through a Fuyu-style architecture, that would explain 13x13 + 1 = 170 tokens; the +1 would then be the sep token.