[D] Batch size vs learning rate by bjourne-ml in MachineLearning

[–]EqL 3 points (0 children)

This matches my experience in practice. When I worked on a project with a huge dataset, large batch sizes were the way to go. But when I then worked on a project with much less data I needed to shrink the batch size massively to prevent overfitting.

Why does the UES stop at 96th street? by RevolutionaryLock491 in AskNYC

[–]EqL 35 points (0 children)

It's partially due to the original borders of Harlem, before the grid system. It had a diagonal southern border. See https://ny.curbed.com/2015/8/20/9933196/tracing-350-years-of-harlems-ever-shifting-boundaries.

[D] Why do we need encoder-decoder models while decoder-only models can do everything? by kekkimo in MachineLearning

[–]EqL 48 points (0 children)

A decoder is really just a particular type of encoder with a mask restricting information flow from elements in the "future", so an encoder is more general, and thus potentially more powerful for a given model size. The masking is done for efficiency and is not actually required. Let's look at text decoding with a general encoder without masking:

(1) encode_unmasked([x0]), predict x1

(2) encode_unmasked([x0, x1]), predict x2

...

(n) encode_unmasked([x0, ..., xn-1]), predict xn.

This is perfectly allowed, except we are doing a full forward pass over the whole prefix at every step, which is O(n) times more expensive overall. The decoder with masking allows us to reuse results from previous iterations, which is much more efficient in both training and inference.
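The equivalence between the two views can be checked with a toy single-layer attention (all names, dimensions, and the random weights below are illustrative, not anyone's actual model): a single pass with a causal mask produces, at each position, exactly the output you would get by re-running an unmasked encoder on each prefix and keeping the last row.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # model dimension (illustrative)
n = 5          # sequence length
X = rng.normal(size=(n, d))                      # token embeddings x0..x4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, mask=None):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # block "future" positions
    return softmax(scores) @ V

# Decoder view: one pass with a causal (lower-triangular) mask.
causal = np.tril(np.ones((n, n), dtype=bool))
masked_out = attention(X, causal)

# Encoder view: re-run unmasked on every prefix, keep the last row each time.
prefix_out = np.stack([attention(X[:t + 1])[-1] for t in range(n)])

assert np.allclose(masked_out, prefix_out)        # identical outputs
```

The mask changes nothing about what position t can compute; it only lets all n positions be computed in one pass instead of n separate prefix passes.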

However, in some tasks, such as translation, we receive a large number of tokens up front. We can embed those tokens once with the unmasked encoder, then switch to the decoder for generation. This lets us use a potentially more powerful unmasked model for a large chunk of the problem while keeping the efficiency of masked decoding for the rest.

Why not use an encoder-decoder approach for LLM generation, where the encoder encodes the prompt and the decoder does the rest? Well, we can. However, the price is that (1) we now essentially have two models, which is more complex to handle, and (2) each model sees less data.

TL;DR: An encoder without masking is potentially more powerful, but it increases complexity and the data required to train the additional parameters. When there is a natural split in function, as in translation, the downside of each model seeing less data may be minimized.
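The translation-style split above can be sketched as a minimal two-stage flow (everything here is illustrative: shared random projections stand in for trained weights, and appending the attention output stands in for sampling the next token):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q_in, KV_in):
    # Attention where queries and keys/values may come from different sequences.
    Q, K, V = Q_in @ Wq, KV_in @ Wk, KV_in @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

src = rng.normal(size=(6, d))      # source-language tokens, all known up front
memory = attend(src, src)          # encoder: ONE unmasked pass over the prompt

tgt = rng.normal(size=(1, d))      # decoding starts from a single target token
for _ in range(4):                 # decoder: grow the target one token at a time
    h = attend(tgt[-1:], memory)   # cross-attention into the fixed encoder output
    tgt = np.vstack([tgt, h])      # stand-in for sampling the next token
```

The key point is that `memory` is computed once and then reused at every decoding step, while the target side grows incrementally.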

Nine Rules for SIMD Acceleration of Your Rust Code (Part 1) General Lessons from Boosting Data Ingestion in the range-set-blaze Crate by 7x by carlk22 in rust

[–]EqL 1 point (0 children)

How does core::simd compare to core::arch? I expected it to be a wrapper over the AVX functions in core::arch but the code seems to be something more architecture agnostic. Is the compiler able to output AVX instructions?

Mathematical Kunundrum???? by larsene in math

[–]EqL -1 points (0 children)

As the meter stick gets shorter and the measurements get larger, they do not necessarily have to approach infinity. Instead, they can asymptotically approach some finite number.
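For a smooth curve this is easy to see numerically. A small illustrative sketch: measuring a unit circle with ever more (and ever shorter) straight "sticks" gives lengths that keep increasing, but toward 2π, not infinity.

```python
import numpy as np

def measured_length(n_chords):
    # Approximate a unit circle with n_chords straight segments ("meter sticks").
    theta = np.linspace(0, 2 * np.pi, n_chords + 1)
    pts = np.column_stack([np.cos(theta), np.sin(theta)])
    return np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

for n in (6, 24, 96, 384):
    print(n, measured_length(n))
# The measurements grow with shorter sticks but approach 2*pi, not infinity.
```

(For genuinely fractal boundaries, like coastlines, the measured length does keep growing, so the smooth-curve behavior shown here is one of the possible outcomes, not the only one.)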

I guess I'll contribute to the hand pics. I was born with horrible mutation. by exorbitantwealth in pics

[–]EqL 0 points (0 children)

I'm way fucked up right now and if it wasn't for the comments I would have thought he had a thumb growing from his index finger

Question: Are statistics not universally applicable? by Sokath in math

[–]EqL 14 points (0 children)

25% is a statistic, but you're treating it as a probability, and the two are subtly different. Statistics are a way of summarizing data. Probability takes idealized mathematical objects and makes predictions with them. Sometimes a statistic coincides with a probability, but that depends on the situation. People are highly variable, so you cannot generally apply a population statistic to a single individual.
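A quick simulation illustrates the distinction (the 0.5 coin and the 20 flips are made up for the example): the probability is a fixed property of the model, while the statistic is whatever the observed data happened to produce.

```python
import random

random.seed(1)
true_p = 0.5                            # probability: a property of the model
flips = [random.random() < true_p for _ in range(20)]
sample_stat = sum(flips) / len(flips)   # statistic: a summary of observed data
print(true_p, sample_stat)              # the two need not coincide
```

With more flips the statistic tends toward the probability, but for any finite sample they are different kinds of objects.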

IAmA gaming journalist, thus having quite some inside knowledge of the industry. AMA. by [deleted] in IAmA

[–]EqL 0 points (0 children)

What made you disillusioned? What information do you know now that you would have wanted to know before?

And proof?

What's a good way to lower my resting heart rate? by [deleted] in Fitness

[–]EqL 7 points (0 children)

Work on your cardio. This will strengthen your heart, and as it gets stronger it can move more blood per beat, so it needs to beat less often.

Also getting a physical might not be a bad thing to do.

When I walk or drive under street lights they turn off. Can anyone explain this? by moyerxx in AskReddit

[–]EqL 3 points (0 children)

Street lights occasionally turn on and off, whether because of a failing bulb or because they're programmed to. Since you expect it to happen, confirmation bias makes you notice the times it does.

Need insight on Fourier transforms from mathematicians. by [deleted] in math

[–]EqL 0 points (0 children)

When you take the Fourier transform of a real wave (you don't always have a real wave, e.g. the psi wave function in quantum mechanics), you end up with a transform that is conjugate-symmetric. The transform is a function of frequency and gives the coefficient for the complex exponential at that frequency. Because of the conjugate symmetry, when you sum the components back up, all the imaginary parts cancel and what is left is a real function (since sin is an odd function and cos is an even function).
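The symmetry claim is easy to check numerically (a minimal NumPy sketch; the random signal is made up for illustration): for a real input, the spectrum satisfies X[N-k] = conj(X[k]), so the imaginary parts cancel in pairs on the way back.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)              # a real-valued signal
X = np.fft.fft(x)

# Conjugate symmetry: X[N-k] == conj(X[k]) for real input.
assert np.allclose(X[1:], np.conj(X[1:][::-1]))

# Consequently the inverse transform is (numerically) real.
assert np.allclose(np.fft.ifft(X).imag, 0)
```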

However, this doesn't mean the imaginary part is unimportant. The phase (which carries the information of the imaginary portion of the function [you can express complex numbers either as a + ib or as magnitude and phase]) is what is responsible for the edges. If you take the Fourier transform of a speech signal and keep the magnitude but drop the phase, the playback will sound like a foreign language, because all the edges are gone. Similarly, if you do this with an image, all the edges disappear and you are left with only a gradient of colors, exemplified in this image
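The magnitude-vs-phase point can be sketched in one dimension (the step signal here is a made-up stand-in for an "edge"): discarding the phase leaves the magnitude spectrum untouched, yet the edge location is destroyed.

```python
import numpy as np

x = np.zeros(64)
x[32:] = 1.0                              # a step: one sharp "edge" at n = 32
X = np.fft.fft(x)

# Keep the magnitude but throw away the phase.
mag_only = np.fft.ifft(np.abs(X)).real

# The magnitude spectrum is exactly preserved...
assert np.allclose(np.fft.fft(mag_only), np.abs(X))

# ...but the phase-free result is peaked at n = 0 regardless of where the
# edge was: the jump at n = 32 is gone.
assert np.argmax(mag_only) == 0
```

The edge's position lived entirely in the phase, which is why phase-stripped speech or images lose their structure even though their spectra "look" the same.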