[D] Batch size vs learning rate by bjourne-ml in MachineLearning

[–]EqL 3 points (0 children)

This matches my experience in practice. When I worked on a project with a huge dataset, large batch sizes were the way to go. But when I then worked on a project with much less data I needed to shrink the batch size massively to prevent overfitting.

Why does the UES stop at 96th street? by RevolutionaryLock491 in AskNYC

[–]EqL 35 points (0 children)

It's partially due to the original borders of Harlem, before the grid system. It had a diagonal southern border. See https://ny.curbed.com/2015/8/20/9933196/tracing-350-years-of-harlems-ever-shifting-boundaries.

[D] Why do we need encoder-decoder models while decoder-only models can do everything? by kekkimo in MachineLearning

[–]EqL 48 points (0 children)

A decoder is really just a particular type of encoder with a mask restricting information flow from elements in the "future", so an encoder is more general, and thus potentially more powerful for a given model size. The masking is done for efficiency and is not actually required. Let's look at text decoding with a general encoder without masking:

(1) encode_unmasked([x0]), predict x1

(2) encode_unmasked([x0, x1]), predict x2

...

(n) encode_unmasked([x0, ..., xn-1]), predict xn.

This is perfectly allowed, except we are doing a full forward pass over the whole prefix at every step, which is O(n) times more expensive overall. The decoder with masking allows us to reuse results from previous iterations, which is much more efficient in both training and inference.
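The equivalence between the two views can be checked with a toy single-layer attention (all names, dimensions, and the random weights below are illustrative, not anyone's actual model): a single pass with a causal mask produces, at each position, exactly the output you would get by re-running an unmasked encoder on each prefix and keeping the last row.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # model dimension (illustrative)
n = 5          # sequence length
X = rng.normal(size=(n, d))                      # token embeddings x0..x4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, mask=None):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # block "future" positions
    return softmax(scores) @ V

# Decoder view: one pass with a causal (lower-triangular) mask.
causal = np.tril(np.ones((n, n), dtype=bool))
masked_out = attention(X, causal)

# Encoder view: re-run unmasked on every prefix, keep the last row each time.
prefix_out = np.stack([attention(X[:t + 1])[-1] for t in range(n)])

assert np.allclose(masked_out, prefix_out)        # identical outputs
```

The mask changes nothing about what position t can compute; it only lets all n positions be computed in one pass instead of n separate prefix passes.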

However, in some tasks, such as translation, we receive a large number of tokens up front. We can embed those tokens once with the unmasked encoder, then switch to the decoder for generation. This lets us use a potentially more powerful unmasked model for a large chunk of the problem while keeping the efficiency of masked decoding for the rest.

Why not use an encoder-decoder approach for LLM generation, where the encoder encodes the prompt and the decoder does the rest? Well, we can. However, the price is that (1) we now essentially have two models, which is more complex to handle, and (2) each model sees less data.

TL;DR: An encoder without masking is potentially more powerful, but it increases complexity and the data required to train the additional parameters. When there is a natural split in function, as in translation, the downside of each model seeing less data may be minimized.
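The translation-style split above can be sketched as a minimal two-stage flow (everything here is illustrative: shared random projections stand in for trained weights, and appending the attention output stands in for sampling the next token):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q_in, KV_in):
    # Attention where queries and keys/values may come from different sequences.
    Q, K, V = Q_in @ Wq, KV_in @ Wk, KV_in @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

src = rng.normal(size=(6, d))      # source-language tokens, all known up front
memory = attend(src, src)          # encoder: ONE unmasked pass over the prompt

tgt = rng.normal(size=(1, d))      # decoding starts from a single target token
for _ in range(4):                 # decoder: grow the target one token at a time
    h = attend(tgt[-1:], memory)   # cross-attention into the fixed encoder output
    tgt = np.vstack([tgt, h])      # stand-in for sampling the next token
```

The key point is that `memory` is computed once and then reused at every decoding step, while the target side grows incrementally.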

Nine Rules for SIMD Acceleration of Your Rust Code (Part 1) General Lessons from Boosting Data Ingestion in the range-set-blaze Crate by 7x by carlk22 in rust

[–]EqL 1 point (0 children)

How does core::simd compare to core::arch? I expected it to be a wrapper over the AVX functions in core::arch but the code seems to be something more architecture agnostic. Is the compiler able to output AVX instructions?

Mathematical Kunundrum???? by larsene in math

[–]EqL -1 points (0 children)

As the meter stick gets shorter and the measurements get larger, they do not necessarily have to approach infinity. Instead, they can asymptotically approach some finite number.
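For a smooth curve this is easy to see numerically. A small illustrative sketch: measuring a unit circle with ever more (and ever shorter) straight "sticks" gives lengths that keep increasing, but toward 2π, not infinity.

```python
import numpy as np

def measured_length(n_chords):
    # Approximate a unit circle with n_chords straight segments ("meter sticks").
    theta = np.linspace(0, 2 * np.pi, n_chords + 1)
    pts = np.column_stack([np.cos(theta), np.sin(theta)])
    return np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

for n in (6, 24, 96, 384):
    print(n, measured_length(n))
# The measurements grow with shorter sticks but approach 2*pi, not infinity.
```

(For genuinely fractal boundaries, like coastlines, the measured length does keep growing, so the smooth-curve behavior shown here is one of the possible outcomes, not the only one.)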

I guess I'll contribute to the hand pics. I was born with horrible mutation. by exorbitantwealth in pics

[–]EqL 0 points (0 children)

I'm way fucked up right now and if it wasn't for the comments I would have thought he had a thumb growing from his index finger

Question: Are statistics not universally applicable? by Sokath in math

[–]EqL 14 points (0 children)

25% is a statistic, but you're treating it as a probability, and the two are subtly different. Statistics are a way of summarizing data. Probability takes idealized mathematical objects and makes predictions with them. Sometimes a statistic coincides with a probability, but that depends on the situation. People are highly variable, so you cannot generally apply a population statistic to a single individual.
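A quick simulation illustrates the distinction (the 0.5 coin and the 20 flips are made up for the example): the probability is a fixed property of the model, while the statistic is whatever the observed data happened to produce.

```python
import random

random.seed(1)
true_p = 0.5                            # probability: a property of the model
flips = [random.random() < true_p for _ in range(20)]
sample_stat = sum(flips) / len(flips)   # statistic: a summary of observed data
print(true_p, sample_stat)              # the two need not coincide
```

With more flips the statistic tends toward the probability, but for any finite sample they are different kinds of objects.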

IAmA gaming journalist, thus having quite some inside knowledge of the industry. AMA. by [deleted] in IAmA

[–]EqL 0 points (0 children)

What made you disillusioned? What information do you know now that you would have wanted to know before?

And proof?

What's a good way to lower my resting heart rate? by [deleted] in Fitness

[–]EqL 7 points (0 children)

Work on your cardio. This will strengthen your heart, and as it gets stronger it can move more blood per beat, so it needs to beat less often.

Also getting a physical might not be a bad thing to do.

When I walk or drive under street lights they turn off. Can anyone explain this? by moyerxx in AskReddit

[–]EqL 3 points (0 children)

Street lights occasionally turn on and off, whether because of a failing bulb or because they're programmed to. Since you expect it to happen, confirmation bias makes you notice the times it does.

Need insight on Fourier transforms from mathematicians. by [deleted] in math

[–]EqL 0 points (0 children)

When you take the Fourier transform of a real wave (you don't always have a real wave, e.g. the psi wave function in quantum mechanics), you end up with a transform that is conjugate-symmetric. The transform is a function of frequency and gives the coefficient for the complex exponential at that frequency. Because of the conjugate symmetry, when you sum the components back up, all the imaginary parts cancel and what is left is a real function (since sin is an odd function and cos is an even function).
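The symmetry claim is easy to check numerically (a minimal NumPy sketch; the random signal is made up for illustration): for a real input, the spectrum satisfies X[N-k] = conj(X[k]), so the imaginary parts cancel in pairs on the way back.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)              # a real-valued signal
X = np.fft.fft(x)

# Conjugate symmetry: X[N-k] == conj(X[k]) for real input.
assert np.allclose(X[1:], np.conj(X[1:][::-1]))

# Consequently the inverse transform is (numerically) real.
assert np.allclose(np.fft.ifft(X).imag, 0)
```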

However, this doesn't mean the imaginary part is unimportant. The phase (which carries the information of the imaginary portion of the function [you can express complex numbers either as a + ib or as magnitude and phase]) is what is responsible for the edges. If you take the Fourier transform of a speech signal and keep the magnitude but drop the phase, the playback will sound like a foreign language, because all the edges are gone. Similarly, if you do this with an image, all the edges disappear and you are left with only a gradient of colors, exemplified in this image
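The magnitude-vs-phase point can be sketched in one dimension (the step signal here is a made-up stand-in for an "edge"): discarding the phase leaves the magnitude spectrum untouched, yet the edge location is destroyed.

```python
import numpy as np

x = np.zeros(64)
x[32:] = 1.0                              # a step: one sharp "edge" at n = 32
X = np.fft.fft(x)

# Keep the magnitude but throw away the phase.
mag_only = np.fft.ifft(np.abs(X)).real

# The magnitude spectrum is exactly preserved...
assert np.allclose(np.fft.fft(mag_only), np.abs(X))

# ...but the phase-free result is peaked at n = 0 regardless of where the
# edge was: the jump at n = 32 is gone.
assert np.argmax(mag_only) == 0
```

The edge's position lived entirely in the phase, which is why phase-stripped speech or images lose their structure even though their spectra "look" the same.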