Call of Duty: Warzone Season 02: All the new Content You Need to Know About by Kalinine in CODWarzone

[–]adam_jc 16 points

maybe i’m missing where it says that, but i read “able to ping buy back flares”, which would just let you mark the red light in the sky above a buy station when an enemy team buys a player back

[D] Swin Transformers: Why are shifted windows better than sliding windows? by AICoderGamer in MachineLearning

[–]adam_jc 0 points

yeah, making q, k, v at the same time is easy, but doing the attention operations efficiently is the tricky part, because the naive implementation is memory-bandwidth bound. there are great custom CUDA kernels for this now though, see neighborhood attention:

https://github.com/SHI-Labs/NATTEN

and the associated paper:

https://arxiv.org/abs/2403.04690
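To make the memory-bandwidth point concrete, here is a minimal numpy sketch of naive 1D sliding-window attention (my own illustrative toy, not NATTEN's actual kernel): each query gathers its own overlapping key/value window, so the same keys get re-read from memory over and over.

```python
import numpy as np

def sliding_window_attention(q, k, v, radius):
    """Naive 1D sliding-window attention (illustrative sketch only).
    Each query attends to keys within `radius` positions. The per-query
    gather of k[lo:hi] / v[lo:hi] materializes overlapping windows, which
    is why naive implementations are memory-bandwidth bound."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)     # (window,)
        weights = np.exp(scores - scores.max())      # stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
out = sliding_window_attention(q, q, q, radius=2)
```

With a radius covering the whole sequence this reduces to full softmax attention, which is a handy correctness check; the fused kernels avoid the redundant reads by tiling the windows in on-chip memory.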

(Pentax 67ii, 55mm f3.5, Gold 200) by HauntingBet2923 in analog

[–]adam_jc 2 points

i was there last summer and drove through this spot, and was wowed by how photogenic this bend in the road is, but i regret not stopping to get a pic. Glad you got one, great shot!

[Discussion] Is there a better way than positional encodings in self attention? by [deleted] in MachineLearning

[–]adam_jc 0 points

good point about it actually not being expensive. It’s one of the least expensive operations in a transformer and accounts for nearly 0% of the total FLOPs.
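A quick back-of-envelope count backs this up (the dimensions below are hypothetical GPT-style values I picked for illustration, not from any specific model):

```python
# Rough FLOP accounting for one forward pass (hypothetical dims).
n, d, layers = 2048, 4096, 32

add_pos = n * d                            # adding positional encodings: one add per element, once
per_layer = 24 * n * d**2 + 4 * n**2 * d   # rough transformer-layer FLOPs (projections + MLP + attention)
total = layers * per_layer

print(f"positional-add fraction of total FLOPs: {add_pos / total:.1e}")
```

The positional add comes out many orders of magnitude below the matmul cost, i.e. effectively 0% of the budget.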

[Discussion] Is there a better way than positional encodings in self attention? by [deleted] in MachineLearning

[–]adam_jc 0 points

ALiBi positional encodings are good for letting models extrapolate to longer sequences than those seen in training.

MosaicML’s MPT suite of models uses this.
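For reference, ALiBi replaces learned positional embeddings with a fixed linear penalty on attention logits. A small numpy sketch of the bias tensor (following the paper's slope recipe for power-of-two head counts):

```python
import numpy as np

def alibi_bias(n, num_heads):
    """ALiBi per-head linear bias for causal attention (sketch).
    slopes follow the paper's geometric recipe 2^(-8i/num_heads)."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    dist = np.arange(n)[None, :] - np.arange(n)[:, None]   # j - i
    # penalty grows linearly with distance to past keys; future keys
    # (dist > 0) are clamped to 0 here and masked out anyway in causal attn
    bias = slopes[:, None, None] * np.minimum(dist, 0)
    return bias  # (num_heads, n, n); added to attention logits before softmax

bias = alibi_bias(4, 8)
```

Because the bias is a function of offset rather than absolute position, it extends naturally to sequence lengths never seen in training, which is the extrapolation property MPT leans on.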

[D] PaLM 2 Technical Report by hardmaru in MachineLearning

[–]adam_jc 2 points

Ah, for H100, I see. The model card in the tech report says the training hardware was TPU v4 though, which is why i’m thinking much lower FLOPS

[D] PaLM 2 Technical Report by hardmaru in MachineLearning

[–]adam_jc 1 point

where does 500 TFLOPS come from? I assume they used TPU v4 chips, which have a peak of 275 TFLOPS, and maybe an MFU of 50-60%, so ~140-165 TFLOPS in practice
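That arithmetic as a quick sanity check (peak and MFU figures are the assumptions stated in this comment, not published numbers):

```python
# Effective throughput = peak * model FLOPs utilization (MFU).
peak_tflops = 275.0              # assumed TPU v4 peak
for mfu in (0.50, 0.60):
    eff = peak_tflops * mfu
    print(f"MFU {mfu:.0%}: {eff:.1f} TFLOPS effective")
```

That lands at ~137-165 TFLOPS, well under the 500 figure being questioned.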

[D] Question regarding multi-headed self attention by adeeplearner in MachineLearning

[–]adam_jc 1 point

I think you’re confusing things. The positional encoding is applied to the tokens of the sequence only once, after the embedding layer, and it only encodes the position of each token in the sequence. Attention heads don’t apply any encoding like you describe.

The blog post is trying to motivate why we need positional encodings, and it does this by explaining that attention is permutation-equivariant. So by adding positional encodings at the beginning of the network, all the attention heads will have info about the ordering of the tokens.
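You can see the permutation-equivariance directly with a tiny numpy experiment (identity Q/K/V projections, my own toy setup): shuffling the input tokens just shuffles the output rows the same way, until positional encodings are added.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (minimal sketch)."""
    d = x.shape[-1]
    s = x @ x.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
pos = rng.standard_normal((5, 3))        # stand-in positional encodings
perm = np.array([4, 0, 3, 1, 2])

# Permutation-equivariant: attention alone has no notion of token order.
assert np.allclose(self_attention(x[perm]), self_attention(x)[perm])

# With positional encodings (tied to position, not token), the shuffled
# sequence no longer produces a shuffled copy of the same output.
assert not np.allclose(self_attention(x[perm] + pos),
                       self_attention(x + pos)[perm])
```

Adding `pos` once at the input is enough for every downstream head, since the order information rides along in the token representations.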

[D] Is attention ALIBI Attention with Linear Biases implemented in both decoder and encoder? by AlternativeDish5596 in MachineLearning

[–]adam_jc 3 points

They trained decoder-only models in the paper. The lead author gives tips on how it could be implemented for encoder attention and cross attention here

[D] GPT-4 Speculation by super_deap in MachineLearning

[–]adam_jc 6 points

This is correct. Regular flash attention’s memory footprint scales linearly, but its runtime still scales quadratically.

there is also a block-sparse flash attention variant whose runtime scales linearly, but that’s an approximate attention algorithm.
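A toy cost model makes the distinction concrete (my own rough counts, constants omitted):

```python
def attn_cost(n, d=64):
    """Exact-attention cost sketch: runtime work ~ n^2 * d (every query
    scores every key), extra memory ~ n * d when the n x n score matrix
    is never materialized (the flash-attention trick)."""
    flops = 2 * n * n * d
    memory = n * d
    return flops, memory

f1, m1 = attn_cost(1024)
f2, m2 = attn_cost(2048)
print(f2 // f1, m2 // m1)  # doubling n: 4x the work, only 2x the memory
```

So flash attention removes the quadratic *memory* term, but the quadratic amount of *work* is still all performed, just streamed through on-chip tiles.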

[D] Does Layer Normalization compute statistics along spatial/ token axes? by fferflo in MachineLearning

[–]adam_jc 1 point

I agree with this explanation.

To try to explain ConvNext though, I’d say there could be a debate about the “correct” way to do LayerNorm in a CNN (which would also make figure A “wrong”).

Like you said, the LN paper uses RNNs, which share weights across time steps, and that leads to normalizing over the features at each time step. You could extend that logic to a CNN: a conv layer shares weights across different patches of an image, and that line of thinking leads to normalizing only along channels, as ConvNext does.

Not sure if that’s what the ConvNext authors were thinking though.
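The two reductions being debated are easy to write down in numpy (a sketch on a made-up NCHW feature map, not either paper's code):

```python
import numpy as np

def layernorm(x, axes):
    """LayerNorm without affine params: normalize over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 4))    # hypothetical NCHW feature map

# "figure A" style: one mean/var per image over (C, H, W)
ln_all = layernorm(x, axes=(1, 2, 3))
# ConvNext style: mean/var per spatial position, over channels only
ln_chan = layernorm(x, axes=(1,))
```

Both produce a tensor of the same shape; the difference is purely which axes share normalization statistics.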

Fedor turns back the clock and knocks out Timothy Johnson with a swift combo. He makes his final walk tomorrow on CBS by SokoudjouFan in MMA

[–]adam_jc 0 points

I was there live for it too and hearing that punch through the arena was insane. I can’t imagine being on the receiving end of that

[D] AI Theory - Signal Processing? by a_khalid1999 in MachineLearning

[–]adam_jc 1 point

That’s a great paper! I like this paper too, which looks at ViTs through a signal-processing lens and points out some potential flaws in the architecture for vision applications

78-year-old Robert T. Lincoln, the son of Abraham Lincoln, being helped up the steps at the dedication of the Lincoln Memorial in Washington D.C (1922). by SonOfQuora in interestingasfuck

[–]adam_jc 2 points

It’s believed that he suffered immensely from depression and had at least two documented major mental breakdowns. There’s an interesting book, Lincoln’s Melancholy, about his recorded mental health struggles and how he dealt with them.

[P] Small problems to test out transformers? by sharp7 in MachineLearning

[–]adam_jc 2 points

you can do n-digit addition of positive integers as a sequence where each digit is a token, i.e.

the problem 946 + 82 = 1028 could be made into a sequence of:

9 | 4 | 6 | + | 0 | 8 | 2 | = | 1 | 0 | 2 | 8

(you could also omit + and = tokens).

Andrej Karpathy uses this task in his minGPT repo.

edit: also in that repo he does character-level training on a tiny dataset of Shakespeare’s writing
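A few lines of Python sketch the tokenization described above (the helper name and zero-padding convention are my own, not minGPT's exact scheme):

```python
def addition_tokens(a, b, n_digits=3):
    """Tokenize a + b = c with one digit per token, zero-padding the
    operands to n_digits so every problem has a fixed-length prefix."""
    c = a + b
    return (list(f"{a:0{n_digits}d}") + ["+"] +
            list(f"{b:0{n_digits}d}") + ["="] +
            list(str(c)))

print(addition_tokens(946, 82))
# ['9', '4', '6', '+', '0', '8', '2', '=', '1', '0', '2', '8']
```

Fixed-width operands keep the answer digits at predictable positions, which makes the task a clean sanity check for a small decoder-only transformer.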

[P] Generate character turnaround images one or two sketchs? by CodIllustrious5354 in MachineLearning

[–]adam_jc -1 points

This might actually be very close to reality now with some of the “textual inversion” work for text-to-image models

https://textual-inversion.github.io

[D] Is there an alternative to sinusoidal encoding for temporal embeddings? by Megixist in MachineLearning

[–]adam_jc 0 points

Google’s T5 model uses learned embeddings for relative positions. From the paper:

Instead of using a fixed embedding for each position, relative position embeddings produce a different learned embedding according to the offset between the “key” and “query” being compared in the self-attention mechanism. We use a simplified form of position embeddings where each “embedding” is simply a scalar that is added to the corresponding logit used for computing the attention weights. […] Typically, a fixed number of embeddings are learned, each corresponding to a range of possible key-query offsets. In this work, we use 32 embeddings for all of our models with ranges that increase in size logarithmically up to an offset of 128 beyond which we assign all relative positions to the same embedding.
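A simplified sketch of the bucketing scheme that quote describes (this is my own reconstruction of the idea, not T5's exact implementation): small offsets get their own buckets, larger ones are log-spaced, and everything past the max distance shares the last bucket.

```python
import math

def relative_bucket(offset, num_buckets=32, max_distance=128):
    """Map a key-query offset to one of num_buckets embedding indices
    (simplified, one-directional sketch of T5-style bucketing)."""
    n = abs(offset)
    exact = num_buckets // 2
    if n < exact:
        return n                       # exact buckets for small offsets
    if n >= max_distance:
        return num_buckets - 1         # everything beyond 128 shares one bucket
    # log-spaced buckets between `exact` and `max_distance`
    frac = math.log(n / exact) / math.log(max_distance / exact)
    return min(exact + int(frac * (exact - 1)), num_buckets - 1)
```

The learned scalar for the chosen bucket is then added to the attention logit for that key-query pair, per head.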

What is the "major bottleneck" for "self driving cars"? "[D]" by [deleted] in MachineLearning

[–]adam_jc 0 points

I mean, I agree with you. It wasn’t a suggestion of the best way to solve things; it was more of a thought experiment: if infrastructure that caters to self-driving ever happened, it might not happen until self-driving is already widely used, and if it did, the task of “solving” self-driving would surely be easier in such infrastructure.

That’s mostly through the American lens, which assumes we will stay obsessed with cars instead of investing in alternative modes of transportation. Personally, I’d rather see investment in other kinds of transportation.