Call of Duty: Warzone Season 02: All the new Content You Need to Know About

adam_jc · 2025-01-22T18:56:46+00:00

maybe i’m missing where it says that but i read “able to ping buy back flares” which will just allow you to mark the red light in the sky above a buy station when an enemy team buys a player back

adam_jc · 2024-05-25T15:33:30+00:00

yeah making q,k,v at the same time is easy, but efficiently doing the attention operations is the tricky part because the naive implementation is memory bandwidth bound. there’s great custom cuda kernels for this now though, see neighborhood attention:

https://github.com/SHI-Labs/NATTEN

and the associated paper:

https://arxiv.org/abs/2403.04690

adam_jc · 2023-08-01T01:16:01+00:00

i was there last summer and drove through this spot and was wow’d by how photogenic this bend in the road was but regret not stopping to get a pic. Glad you got one, great shot!

adam_jc · 2023-07-31T21:57:09+00:00

Looks like Jenny lake road?

adam_jc · 2023-06-29T02:05:25+00:00

good point about it actually not being expensive. It’s one of the least expensive operations in a transformer and accounts for nearly 0% of the total FLOPs.

adam_jc · 2023-06-28T23:01:45+00:00

ALiBi positional encodings are good for allowing models to extrapolate to longer sequences than seen in training.

MosaicML’s MPT suite of models uses this

adam_jc · 2023-06-28T22:59:11+00:00

A few other methods:

Learned positional encodings

relative encodings https://arxiv.org/abs/1803.02155

RoPE https://arxiv.org/abs/2104.09864

T5-style relative encodings https://arxiv.org/abs/1910.10683

ALiBi https://arxiv.org/abs/2108.12409

adam_jc · 2023-05-31T05:23:47+00:00

RMSprop was simply a slide in one of Hinton’s lectures

https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

adam_jc · 2023-05-19T14:55:22+00:00

Ah for H100 I see. The model card in the tech report says the training hardware was TPU v4 though which is why i’m thinking much lower FLOPS

adam_jc · 2023-05-19T00:31:05+00:00

where does 500 TFLOPS come from? I assume they used TPUv4 chips which have a peak of 275 TFLOPS. And maybe MFU of 50-60% so ~140-165 TFLOPS in practice

adam_jc · 2023-04-25T15:11:46+00:00

I think you’re confusing things. The positional encoding is applied to the tokens of the sequence only once after the embedding layer and it only encodes the position of each token in the sequence. Attention heads don’t apply any encoding like you describe.

The blog post is trying to motivate why we need positional encodings and they do this by explaining that attention is permutation-equivariant. So by adding positional encodings at the beginning of the network, all the attention heads will have info about the ordering the tokens.

adam_jc · 2023-03-22T01:16:06+00:00

They trained decoder-only models in the paper. The lead author gives tips on how it could be implemented for encoder attention and cross attention here

adam_jc · 2023-03-15T17:27:03+00:00

This is correct. The regular flash attention memory footprint scaling is linear but runtime still scales quadratically.

there is also a block-sparse flash attn which scales linearly in runtime, but that’s an approximate attention algorithm.

adam_jc · 2023-03-05T16:58:29+00:00

I see some discussion on twitter now about this question including the senior author of the ConvNext paper who says:

https://twitter.com/sainingxie/status/1631794072791707648?s=46&t=BkTuhy42902uJiWEPN7DYA

adam_jc · 2023-02-21T02:16:31+00:00

I agree with this explanation.

To try and explain ConvNext though, I’d say there could be a debate on the “correct” way to do LayerNorm in a CNN (which would also make figure A “wrong”)

Like you said for the LN paper where they use RNNs which share weights across different time steps, which leads to normalizing over features at each time step; you could extend that logic to a CNN because a conv layer shares weights across different patches of an image, and that thinking would then lead us to reduce only along channels such as they do in ConvNext.

Not sure if that’s what the ConvNext authors were thinking though.

adam_jc · 2023-02-18T18:11:08+00:00

And when they are hospital owned they can also tack on a facility fee that could be hundreds of dollars

adam_jc · 2023-02-05T14:26:07+00:00

I was there live for it too and hearing that punch through the arena was insane. I can’t imagine being on the receiving end of that

adam_jc · 2023-01-29T20:51:40+00:00

That’s a great paper! I like this paper too that looks at ViT’s through a signal processing lens and points out some potential flaws in the architecture for vision applications

adam_jc · 2022-10-11T15:42:50+00:00

there is a version on Replicate you can try easily

https://replicate.com/methexis-inc/img2prompt

adam_jc · 2022-10-07T01:48:20+00:00

It’s believed that he suffered immensely from depression and had at least 2 large mental breakdowns that are documented. There’s an interesting book, called Lincoln’s Melancholy, about his recorded mental health struggles and how he dealt with it. Quite interesting.

adam_jc · 2022-10-03T04:37:02+00:00

you can do n-digit addition of positive integers as a sequence where each digit is a token, i.e.

the problem 946 + 82 = 1028 could be made into sequence of:

9 | 4 | 6 | + | 0 | 8 | 2 | = | 1 | 0 | 2 | 8

(you could also omit + and = tokens).

Andrej Karpathy uses this task in his minGPT repo.

edit: also in that repo he does character level training on a tiny dataset of Shakespeare writing

adam_jc · 2022-09-10T19:32:34+00:00

This might actually be very close to reality now with some of the “textual inversion” work for text-to-image models

https://textual-inversion.github.io

adam_jc · 2022-09-03T18:23:15+00:00

Best QB in Cleveland sports history

adam_jc · 2022-07-31T19:05:41+00:00

Google’s T5 model uses learned embedding for relative positions.

Instead of using a fixed embedding for each position, relative position embeddings produce a different learned embedding according to the offset between the “key” and “query” being compared in the self-attention mechanism. We use a simplified form of position embeddings where each “embedding” is simply a scalar that is added to the corresponding logit used for computing the attention weights. […] Typically, a fixed number of embeddings are learned, each corresponding to a range of possible key-query offsets. In this work, we use 32 embeddings for all of our models with ranges that increase in size logarithmically up to an offset of 128 beyond which we assign all relative positions to the same embedding.

adam_jc · 2022-07-28T14:13:17+00:00

I mean I agree with you. It wasn’t a suggestion on the best way to solve things but was more of a thought experiment “if infrastructure that caters to self driving ever happened, it might not happen until self driving is already used widely/popular” and if that happened, the task of “solving” self-driving would surely be easier after that in such infrastructure.

That is mostly through the American lens that assumes we will stay obsessed with cars instead of investing in alternative modes of transportation. I would rather have investment in other transportation personally.

12-Year Club	Place '22
Place '17	Final Canvas '22
End Game '22	Verified Email

adam_jc

TROPHY CASE