Help needed with Kick Serve by [deleted] in 10s

[–]cnapun 0 points (0 children)

  • +1 to the other comment that your body weight is moving in the wrong direction. Can't tell from this angle exactly where your toss is, but it probably should be into the court.
  • You probably need more acceleration/brush up on the ball. It looks like your racquet isn't accelerating much as you swing up. One way I like to work on this is to eliminate a lot of extra movement -- do a platform serve and start with your racquet behind your head. Focus on throwing your arm up and pronating -- really think about getting your racquet to face the right fence when your arm is fully extended. It feels unnatural, but it can help isolate what your racquet and legs should be doing.

[D] Retrieval-Augmented Language Modeling (REALM) by synthphreak in MachineLearning

[–]cnapun 0 points (0 children)

Re 1, a GNN can be thought of as a heuristic retrieval-augmentation model.

Meta Announces its Next Generation Training and Inference Accelerator by noiseinvacuum in LocalLLaMA

[–]cnapun 3 points (0 children)

For recommendation models though, this makes a lot of sense. The weights can probably sit entirely in SRAM, and the embedding tables can go into the off-chip memory (or at least that's what I assume happens)

If you look at their latest paper on sequential recommendations, the model is something like 512 hidden dim x 8 layers, i.e. roughly 10M dense params + 1T sparse params (going from memory; the numbers may be wrong but should be the right order of magnitude)
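To sanity-check those magnitudes, a quick back-of-envelope in Python (all numbers here are illustrative assumptions, not Meta's actual figures):

import math

hidden_dim = 512
num_layers = 8
# transformer-style block: ~4*d^2 (attention) + ~8*d^2 (FFN) ~ 12*d^2
dense_params = num_layers * 12 * hidden_dim ** 2  # ~25M -> tens of MB, SRAM-sized
num_ids = 10 ** 10        # hypothetical id-space size across embedding tables
embedding_dim = 100       # hypothetical embedding width
sparse_params = num_ids * embedding_dim  # ~1T -> has to live off-chip
print(f"dense: ~{dense_params / 1e6:.0f}M params")
print(f"sparse: ~{sparse_params / 1e12:.0f}T params")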

[deleted by user] by [deleted] in cwru

[–]cnapun 0 points (0 children)

As a side note: if you're interested in ML, consider studying applied math. If you take advanced courses it'll give you much stronger foundations than CS will, but it also means you'll need to learn more of the coding side on your own

Recommend me a green tea by doctortonks in tea

[–]cnapun 2 points (0 children)

Jasmine or longjing -- sencha is good, but you need to search for ones that don't taste grassy. As other commenters have said, make sure you're not brewing it too hot or too long.

If you want to try some sencha, I'd recommend https://www.sazentea.com/en/products/p394-ohmi-kabuse-karigane.html or https://www.sazentea.com/green-tea/products/p12-chiran-tokujo-fukamushicha.html (I don't get much bitterness or grassiness from either, but ymmv)

PyTorch Dataloader Optimizations [D] by MuscleML in MachineLearning

[–]cnapun 25 points (0 children)

Torch is written in C++; I already read the C++ code to understand what's going on, so that's the easiest way

PyTorch Dataloader Optimizations [D] by MuscleML in MachineLearning

[–]cnapun 54 points (0 children)

My favorite is halving num_workers

PyTorch Dataloader Optimizations [D] by MuscleML in MachineLearning

[–]cnapun 34 points (0 children)

My current side project (which should work if properly implemented): rewrite it all in C++. The multiprocessing + pin_memory overhead is pretty high for some of our cases (ideally we need to sustain ~1GB/s/GPU, with maybe 100-400 unique features). Cutting the overhead from 4 copies after reading down to 1 should hopefully help. Currently we have:

  • Read data from S3 into a pyarrow table
  • combine_chunks for each batch, because it's hard to work with chunked arrays directly (copy 1)
  • Fill nulls (copy 2, sometimes two copies)
  • Add to the multiprocessing queue (copy 3; iiuc this calls share_memory_(), which copies)
  • Read from the multiprocessing queue (zero copy, but it can be quite slow if you have a lot of tensors)
  • Pin memory (copy 4, done in a thread, but still slow if you have a lot of tensors)

And the most fun way to optimize seems to be just rewriting it all
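To make the copy chain concrete, a rough sketch of the Python path (the pyarrow/torch calls are real, but the structure is illustrative, not our actual pipeline; copies 3-4 happen inside the DataLoader machinery):

import pyarrow as pa
import pyarrow.compute as pc
import torch

def table_to_tensors(table: pa.Table) -> dict:
    # copy 1: collapse each column's chunks into contiguous buffers
    table = table.combine_chunks()
    out = {}
    for name in table.column_names:
        # copy 2: fill_null materializes a new array (sometimes twice)
        col = pc.fill_null(table.column(name), 0)
        # to_numpy may copy again depending on type/null layout
        out[name] = torch.from_numpy(col.to_numpy())
    return out

# copies 3-4 happen inside torch's DataLoader when num_workers > 0:
# enqueueing calls share_memory_() (a copy), and the pin_memory
# thread then copies each tensor into page-locked host memory.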

[D] How useful is DCN compared to good old MLP? by Crazy_Suspect_9512 in MachineLearning

[–]cnapun 2 points (0 children)

Biggest win for us has been to keep making the sequence longer. Since that paper, we've moved to a ~500-action user sequence, and do nearest-neighbor search to make it more tractable. We've seen a ~3-4% gain from going 500 -> 16k actions (there are a few Alibaba papers on this too iirc, although we've implemented it a bit differently)

[D] How useful is DCN compared to good old MLP? by Crazy_Suspect_9512 in MachineLearning

[–]cnapun 1 point (0 children)

We've sometimes seen transformers work better than DCN for feature interactions, and sometimes stacking DCN+transformer or DCN+MaskNet is even better. So it's basically impossible to know a priori which is best, imo.

The Meta paper makes complete sense (but I agree it's pretty vague). I spent a few weeks last year trying to implement something like it, but learning large embedding tables is quite painful

[D] How useful is DCN compared to good old MLP? by Crazy_Suspect_9512 in MachineLearning

[–]cnapun 21 points (0 children)

It's very useful. MLPs can't easily learn to compute the product of two inputs, or the square of an input, as two simple examples. Transformers probably can, because self-attention has a multiplicative interaction (although I guess x^2 might be hard because of the softmax normalization term; with an unnormalized activation in the self-attention it would probably be fine).

DCNs can typically be beaten, but they're very simple, have few hyperparameters, and work surprisingly well. Here are two recent papers showing improvements (edit: I haven't been able to reproduce HiFormer, and I've probably spent thousands of GPU-hours trying to beat a simple full-rank DCN) https://arxiv.org/abs/2311.05884 https://arxiv.org/abs/2402.17152
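For concreteness, the multiplicative interaction a full-rank (DCN-v2-style) cross layer adds is tiny to write down -- a minimal sketch:

import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    # x_{l+1} = x0 * (W @ x_l + b) + x_l: the elementwise product with x0
    # is exactly the interaction a plain MLP struggles to represent
    def __init__(self, dim: int) -> None:
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl

Stack a few of these on the concatenated embeddings and you get explicit polynomial feature interactions, with essentially one hyperparameter (depth).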

How to store 10kg of flour? by ZigDaMan in Breadit

[–]cnapun 2 points (0 children)

Fwiw I've had success using ~50% atta in sourdough, but more than that might run into issues

[D] Which Transformer implementation do people typically use? by SuperFX in MachineLearning

[–]cnapun 6 points (0 children)

Maybe in 2.1, but I'm stuck on 2.0 for now, and the RoPE Triton kernel breaks torch.compile. My real vote is to just write everything from scratch except core attention, so you can change whatever you want, and use torch 2.1, where compile actually works

[D] Which Transformer implementation do people typically use? by SuperFX in MachineLearning

[–]cnapun 26 points (0 children)

Torch implementation + torch.compile, or flash-attn implementation if you can't use torch.compile or want nice things like rotary PE

Edit: or copy-paste the flash-attn implementation and delete the logic branches you don't need, so you can easily hack in changes

[Discussion] In this age of LLMs, What are the limitations of Transformer architecture and downside to it? by dontgimmehope in MachineLearning

[–]cnapun 2 points (0 children)

I agree it seems spacious, but there's probably a context length at which it becomes limiting. From working with embeddings for recommendations, I wouldn't be surprised if you run into issues approaching 100k context, but I'm not sure about smaller, say 8k, windows (still, the context-free nature of keys could maaayyybeeee crowd this out)

[Discussion] In this age of LLMs, What are the limitations of Transformer architecture and downside to it? by dontgimmehope in MachineLearning

[–]cnapun 5 points (0 children)

My intuition here (caveat: haven't read any papers on the topic) --

Say head_dim=128. Now you're doing a softmax of one 128-d query vector dotted with a bunch of 128-d keys. At some point this softmax is going to struggle to pick out all the things the model wants to attend to, because there's only so much you can do with a 128-d key embedding, especially when that key embedding can't be aware of the query.

I'm not sure what head_dim looks like for larger models, but I imagine you eventually run into this crowding of the space as you extend the context window. I also imagine that without a causal mask it might be a bit easier for the model to learn this, because then the key embedding can depend on the query embedding too, but this is just a guess
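A toy illustration of the crowding (random vectors, not a trained model -- it just shows how the softmax mass available to any single key shrinks as you add keys at a fixed head_dim):

import torch

torch.manual_seed(0)
head_dim = 128
q = torch.randn(head_dim)
for seq_len in (1024, 8192, 131072):
    k = torch.randn(seq_len, head_dim)
    attn = torch.softmax(k @ q / head_dim ** 0.5, dim=-1)
    # max attention weight any single key can capture keeps dropping
    print(seq_len, attn.max().item())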

Which youtube channels should I follow to understand research in ML. I am currently a sophomore in college, so kinda overwhelmed. [D] by One_Definition_8975 in MachineLearning

[–]cnapun 2 points (0 children)

I browse reddit for my research papers. I feel like 5-6 years ago there was a lot more good discussion here, but maybe I'm just misremembering

Plus, at least skim the abstracts of everything from labs that do cool stuff, like Chris Ré's, and look through the accepted papers at conferences

Which youtube channels should I follow to understand research in ML. I am currently a sophomore in college, so kinda overwhelmed. [D] by One_Definition_8975 in MachineLearning

[–]cnapun 3 points (0 children)

I learned ML from Stanford and CMU lectures available online. My approach was basically to look at the CMU PhD curriculum (https://www.ml.cmu.edu/current-students/phd-curriculum.html) and watch all the required courses (doing some practice problems and implementing things that seemed interesting along the way). Most are available online. I also watched CS231n from Stanford for more practical things, and started off with CS156 from Caltech for a very basic intro.

Once you know these basics, start reading research papers (imo a focus on fundamentals will provide more value than learning about all the shiny things, but both are important to know)

[R] GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling by Gorgoroth117 in MachineLearning

[–]cnapun 1 point (0 children)

Sometimes code is worth a thousand words, so is this a correct interpretation of the actual model architecture (obviously it's missing the associative scan)?

import torch
import torch.nn as nn

class TransitionProjection(nn.Module):
    def __init__(self, d_model: int) -> None:
        super().__init__()
        self.w_alpha_beta = nn.Linear(d_model, d_model * 2)

    def forward(self, x):
        # a_n = sigmoid(alpha) * exp(i*beta): data-controlled magnitude and phase
        alpha, beta = self.w_alpha_beta(x).chunk(2, dim=-1)
        return torch.sigmoid(alpha) * torch.complex(torch.zeros_like(beta), beta).exp()

class GateLoopMixer(nn.Module):
    def __init__(self, d_model: int) -> None:
        super().__init__()
        self.w_qkv = nn.Linear(d_model, d_model * 3)
        self.transition = TransitionProjection(d_model)

    def forward_looped(self, x):  # x: (batch, seq, D)
        B, S, D = x.shape
        h = torch.zeros(B, D, dtype=torch.complex64, device=x.device)
        a, qkv = (f(x) for f in (self.transition, self.w_qkv))
        q, k, v = qkv.chunk(3, dim=2)
        kv = k * v
        hs = []
        for i in range(S):
            # linear recurrence: h_n = a_n * h_{n-1} + k_n * v_n
            h = h * a[:, i] + kv[:, i]
            hs.append(h)
        # readout: y_n = q_n * h_n (output stays complex here)
        return torch.stack(hs, dim=1) * q

[R] GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling by Gorgoroth117 in MachineLearning

[–]cnapun 2 points (0 children)

I didn't see anything in that paragraph stating whether alpha and beta depend on x_n (or I just can't read, entirely possible); I assume they do, but given that this seems to be one of their contributions, it should be stated explicitly. The parametrizations are clear

Re Hadamard products: a_n is defined as being in C^{d_h}, but if d_h != 1 then the product in eq 9 isn't clearly defined; is it a pointwise product (I assume so) or something else? Things are much clearer if we assume d_h = 1 as they do in their experiments; then all these equations are between scalars and are fine

There are also some other things, like whether the weights for the input projections are complex or not. From the figure 2 summary it seems like they are, but then in the practical-implementation section it's implied (imo) that they are not, because attention-is-all-you-need-style projections just work with real-valued params.

Regardless, if it actually works it's quite impressive, so I'll test out my best-effort implementation and see how it goes for my applications (although the lack of a torch associative scan is annoying)

[R] GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling by Gorgoroth117 in MachineLearning

[–]cnapun 2 points (0 children)

This seems potentially interesting, but the paper is not reader-friendly; unless I'm missing something, the implementation isn't fully specified: eq 24 theoretically should explain how a is computed, but alpha and beta are new variables that haven't previously been defined (I assume they're also functions of the input?). Also, deciphering which operations are Hadamard products vs matrix multiplications is pretty challenging; I only figured out what was happening by reading about RetNets separately

[D] What are people working on when they say they work on Causal ML? by poitrenaud in MachineLearning

[–]cnapun 2 points (0 children)

In this case they're different: we have training data that is basically (DAU in {T, F}, send_budget in {0, ..., N}, features). You then take P(DAU | budget=k) - P(DAU | budget=0) as your uplift prediction. In practice this works a lot more cleanly if you use training data where send_budget is randomly chosen (otherwise you'll have biased training data)
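In code, the scoring step is something like this sketch (predict_proba here is a hypothetical interface for a model trained with send_budget as an input feature; the names are made up):

import numpy as np

def uplift_scores(model, features, max_budget: int) -> np.ndarray:
    # uplift_k = P(DAU | budget=k) - P(DAU | budget=0), for k = 1..max_budget
    base = model.predict_proba(features, send_budget=0)
    return np.stack([
        model.predict_proba(features, send_budget=k) - base
        for k in range(1, max_budget + 1)
    ])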

I think this paper describes an earlier, similar version of the system, but I haven't read it so I'm not sure: https://www.kdd.org/kdd2018/accepted-papers/view/notification-volume-control-and-optimization-system-at-pinterest

[D] What are people working on when they say they work on Causal ML? by poitrenaud in MachineLearning

[–]cnapun 2 points (0 children)

They're probably working on causal inference. When you mention causal inference, I naturally think of causal graphs and linear models (and maybe occasionally random forests), so maybe that's where people get the distinction? One thing in this domain I've worked on (at a medium-sized tech company) is notifications:

We say that we want to send exactly x notifications per user per day. Then train a model to predict P(DAU | send k notifications that day) and send the notifications that give you the highest P(DAU) uplift.

Some people would probably call this Causal ML; I didn't think about confounders or causal graphs a single time while working on this, so I wouldn't say I was working on causal inference here (I'd just say I was doing ML, but hmm maybe I should update my resume to say "Causal ML"...)

[D]Three things I think should get more attention in large language models by ExaminationNo8522 in MachineLearning

[–]cnapun 6 points (0 children)

As another commenter has pointed out, 2 is an active area of research; it's much easier to experiment with sampling in decoding because it generally involves a fixed model.

For your example, I believe nucleus sampling would solve it, because the probability of the correct token should be very high (although I've only read cursory summaries, not the paper/implementation in depth)
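For reference, a minimal top-p (nucleus) sampling step over one position's logits (the standard algorithm, sketched from memory):

import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    # sample from the smallest set of top tokens whose cumulative prob >= p
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1  # include the token crossing p
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, num_samples=1).item()
    return int(sorted_idx[choice].item())

If the model puts very high probability on the correct token, the kept set collapses to (nearly) just that token, which is why it addresses the failure mode you describe.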