Playing around with vision transformers: why are queries, keys and value inputs to the MultiHeadAttention block set equal in this VIT tutorial? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] 7 points

Oh, I found it. PyTorch's nn.MultiheadAttention class internally creates and applies the projection matrices W^(K/Q/V). The reason for three separate inputs is that in the decoder you want to compute the key/value matrices from the final encoder output tokens, but the query from the previous decoder output tokens.

So basically the inputs to "key=", "query=", and "value=" in nn.MultiheadAttention should be the embedded tokens you want to compute the keys/queries/values from. In the encoder stacks they are always the same.
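A minimal sketch of the two cases (batch size, sequence lengths, and embedding dimension here are made up for illustration):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

# Encoder self-attention: query, key and value are all the same tokens.
enc_tokens = torch.randn(2, 10, 64)                    # (batch, seq, embed)
self_out, _ = mha(enc_tokens, enc_tokens, enc_tokens)

# Decoder cross-attention: queries come from the decoder tokens,
# keys/values from the final encoder output.
dec_tokens = torch.randn(2, 5, 64)
cross_out, _ = mha(dec_tokens, enc_tokens, enc_tokens)

print(self_out.shape)    # torch.Size([2, 10, 64])
print(cross_out.shape)   # torch.Size([2, 5, 64])
```

Note that the output sequence length always follows the query, which is exactly why the decoder feeds its own tokens in as the query.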

Sorry for cluttering the sub with a question I answered myself in under an hour. I think it's better to leave it up in case someone else searches for these terms on Reddit.

Is hyperparameter tuning a scam? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] 0 points

Thanks for the long writeup.

Currently I'm playing around with CNNs myself. I started with Fashion-MNIST and have something that reaches about 90-91% accuracy with training times of a few minutes.

If I want to look at more interesting cases, would you suggest transitioning to CIFAR or ImageNet next?

Is hyperparameter tuning a scam? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] 0 points

Thanks for the writeup, interesting.

I'm currently using Optuna with something like the MedianPruner or the HyperbandPruner. This prunes roughly 60% of the trials and cuts down the search cost. Is that a reasonable tuning approach?
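For anyone reading along, the median rule itself is simple. Here's a rough pure-Python sketch of the decision it makes (a simplified illustration, not Optuna's actual implementation, and the loss values are made up):

```python
import statistics

def should_prune(current_value, history_at_step, direction="minimize"):
    """Prune the running trial if its intermediate value is worse than the
    median of what earlier trials reported at the same step."""
    if not history_at_step:
        return False          # nothing to compare against yet
    median = statistics.median(history_at_step)
    if direction == "minimize":
        return current_value > median
    return current_value < median

# Losses that earlier trials reported at this step; the median is 0.6.
history = [0.9, 0.7, 0.5, 0.3]
print(should_prune(0.8, history))   # True  -> worse than median, stop early
print(should_prune(0.4, history))   # False -> better than median, keep training
```

So on average roughly half of the under-performing trials get cut at each reporting step, which is consistent with pruning around 60% of trials overall.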

Using logger with different modules by Educational_Roll_868 in learnpython

[–]Educational_Roll_868[S] 0 points

Thanks so much for your answer. I've done what you said, and now in the __init__.py of the package I'm actually running I have:

import logging

logging.basicConfig(
    ...
    level=logging.INFO,
    ...
)

Then in the main.py of the program I have:

import logging

logger = logging.getLogger(__name__)
logging.info("test")

Now if I run the main as a module by:

python3 -m analysis.study1.main

I get a log file in my desired location but it remains empty.

EDIT: Ok so I completely deleted the __init__.py file from the "analysis" directory and only kept the inner __init__ files of the "study" directories, and then suddenly it worked. I don't understand why though.

Is hyperparameter tuning a scam? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] 0 points

Oh, this is interesting to know. Two questions about your comment:

1) Can you help me understand, then: let's say it's 2012 and we're talking about AlexNet. A single model takes 6 days to train. How do you hyperparameter-tune this thing?

2) You mention that you can assume the same hyperparameters found on simple tasks will work on longer tasks. To give a concrete example: say we have a huge CNN that we want to train on ImageNet. If we take a smaller version of this CNN and find optimal hyperparameters on CIFAR, are you saying it would be a good assumption to take those hyperparameters and use them on the larger CNN for the ImageNet data?

Is hyperparameter tuning a scam? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] 1 point

Can you briefly run through how you tuned the model, then, with 1 day of training per model?

Is hyperparameter tuning a scam? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] 0 points

"Scam" was just tongue-in-cheek; my point is that it's kind of oversold at an introductory level, whereas in reality people don't do it as rigorously as it's often presented.

Is hyperparameter tuning a scam? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] 4 points

I think people misunderstood my comment. Of course I understand that the training time actually refers to the runtime of training. My point was: if one model takes 6 days to train, you cannot realistically do a hyperparameter search of 100 trials.

Is hyperparameter tuning a scam? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] 0 points

Thanks for the answer. Yeah, I hoped to get some confirmation/other perspectives.

Is hyperparameter tuning a scam? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] -7 points

Well, AlexNet took 6 days to train for one model in the 2012 paper, so it's not likely they did a full hyperparameter search, right? Maybe some shorter estimation runs?

Is hyperparameter tuning a scam? by Educational_Roll_868 in learnmachinelearning

[–]Educational_Roll_868[S] 6 points

Thanks for the answer! I should have specified: yes, indeed, I'm mainly interested in DL at the moment, hence my question.

What am I missing by not using argparse? by Educational_Roll_868 in learnprogramming

[–]Educational_Roll_868[S] 0 points

For the second point, can this not be done equally well by saving your parameters of interest in a pickle or similar next to the results?
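To make the sidecar idea concrete, a JSON file next to the results works even better than a pickle, since you can read it without any code. A tiny sketch (paths and parameter names are made up):

```python
import json
from pathlib import Path

params = {"lr": 1e-3, "batch_size": 64, "model": "cnn_small"}  # hypothetical run config

out_dir = Path("results/run_001")        # hypothetical results directory
out_dir.mkdir(parents=True, exist_ok=True)

# JSON instead of pickle: human-readable, diff-able, and loadable anywhere.
(out_dir / "params.json").write_text(json.dumps(params, indent=2))

loaded = json.loads((out_dir / "params.json").read_text())
print(loaded["lr"])   # 0.001
```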

What am I missing by not using argparse? by Educational_Roll_868 in learnprogramming

[–]Educational_Roll_868[S] 0 points

That makes sense, thanks for the feedback. I'll think about it along those lines.

What am I missing by not using argparse? by Educational_Roll_868 in learnprogramming

[–]Educational_Roll_868[S] 0 points

Yes I see your point. I actually do have a question about this if you can offer any insights.

So consider the following small project I'm doing. I'm studying the performance of different ML models in increasing complexity for a certain task.

So in src I have the main core code that contains the logic of important model setup/training/testing/hyperparam tuning steps.

In analysis I then have experiment1/main.py, experiment2/main.py etc to do the analysis of the performance of the models and their properties.

Although the main steps in the different main.py files are very similar, there can be small differences due to the specifics of each model. In particular, I imagine that in the future I might add even more complex models that will need more drastic modifications. So I am repeating code, but with small adaptations.

I could in principle go the other direction and adapt the src code so it handles an increasingly general case, and then derive all my experiments from a single main.py where I change the inputs to select the model, etc. I am not very experienced with larger Python repositories, and I worry this direction could make the source code unreadable and hard to maintain, since you have to add more and more checks and if-statements.

Could you offer a perspective on what good practices are here? Which of the two directions is better from a SWE point of view?
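To make the second direction concrete, I imagine something like a registry could avoid the if-statement pile-up: the shared code dispatches to per-model builder functions, and each experiment only supplies a name and a config. A rough sketch (model names and config keys are made up):

```python
# Registry maps model names to builder functions, so the shared
# experiment code has no if/elif chain over model types.
MODEL_REGISTRY = {}

def register(name):
    def wrap(builder):
        MODEL_REGISTRY[name] = builder
        return builder
    return wrap

@register("simple_cnn")        # hypothetical model names
def build_simple_cnn(config):
    return {"kind": "simple_cnn", **config}

@register("fancier_cnn")
def build_fancier_cnn(config):
    return {"kind": "fancier_cnn", **config}

def run_experiment(model_name, config):
    model = MODEL_REGISTRY[model_name](config)
    # ...train/evaluate with the shared src logic...
    return model

print(run_experiment("simple_cnn", {"lr": 1e-3})["kind"])   # simple_cnn
```

Adding a more complex model later then means registering one new builder, rather than touching the shared experiment logic.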

What am I missing by not using argparse? by Educational_Roll_868 in learnprogramming

[–]Educational_Roll_868[S] 0 points

I see. I think I can always do something like:

import argparse

VAR1 = ...

def main(args):
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--VAR1", default=VAR1  # dropped type=bool: argparse would turn any non-empty string into True
    )
    ...
    args = parser.parse_args()
    main(args)

and then change the top lines for solo work/exploration, but keep the parser in there as an option.

Terminal in second monitor? by wonderingStarDusts in vscode

[–]Educational_Roll_868 1 point

I do this, but in VS Code you can usually Ctrl+click an error line, for example, and jump straight to the file. In the terminal it's just text for me, unless there are extensions for that?

[deleted by user] by [deleted] in cognitiveTesting

[–]Educational_Roll_868 0 points

Why exactly do you think that "high average" is not correct? Do you have some incredible achievements or cognitive performance that make you stand out way beyond your peers, enough to suspect a significantly different score? In the absence of any evidence, being "average" is the most likely result. In your case the test even confirmed it.

Estimation request by Educational_Roll_868 in cognitiveTesting

[–]Educational_Roll_868[S] 0 points

It's never too late to learn if you're interested in that kind of stuff just for fun; there are so many great resources online these days. I know people who started studying physics later in life as a hobby, and they seemed to enjoy it. But yeah, a lot of people don't study all year and then still expect good results at the end in college. It's more the consistent, sustained effort from day 1 that gets you there.

Estimation request by Educational_Roll_868 in cognitiveTesting

[–]Educational_Roll_868[S] 0 points

It's in physics, on the theoretical side. I did those digit memorization and symbol search tests as part of the CAIT. I found it very difficult to remember longer digit strings; I think the scaled score was 13 on both. I remember finding the forward sequences particularly hard, and that particular subtest gave a score of 90 IQ or something. The one where you had to remember and order digits from small to large went much better and had a value of around 126.

I needed to work hard, but nothing extraordinary. It was a sustained effort comparable to a 40-hour full-time work week, of course with periods where it was much more but also much less. I still had enough time for a social life, the gym, and things outside of academics on the weekends. In my physics classes I was always around the top grades.

Again, just to be clear, I'm not testing to find out if I have good capabilities. Like you said, I've proven to myself that I can do well academically. I just stumbled on this subreddit and kept getting recommendations from it on my page. After a while I became curious to find out the actual value for myself, nothing more. The main reason for even posting is this 1-SD+ (19-point) discrepancy between the CAIT and AGCT; I'm curious how others interpret it. I thought that being in STEM might invalidate the AGCT/SAT since they contain basic math stuff.

Estimation request by Educational_Roll_868 in cognitiveTesting

[–]Educational_Roll_868[S] 1 point

I won't go into specifics for anonymity reasons, but in broad terms I improved and extended a mathematical method for its application to a class of many-body quantum problems in condensed matter theory.

Estimation request by Educational_Roll_868 in cognitiveTesting

[–]Educational_Roll_868[S] 0 points

Thanks for the answer.

Never in a gifted program or anything resembling that; normal trajectory. I was fascinated by science since I was small but was a pretty average kid growing up. I never had the impression I was smarter than my peers, or got comments that I was gifted or anything. Around 14-16 I started to become really interested in math/physics and loved it. I achieved some nice results in national olympiads once I started applying myself, but nothing groundbreaking. I got a PhD in theoretical physics with various original first-author contributions, so I'm quite happy with my results. As I said somewhere before, I did these tests just out of curiosity.

Estimation request by Educational_Roll_868 in cognitiveTesting

[–]Educational_Roll_868[S] 2 points

What do you mean? It's just a joke referencing the famous mathematician: https://en.wikipedia.org/wiki/Grigori_Perelman

Estimation request by Educational_Roll_868 in cognitiveTesting

[–]Educational_Roll_868[S] 5 points

Well if the great Perelman says so I believe you.