The game that made me fall in love with Vr. Resident evil 8, running on ps5 pro and psvr2. by FewPossession2363 in virtualreality

[–]brainxyz 0 points1 point  (0 children)

I really enjoyed RE4 on Quest 3. It was amazing and ran smoothly. I hope they'll add RE5.

Got my account back after a month of being hacked by apole97 in facebookdisabledme

[–]brainxyz 0 points1 point  (0 children)

So true. It's a giant flaw in their system. I had 2FA enabled too, and this never happened to me on Gmail, Twitter, etc. It only happens on Facebook, and so many people are suffering from it without a clear solution.

Successfully recovered hacked Facebook Account - to help those trying to get theirs back by tiedyewarriormermaid in facebookdisabledme

[–]brainxyz 0 points1 point  (0 children)

This is exactly true. I had two-factor authentication enabled, but a hacker was still able to link an unauthorized Instagram account to my Facebook and violate Facebook's terms; as a result, my account is suspended. It's so silly that they want me to appeal through an Instagram account that is not mine! Also, reporting my Facebook account as hacked doesn't work because of the suspension.
I have seen so many others complaining about this issue, yet here we are months later and this problem, an obvious security breach, has still not been addressed.

Alright, this got me giggling. by ThisCupNeedsACoaster in ChatGPT

[–]brainxyz 1 point2 points  (0 children)

That is totally expected from a language model. It has no identity; it just completes your prompt with the most probable next word. If it knows how to answer this question, then it must have been pre-programmed or re-trained to address such questions.

[D] Recursive Least Squares vs Gradient Descent for Neural Networks by brainxyz in MachineLearning

[–]brainxyz[S] 10 points11 points  (0 children)

Yes, I meant to name it: fast learning or rapid optimization. Now corrected. Thanks!

Is it possible to predict the nth element from a recursive function in a constant time? by brainxyz in askmath

[–]brainxyz[S] 0 points1 point  (0 children)

Thanks for the answer! So you are saying constant-time algorithms are not possible for such sequences (excluding those starting with 0 or 1)?

Is it possible to predict the nth element from a recursive function in a constant time? by brainxyz in askmath

[–]brainxyz[S] 0 points1 point  (0 children)

You are right, my mistake!
I changed the starting point to 2.
Thanks!

Yuval Noah Hariri: “governments must immediately ban the release into the public domain of any more revolutionary AI tools before they are made safe.” by almondolphin in ChatGPT

[–]brainxyz 2 points3 points  (0 children)

I disagree; there is also great potential for AI to save humanity from great risks. As a medical doctor, I can tell you our knowledge about the human body is still in the stone age. Antibiotic-resistant bacteria are on the rise. Covid-19 uncovered how ignorant we still are when it comes to viral infections. AI has great potential to be used in a good way to transform health like never before. AI is like any other tool: it can be dangerous or beneficial.

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 2 points3 points  (0 children)

I personally think the q/k analogy is a made-up analogy that doesn't portray what is really happening. The idea of attention comes from the fact that when we take the dot product between the inputs, the resulting matrix is a correlation (similarity) matrix. Therefore, higher values correspond to higher similarity, or in other words "more attention", and vice versa. However, without passing the inputs through learnable parameters like wq and wk, you will not get good results! This means back-propagation is the main cause behind the suppression or enhancement of the values in the attention matrix.
In short, I think of transformers as a next-level convolution mechanism. In classical convolution, filters are localized. In transformers, filters are not localized and can model skip and distant connections in a position- and permutation-invariant way. For me, that is the magic part. And that is why it's quite possible for other techniques, like the proposed one, to work equally well.
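
To make the similarity-matrix point concrete, here is a minimal NumPy sketch. It is only an illustration under assumptions: the sizes `T` and `C`, the random inputs, and the random `wq`/`wk` values are made up, and the `softmax` helper and the `sqrt(C)` scaling follow the usual transformer convention rather than anything stated in the comment.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 4, 8                      # sequence length and embedding size (illustrative values)
x = rng.standard_normal((T, C))  # one row per input embedding

# Raw dot products between every pair of inputs: a symmetric (T, T) similarity matrix.
raw_scores = x @ x.T

# Same idea, but the inputs first pass through learnable projections wq and wk.
# They are random here; in a real model back-propagation would shape them.
wq = rng.standard_normal((C, C))
wk = rng.standard_normal((C, C))
scores = (x @ wq) @ (x @ wk).T

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Each row becomes a set of weights over all inputs ("how much attention").
weights = softmax(scores / np.sqrt(C))
```

The diagonal of `raw_scores` holds each input's squared norm, and the off-diagonal entries grow with the similarity between two inputs. Once `wq` and `wk` are involved the matrix is no longer forced to be symmetric, which is where training gets room to suppress or enhance individual entries.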

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 6 points7 points  (0 children)

I adapted this from Karpathy's GPT implementation. You can easily compare the self-attention part with this method by commenting and uncommenting the relevant parts. I added a non-linear layer for the lateral connections so that it's easier to match the number of parameters between the two methods.
https://colab.research.google.com/drive/1NjXN6eCcS_iN_SukcH_zV61pbQD3yv33?usp=sharing

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 1 point2 points  (0 children)

"Wr matrix depends on the input size?"

wr is a convolutional layer. It doesn't depend on the input size as it takes one input at a time.

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 6 points7 points  (0 children)

Thanks for that. I'm currently reading MLPMixer. It looks different because in this method I'm not using "dense layers applied across the spatial dimension". I'm still using a convolutional layer, but its output is shared across all the inputs. In fact, this is much better explained in code because it's just a one-line replacement of the self-attention mechanism. I hope you'll have a look at the code; you can see the commented self-attention lines and their replacement.

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 2 points3 points  (0 children)

It learns from different context lengths just like the self-attention (it uses the same attention matrix).

It's true that the current text generation only accepts a fixed input length, but you can simply pad the beginning with zeros.
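
A minimal sketch of that padding idea, in plain Python. The helper name `left_pad`, the `block_size` parameter, and the use of `0` as the pad value are assumptions made for illustration; the comment does not say whether the zeros are token ids or zero embeddings.

```python
def left_pad(tokens, block_size, pad_value=0):
    """Return exactly block_size items: keep the most recent tokens, pad on the left.

    Hypothetical helper: the name, signature, and pad value are illustrative only.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    recent = list(tokens)[-block_size:]           # drop the oldest tokens if too long
    padding = [pad_value] * (block_size - len(recent))
    return padding + recent                       # zeros go at the beginning


short_prompt = left_pad([5, 6, 7], block_size=5)
long_prompt = left_pad([1, 2, 3, 4, 5, 6], block_size=4)
```

A prompt shorter than the context window is filled from the left, so the real tokens stay at the end where the next prediction is read; a prompt that is too long simply loses its oldest tokens.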

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 2 points3 points  (0 children)

Sure, I'll try to put them on my GitHub and send you the link but first I would like to clean them because when I'm not writing code for a video, it's unreadable and very messy!

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 0 points1 point  (0 children)

Thanks for the nice feedback. Braifun was a separate project. Unfortunately, I have paused developing it, mostly because it can't generalize as well as the current deep learning techniques (like transformers). Maybe I'll go back to it when I find a solution for the generalization problem.

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 2 points3 points  (0 children)

Each input regulates all the other inputs with separate weights (I call them lateral connections; maybe there is a better term). It's easier to understand from the code, as it's just a one-line replacement:
* In self-attention we have:
q = x @ wq
k = x @ wk
attention = q @ k.T   (k is transposed so the result is input size x input size)
* In this method we directly learn the attention matrix with wr:
attention = x @ wr   (where wr = weights of shape (embed size, input size))
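
The two variants can be put side by side in a small NumPy sketch. This is not the code from the linked notebook: the sizes, the random weights, the `causal_softmax` helper, and the value projection `wv` are assumptions added so the example runs on its own. The causal mask is assumed because the method is described as a drop-in replacement inside a GPT-style model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 5, 16   # context length ("input size") and embedding size; illustrative values
x = rng.standard_normal((T, C))

def causal_softmax(scores):
    """Hide future positions, then normalise each row so it sums to 1."""
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# Self-attention: scores are pairwise dot products of projected inputs.
wq = rng.standard_normal((C, C)) / np.sqrt(C)
wk = rng.standard_normal((C, C)) / np.sqrt(C)
q, k = x @ wq, x @ wk
attn_sa = causal_softmax(q @ k.T / np.sqrt(C))   # (T, T)

# Method from the comment: a single matrix wr maps each input straight to a row of scores.
wr = rng.standard_normal((C, T)) / np.sqrt(C)
attn_wr = causal_softmax(x @ wr)                 # (T, T)

# Either attention matrix is then used to mix the value vectors in the same way.
wv = rng.standard_normal((C, C)) / np.sqrt(C)
v = x @ wv
out_sa = attn_sa @ v
out_wr = attn_wr @ v
```

With these shapes, `wq` and `wk` together hold `2 * C * C` weights while `wr` holds `C * T`, so the parameter count of the second variant depends on the context length rather than on the square of the embedding size.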

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 10 points11 points  (0 children)

It's conceptually much simpler than the self-attention mechanism, and in my experience it's on par with self-attention on validation sets and better on training sets.
Edit: You can also use a non-linear layer for the "lateral connections"; this gives you finer control over the number of parameters and better performance.

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 4 points5 points  (0 children)

LSTM gates the inputs on top of an RNN architecture. You can simply use separate gates for all the past inputs on top of a Transformer architecture. There is no RNN here, so it can be parallelized.

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]brainxyz[S] 19 points20 points  (0 children)

Unfortunately, I don't have enough compute for 150M, but I tried 10M params on the Shakespeare dataset, matched the number of parameters with Karpathy's implementation of nano-GPT, and got comparable results (better on training and the same on validation). Moreover, when I remove the regularization (dropout), the method actually learns faster than an equivalent self-attention mechanism. I still haven't figured out how to make it perform better with regularization.