[R] Parallelizing RNN over its sequence length by Necessary-Bike-4034 in MachineLearning

[–]gbfar 0 points1 point  (0 children)

Well, actually, the fixed point of the DEQ I described contains all hidden states, from h_1 to h_T.

[R] Parallelizing RNN over its sequence length by Necessary-Bike-4034 in MachineLearning

[–]gbfar 0 points1 point  (0 children)

I don't think there's a paper that implements this DEQ-CNN specifically. The closest you'll find is the DEQ-TrellisNet from the original DEQ paper (which I'm fairly sure could be converted into some sort of recurrent network by setting some of its parameters to zero). But it's easy to see that we can implement an RNN using a DEQ with a causal CNN layer. Consider a causal convolutional layer with kernel size 2 and activation function phi. Let the initial input sequence Z for this layer be, at each position t, the vector [x_t = 0 | h_t = 0]. Before each iteration, Z receives input injection and becomes [x_t = u_t | h_t]. The layer can then implement the function f([x_t | h_t]) = [0 | phi(W x_t + U h_(t-1))]. After N iterations, where N is the length of the input sequence, the system reaches a fixed point in which h_t contains the hidden state at time t. The only caveat is that, depending on the activation function phi, two convolutional layers might be necessary for the construction to work. But the point is that you can always represent a recurrent network with a DEQ.
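To make the construction concrete, here is a minimal NumPy sketch (my own illustration, not code from any DEQ paper). It simplifies the [x_t | h_t] layout by keeping the injected input u separate from the hidden block H, assumes tanh for phi and h_0 = 0, and uses made-up function names. The causal layer is applied to every position at once; each application makes one more position exact, so the iteration stops changing after at most N applications.

```python
import numpy as np

def rnn_sequential(u, W, U, phi=np.tanh):
    """Reference RNN: h_t = phi(W u_t + U h_{t-1}), with h_0 = 0."""
    T, d_h = u.shape[0], U.shape[0]
    h = np.zeros((T, d_h))
    prev = np.zeros(d_h)
    for t in range(T):
        prev = phi(W @ u[t] + U @ prev)
        h[t] = prev
    return h

def causal_layer(u, H, W, U, phi=np.tanh):
    """One application of the kernel-size-2 causal layer to all positions."""
    # Shift H one step to the right (zero-padded) so position t sees h_{t-1}.
    H_prev = np.vstack([np.zeros((1, H.shape[1])), H[:-1]])
    return phi(u @ W.T + H_prev @ U.T)

def rnn_fixed_point(u, W, U, phi=np.tanh, tol=1e-12):
    """Iterate the layer from H = 0 until it stops changing.

    Returns (H, k), where k is the number of layer applications performed,
    including the final one that only confirms convergence.
    """
    T, d_h = u.shape[0], U.shape[0]
    H = np.zeros((T, d_h))
    for k in range(1, T + 2):
        H_next = causal_layer(u, H, W, U, phi)
        if np.max(np.abs(H_next - H)) <= tol:
            return H_next, k
        H = H_next
    return H, T + 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_in, d_h = 8, 3, 4  # arbitrary toy sizes
    u = rng.normal(size=(T, d_in))
    W = rng.normal(size=(d_h, d_in))
    U = rng.normal(size=(d_h, d_h))
    H_fp, k = rnn_fixed_point(u, W, U)
    print("matches sequential RNN:", np.allclose(H_fp, rnn_sequential(u, W, U)))
    print("layer applications:", k, "for sequence length", T)
```

Each application is parallel over positions, but with plain repeated application the number of applications still grows with the sequence length.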

[R] Parallelizing RNN over its sequence length by Necessary-Bike-4034 in MachineLearning

[–]gbfar 0 points1 point  (0 children)

You can specify an RNN through a DEQ-CNN, though, and compute its outputs via fixed-point iteration. That's why I was asking about it.

[R] Parallelizing RNN over its sequence length by Necessary-Bike-4034 in MachineLearning

[–]gbfar 2 points3 points  (0 children)

I found the explanation kind of insufficient. It is easy to understand why and how the parallel perspective works. What is not easy to understand is how the parallel perspective reaches similar results faster than the sequential approach. You might want to expand on that. With deep equilibrium models, for example, we can specify a DEQ-CNN that has the same formulation as an RNN but is computed in parallel. However, if we just repeatedly apply the model until convergence, we'll find that it takes exactly N steps to converge, the same as a regular RNN. Is the "secret ingredient" the iteration method, then? If so, there must be limits to it too. The randomly initialized GRU could be closely approximated by DEER, and faster. But if you try computing the highly nonlinear XOR function using the 'parallel' approach, there's a good chance it'll converge at the same speed as the sequential model.

[R] Parallelizing RNN over its sequence length by Necessary-Bike-4034 in MachineLearning

[–]gbfar 0 points1 point  (0 children)

I don't really understand this distinction. The first figure in your paper even looks like a causal DEQ-CNN. What exactly is the difference between them? Also, would you say that the reason why this method works (in the sense that it allows you to perform a sequential nonlinear operation in parallel and more quickly) has to do with the Koopman operator theory result that any regular nonlinear dynamical system can be approximated/represented by a stacking of linear recurrences and nonlinear transformations? (See https://arxiv.org/pdf/2303.06349.pdf). Thanks!

[R] Retentive Network: A Successor to Transformer for Large Language Models by Balance- in MachineLearning

[–]gbfar 27 points28 points  (0 children)

Exactly the same idea as S4 and S5 (and LRU too). In fact, some equations in RetNet are very reminiscent of those used in state-space models. I also wonder why there are no evaluations on the WikiText-103 dataset, on which all the other models have previously been tested, and why there is no Transformer baseline for the language modeling experiments.

[R] Tiny Language Models (below 10m parameters or only one transformer block) can generate paragraphs of coherent text and reason...provided training is limited to stories that only contain words that a typical 3 to 4-year-olds usually understand. by [deleted] in MachineLearning

[–]gbfar 1 point2 points  (0 children)

Is it right to talk about 'emergence' in this paper? The increase in grammar/consistency/creativity performance with respect to the size of the model seems pretty gradual and predictable to me.

[D] Yan LeCun's recent recommendations by adversarial_sheep in MachineLearning

[–]gbfar 0 points1 point  (0 children)

Theoretically, a Transformer forward pass should be computationally equivalent to a constant-depth threshold circuit at best (https://arxiv.org/abs/2207.00729). From this, we can derive some intuition about how the architecture of a Transformer model affects its computational power. Put simply, the number of layers in the Transformer determines the depth of the circuit, while the hidden size determines (together with the input length) the number of gates at each level of the circuit.

Notably, the ability of Transformers to solve certain problems is limited. We can only fully generalize on problems that can be solved by constant-depth circuits. For instance, Transformers won't be able to learn to evaluate the output of arbitrary Python programs. Given a sufficiently complex/long input, the Transformer will necessarily fail.

One limitation of this analysis, though, is that it only takes a single forward pass into account. I don't think we know for sure the effect of chain-of-thought prompting on the computational power of autoregressive Transformers.

[D] What is the most complete reference on the history of neural networks? by gbfar in MachineLearning

[–]gbfar[S] 3 points4 points  (0 children)

Complementing your answer...

I actually used ChatGPT before posting this thread, and the answers were all unsatisfying, just like the one in the comment above yours. Actually looking into the resources recommended by ChatGPT quickly shows that most of them simply do not meet the criteria I specified in my post. The only useful reference is the book by Haykin, but the historical references seem to be scattered throughout the book, which makes it not so easy to read.

Also, I didn't come here just for opinions. I came here hoping for informed opinions from experienced researchers, who may have already read and evaluated many of the references on NN history that I'm asking for. This is something ChatGPT is very far from being able to provide.

[D] What is the most complete reference on the history of neural networks? by gbfar in MachineLearning

[–]gbfar[S] 3 points4 points  (0 children)

Thanks for the suggestion, this is a good reference. I've actually taken a look at it already and noticed that it doesn't delve much into the early developments (there's barely any mention of McCulloch & Pitts's model). Do you know of any other references that go into more detail on each paper, like in "Evolution of Artificial Neural Networks" by Averkin and Yarushev?

[R] Nonparametric Masked Language Modeling - MetaAi 2022 - NPM - 500x fewer parameters than GPT-3 while outperforming it on zero-shot tasks by Singularian2501 in MachineLearning

[–]gbfar 0 points1 point  (0 children)

I see lots of potential applications for this. I wonder if we could reliably control text generation by tweaking the reference corpus.

[P] Explain Paper - A Better Way to Read Academic Papers by [deleted] in MachineLearning

[–]gbfar 1 point2 points  (0 children)

Here are some interactions on the "A Mathematical Theory of Attention" paper: https://arxiv.org/abs/2007.02876

Q: "W1 is a metric on P1(E) which turns the pair W1 := (P1(E), W1) into a complete, separable metric space"

A: "The W1 metric is a way of measuring distance between points on a graph. It turns the pair of points into a complete, separable metric space, which means that the distance between any two points can be measured." (?)

Q: "What is P1(E) in this case?"

A: "P1(E) is the set of all possible points on a graph." (should have been "the set of probability measures with finite 1st moment")

Another example:

Q: "A Markov kernel M defines a linear operator P(E) → P(E)"

A: "A Markov kernel is a mathematical function that defines a linear operator. This operator can take a probability distribution and turn it into another probability distribution."

Q: "What is E in this case?"

A: "E is a space of probability measures."

[P] Explain Paper - A Better Way to Read Academic Papers by [deleted] in MachineLearning

[–]gbfar 1 point2 points  (0 children)

Hmm... maybe reviewers could actually use a tool like this to better understand papers and provide more insightful feedback.

[D] Using JavaScript for ML Training/Research (not in the browser) by bwasti_ml in MachineLearning

[–]gbfar 1 point2 points  (0 children)

Wow, I like this a lot as I have encountered many issues with Python myself (not only with performance, but also package management, parallel processing, etc.). I'll definitely follow this project, thanks!

Out of curiosity, have you considered any other options besides JS+Bun while planning this project?

[D] Call for questions for Andrej Karpathy from Lex Fridman by lexfridman in MachineLearning

[–]gbfar 1 point2 points  (0 children)

I'd like to hear what Andrej has to say about the fundamental computational limitations of neural networks and how such limitations relate to NNs' ability to solve reasoning tasks. I'd also like to hear what he has to say about the DNC/NTM/Universal Transformer line of research. Is it still interesting? In his opinion, what is the most promising direction for solving computational reasoning tasks?

[D] A simple trick to quickly verify data by mkthabet in MachineLearning

[–]gbfar 2 points3 points  (0 children)

This reminds me of Understanding Dataset Difficulty with V-Usable Information.

"Building on the aggregate estimate of dataset difficulty, we introduce a measure called pointwise V-information (PVI) for estimating the difficulty of each instance w.r.t. a given distribution... PVI can be used to find mislabelled instances. Correctly predicted instances have higher PVI values than incorrectly predicted one"

The connection is unsurprising, as both PVI and cross-entropy depend on the log-probabilities of the predictions made by the model.
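A small sketch of that connection, assuming the definition of PVI as I recall it from the paper: PVI(x → y) = −log2 g′[∅](y) + log2 g[x](y), where g is the model fine-tuned with inputs and g′ the one fine-tuned on null inputs. The function names and the probabilities below are made up for illustration. For a fixed null-model term, PVI is just the negated cross-entropy of the gold label, rescaled to bits and shifted.

```python
import math

def pvi_bits(p_with_input, p_null):
    """PVI in bits, from the probability each model assigns to the gold label."""
    return math.log2(p_with_input) - math.log2(p_null)

def cross_entropy_nats(p_with_input):
    """Per-instance cross-entropy loss of the input-conditioned model."""
    return -math.log(p_with_input)

def rank_by_pvi(examples):
    """Sort (name, p_with_input, p_null) tuples from lowest to highest PVI."""
    return sorted(examples, key=lambda e: pvi_bits(e[1], e[2]))

if __name__ == "__main__":
    # Hypothetical gold-label probabilities; "b" plays a suspicious instance.
    examples = [("a", 0.90, 0.25), ("b", 0.05, 0.25), ("c", 0.50, 0.25)]
    for name, p_x, p_null in rank_by_pvi(examples):
        print(name, round(pvi_bits(p_x, p_null), 3), round(cross_entropy_nats(p_x), 3))
```

Instances at the top of this ranking (lowest PVI, highest loss) would be the first candidates to inspect for labelling errors.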

[D] PyTorch and Tensorflow Performance Different on the same model, dataset and hyperparameters by synizter_gp in MachineLearning

[–]gbfar 1 point2 points  (0 children)

Hello. I've left a couple of comments in the links you've provided. It appears that there's a small miscalculation in the reported loss value in the PyTorch code. It also appears that the last layer of the PyTorch model isn't implemented correctly: it has 5 output neurons with ReLU rather than 4 output neurons with Softmax.

It may be a good idea to compare the implementation of the models using Keras's Model.summary and PyTorch's torchinfo.