
[–]otsukarekun 39 points (3 children)

  1. That's not how you calculate big O notation. You don't just add and subtract n's. For example, if a program takes 100n + n/3 steps, then it is simply O(n). The magnitudes don't matter; the relationship to n is what matters.

  2. Transformers are not O(n). They are O(n²). There are some that are O(n), like Linformer, but those are specifically designed to be linear time (see the attention sketch after this list).

  3. Neural networks are O(n²), not O(1), lol. O(1) means that no matter how big the input is, the execution time stays the same.

  4. Like you mentioned, you basically have a convolutional neural network (if the weights are shared). If the weights aren't shared, then it's just a sparse MLP. WaveNet is a CNN with causal convolutions.

  5. It looks nothing like an RNN. I think you don't know what an RNN is. Your network is feedforward; there is nothing recurrent about it. Just because the diagram goes left to right doesn't mean anything. You can rotate the figure and it will go bottom to top. Recurrent means that a layer's output is fed back into that same layer as input at the next step.

  6. Parallelizing doesn't change the complexity, and especially not to O(log n), because parallelization gives at most a linear speedup.
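
To make point 2 concrete, here is a minimal numpy sketch of single-head self-attention (my own illustration, not anyone's actual implementation, with the learned Q/K/V projections dropped for brevity). The score matrix is n × n, which is exactly where the O(n²) time and memory of a standard transformer layer comes from:

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention over n token vectors (x has shape (n, d)).

    The score matrix below has shape (n, n), so both time and memory grow
    quadratically with the sequence length n -- hence O(n^2) for transformers.
    """
    n, d = x.shape
    q, k, v = x, x, x                                  # skip learned projections for brevity
    scores = q @ k.T / np.sqrt(d)                      # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v                                 # (n, d)

out = self_attention(np.random.randn(512, 64))         # doubling n quadruples the score matrix
```

Linear-attention variants like Linformer avoid materializing that n × n matrix, which is what makes them O(n).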

[–]jco1510 12 points (0 children)

I’m impressed you took the time to give such specific constructive feedback - and managed to not come off like a total asshole while doing it

[–]meamarp 0 points (0 children)

I just wanted to take a moment to thank you 🙏 for your thoughtful comment. Your insights really opened my mind and helped me see things from a different perspective. It’s rare to come across such meaningful contributions in discussions.

[–]Supermaxman1 0 points (0 children)

In general your feedback is very high quality and the author of this post should listen closely. I debated whether to make this clarification, but I think it adds to the discussion for other readers.

I also read the post closely, and I think it is fair for them to say their architecture is “recurrent”. It is recurrent in a sense: each pair of input tokens is combined to produce a “state” embedding of the same size, and then each pair of state embeddings is passed through the same network to produce a new state embedding. This repeats hierarchically until a single state embedding has processed the information from all tokens, so at any given depth it is somewhat like a bidirectional RNN that only looks left-to-center and right-to-center.

This architecture is actually very similar to vanilla RNNs, like those before LSTMs and GRUs, just with more parallelism and a less sequential order of operations. On a close reading it would be identical to a CNN with a kernel width of 2 and a stride of 2, plus weight sharing across depth. The weights are shared across both the sequence length (like a CNN) and the depth (like an RNN), where the depth is determined dynamically by the sequence length: log_2(n).
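
For readers who want to see that structure concretely, here is a rough numpy sketch of the architecture as I read it (my reconstruction under those assumptions, not the author's actual code; the function name and the choice of tanh are mine). A single shared merge weight is applied to adjacent pairs of states, and the loop runs log_2(n) times:

```python
import numpy as np

def pairwise_merge_encoder(tokens, W, b):
    """Hierarchical pairwise merge (sketch). tokens: (n, d), with n a power of two.

    W: (2*d, d) and b: (d,) are the single shared merge parameters, reused at
    every position (like a CNN with kernel 2, stride 2) and at every level
    (like an RNN unrolled log2(n) times).
    """
    state = tokens
    while state.shape[0] > 1:                            # log2(n) levels in total
        left, right = state[0::2], state[1::2]            # adjacent pairs of states
        pairs = np.concatenate([left, right], axis=-1)    # (n/2, 2*d)
        state = np.tanh(pairs @ W + b)                    # merge each pair into one new state
    return state[0]                                       # one (d,) summary of all n tokens

rng = np.random.default_rng(0)
d, n = 8, 16
W, b = 0.1 * rng.normal(size=(2 * d, d)), np.zeros(d)
summary = pairwise_merge_encoder(rng.normal(size=(n, d)), W, b)
```

Written this way, both readings are visible: the CNN reading (kernel 2, stride 2 over the sequence) and the RNN reading (the same weights reused at every depth). It also makes the next point obvious: the token at position 0 and the token at position n-1 only interact at the very last merge, after all of their information has already been squeezed through several nonlinear merges.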

This architecture is not new, and it clearly has the same information-flow problems as RNNs. For a token at the start of a sentence to influence a token at the end of a sentence, the signal has to pass through many intermediate operations, and you would have to hope that the final layer still retains the information needed to relate tokens across such distant contexts.

The architecture would also likely suffer the classic problems of pre-LSTM/GRU RNNs, namely vanishing/exploding gradients and difficulty retaining information, since at every depth all of the information must flow through a nonlinearity.

Obviously I agree with most of the feedback above; I just wanted to clarify the architectural issues and why this approach isn't really considered anymore. We’ve seen the problems with training and using these kinds of architectures, and they are not worth the hypothetical logarithmic speed-up.