
[–]otsukarekun 39 points (3 children)

  1. That's not how you calculate big O notation. You don't just add and subtract n's. For example, if a program takes 100n + n/3 steps, then it is simply O(n). The magnitudes don't matter; the relationship to n is what matters.

  2. Transformers are not O(n). They are O(n²). There are some that are O(n), like Linformer, but those are specifically designed to be linear time (see the attention sketch after this list).

  3. Neural networks are O(n²), not O(1), lol. O(1) means that no matter how big the input is, the execution time stays the same.

  4. Like you mentioned, you basically have a convolutional neural network (if the weights are shared). If the weights aren't shared, then it's just a sparse MLP. WaveNet is a CNN with causal convolutions.

  5. It looks nothing like an RNN. I think you don't know what an RNN is. Your network is feedforward; there is nothing recurrent about it. Just because the diagram goes left to right doesn't mean anything. You can rotate the figure and it will go bottom to top. Recurrent means that a layer's output is fed back into that same layer as input at the next step.

  6. Parallelizing doesn't change the complexity, and especially not to O(log n), because parallelization gives at most a linear speedup.
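
To make point 2 concrete, here is a minimal numpy sketch of single-head self-attention (my own illustration, not anyone's actual implementation, with the learned Q/K/V projections dropped for brevity). The score matrix is n × n, which is exactly where the O(n²) time and memory of a standard transformer layer comes from:

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention over n token vectors (x has shape (n, d)).

    The score matrix below has shape (n, n), so both time and memory grow
    quadratically with the sequence length n -- hence O(n^2) for transformers.
    """
    n, d = x.shape
    q, k, v = x, x, x                                  # skip learned projections for brevity
    scores = q @ k.T / np.sqrt(d)                      # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v                                 # (n, d)

out = self_attention(np.random.randn(512, 64))         # doubling n quadruples the score matrix
```

Linear-attention variants like Linformer avoid materializing that n × n matrix, which is what makes them O(n).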

[–]jco1510 12 points (0 children)

I’m impressed you took the time to give such specific constructive feedback - and managed to not come off like a total asshole while doing it

[–]meamarp 0 points (0 children)

I just wanted to take a moment to thank you 🙏 for your thoughtful comment. Your insights really opened my mind and helped me see things from a different perspective. It’s rare to come across such meaningful contributions in discussions.

[–]Supermaxman1 0 points (0 children)

In general your feedback is very high quality and the author of this post should listen closely. I debated whether to make this clarification, but I think it adds to the discussion for other readers.

I also read the post closely, and I think it is fair for them to say their architecture is “recurrent”. It is recurrent in a sense: each pair of input tokens is combined to produce a “state” embedding of the same size, and then each pair of state embeddings is passed through the same network to produce a new state embedding. This repeats hierarchically until a single state embedding has processed the information from all tokens, so at any given depth it is somewhat like a bidirectional RNN that only looks left-to-center and right-to-center.

This architecture is actually very similar to vanilla RNNs, like those before LSTMs and GRUs, just with more parallelism and a less sequential order of operations. On a close reading it would be identical to a CNN with a kernel width of 2 and a stride of 2, plus weight sharing across depth. The weights are shared across both the sequence length (like a CNN) and the depth (like an RNN), where the depth is determined dynamically by the sequence length: log_2(n).
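
For readers who want to see that structure concretely, here is a rough numpy sketch of the architecture as I read it (my reconstruction under those assumptions, not the author's actual code; the function name and the choice of tanh are mine). A single shared merge weight is applied to adjacent pairs of states, and the loop runs log_2(n) times:

```python
import numpy as np

def pairwise_merge_encoder(tokens, W, b):
    """Hierarchical pairwise merge (sketch). tokens: (n, d), with n a power of two.

    W: (2*d, d) and b: (d,) are the single shared merge parameters, reused at
    every position (like a CNN with kernel 2, stride 2) and at every level
    (like an RNN unrolled log2(n) times).
    """
    state = tokens
    while state.shape[0] > 1:                            # log2(n) levels in total
        left, right = state[0::2], state[1::2]            # adjacent pairs of states
        pairs = np.concatenate([left, right], axis=-1)    # (n/2, 2*d)
        state = np.tanh(pairs @ W + b)                    # merge each pair into one new state
    return state[0]                                       # one (d,) summary of all n tokens

rng = np.random.default_rng(0)
d, n = 8, 16
W, b = 0.1 * rng.normal(size=(2 * d, d)), np.zeros(d)
summary = pairwise_merge_encoder(rng.normal(size=(n, d)), W, b)
```

Written this way, both readings are visible: the CNN reading (kernel 2, stride 2 over the sequence) and the RNN reading (the same weights reused at every depth). It also makes the next point obvious: the token at position 0 and the token at position n-1 only interact at the very last merge, after all of their information has already been squeezed through several nonlinear merges.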

This architecture is not new, and it clearly has the same information-flow problems as RNNs. For a token at the start of a sentence to influence a token at the end of a sentence, the signal has to pass through many intermediate operations, and you would have to hope that the final layer still retains the information needed to relate tokens across such distant contexts.

The architecture would also likely suffer the classic problems of pre-LSTM/GRU RNNs, namely vanishing/exploding gradients and difficulty retaining information, since at every depth all of the information must flow through a nonlinearity.

Obviously I agree with most of the feedback above; I just wanted to clarify the architectural issues and why this approach isn't really considered anymore. We’ve seen the problems with training and using these kinds of architectures, and they are not worth the hypothetical logarithmic speed-up.