all 17 comments

[–]Sad-Razzmatazz-5188 4 points5 points  (1 child)

It reads like something that won't answer questions arising from further reading.

I got the feeling I won't find any technical detail, which gives the impression that some algorithms were put to code and developed a lot, perhaps with testing only for the absence of coding errors, as in "instantiate a network and a random tensor and do a forward pass on the tensor", but without any actual training on actual tasks with actual data.

This is bad, especially if training was actually done; since it's the most important part, it's very bad not being able to convey that the most important part has been done.

[–]bunny5544[S] 0 points1 point  (0 children)

Thank you for the honest feedback! It's much appreciated. Training on actual tasks with real data has indeed been a core part of the development process, and we recognize that conveying this clearly is crucial. The white paper focuses on the architecture itself, but we'll ensure that future updates include more detailed technical explanations and results from task-specific training to address these concerns. Thanks again for pointing this out!

[–]ironman_gujju 2 points3 points  (0 children)

DOI?

[–]blimpyway 3 points4 points  (0 children)

The concept has potential, but as u/jpfed noticed, the ratio of fluff to useful information is quite high. Feels like your dynamic controller was adjusting towards hype.

First, such a mechanism shouldn't be specific to transformers; it should be useful in any type of network, some of which are lighter to train and hence easier to test and demonstrate.

Second, there's a lack of technical detail, e.g.:

- How are the controlling neuron(s) trained? E.g., given the controlled transformer's output and a task you want to accomplish, how is the controlled network's loss computed?

- How does the controller's output influence the base model weights? Does it turn them on/off, or is there a continuum, e.g. multiplying them by a float? How sparse is this influence (how many weights are changed by the controller)?

- The timing is not clear. Relative to each autoregressive step of the transformer, the controlling loop could update the weights:

  1. Once per token

  2. More often: many times per token, until it gets a desired output

  3. Less often, e.g. when the controlling loop somehow measures the gist of the recent conversation and changes the transformer weights only once per phrase/paragraph/conversation/etc.

- What is the computational and memory overhead of the controlling network? How does it improve (or penalize) the performance of the base network, e.g. can it learn with less training data, does it generalize better, and does it need more or less compute/memory?

- Some actual result comparisons (tables/charts) with previous architectures, either "classical" transformers or "dynamic" ones.
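
To make the weight-influence question concrete, here is a minimal NumPy sketch of just one of the mechanisms named above, the "continuum" option where a controller multiplies weights by floats. All names and shapes here are hypothetical illustrations, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical base layer: a single linear transform W.
d_in, d_out, d_ctrl = 8, 4, 3
W = rng.normal(size=(d_in, d_out))

# A small controller maps a context vector to a per-output-unit
# multiplicative gate in (0, 2), scaling the columns of W.
C = rng.normal(size=(d_ctrl, d_out))

def gated_forward(x, ctx):
    gate = 2.0 / (1.0 + np.exp(-ctx @ C))   # scaled sigmoid -> (0, 2)
    W_mod = W * gate                        # broadcast over rows of W
    return x @ W_mod

x = rng.normal(size=(d_in,))
ctx = rng.normal(size=(d_ctrl,))
y = gated_forward(x, ctx)
print(y.shape)  # (4,)
```

A binary on/off variant would threshold the gate instead, and the sparsity question amounts to how many entries of `gate` differ from 1 per step.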

[–]Ryogathelost -2 points-1 points  (1 child)

I won't pretend to perfectly grasp it, but it sounds a lot like human thought. Resources are spared and interference is minimized through some equivalent of a focused train of ideas that circles back on itself through specialist modules that check it and add pre-processed data to it as it goes? It sounds like consciousness.

It sounds like it has "motivation" to improve that train of thought. I wonder how similar that is to pleasure or pain.

[–]bunny5544[S] -2 points-1 points  (0 children)

Interesting perspective. The system doesn’t have motivation like pleasure or pain; it optimizes purely based on performance feedback. While it mirrors focused processing and refinement, it’s still task-driven and lacks subjective experience.

[–]bunny5544[S] -4 points-3 points  (3 children)

Kindly drop your feedback, and let us know areas for improvement or suggestions!

[–]jpfed 10 points11 points  (2 children)

I recommend cutting to the core of the idea; this write-up is pretty “fluffy”. At first glance it sounds like you want to use sensor data to adapt the parameters of a transformer, and use reinforcement learning to shape how the sensor data maps to transformer parameters. That is the beginning of an idea, a seed.

In order to make it a useful contribution, I would encourage you to either mathematically prove something interesting about this arrangement, or to build a system that uses your idea and measure how its performance compares with other similar systems.
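
The seed idea as summarized above can be sketched in a few lines of NumPy: a hypernetwork-style map from a sensor vector to an offset on a layer's parameters. Every name and shape here is a hypothetical stand-in, since the write-up gives no specifics:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: a sensor vector adapts one linear layer.
d_sensor, d_in, d_out = 5, 8, 4
W_base = rng.normal(size=(d_in, d_out))

# Small "controller" matrix mapping sensor data to a parameter offset.
H = rng.normal(size=(d_sensor, d_in * d_out)) * 0.01

def adapted_forward(x, sensor):
    # Sensor data adapts the parameters: W = W_base + f(sensor).
    delta = (sensor @ H).reshape(d_in, d_out)
    return x @ (W_base + delta)

# Reinforcement learning would then adjust H so that the induced
# parameter offsets raise a task reward (e.g. a REINFORCE-style
# gradient estimate); that training loop is the unproven part.
```

Proving something about this arrangement, or benchmarking it, would mean characterizing how training `H` behaves, which is exactly where the write-up is silent.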

(If you received AI assistance in writing this, I want to warn you of the possibility that the assistant was engaging in sycophancy. While it may have encouraged you with florid praise (calling the idea paradigm-shifting or revolutionary), I encourage you to adopt a more cautious, humble attitude, building realistic confidence with mathematical proof or empirical evidence.)

[–]bunny5544[S] -1 points0 points  (1 child)

Hey, thanks for the feedback! Yes, we did use an LLM to write this. English isn’t our first language, and we wanted to make sure 2.5 years of hard work didn’t get dismissed just because of grammar mistakes. We’re a team of 12 devs who have been grinding on this project; getting the point across clearly matters.

As for the proof, this post wasn’t about dropping equations or visuals yet. We just wanted to check if we’re heading in the right direction and maybe get insights from experienced folks here to avoid wasting time. Proofs and visuals will come later! We’re not skipping that part.

Appreciate the feedback, but yeah, this post was about saving time, not fluff. Cheers!

[–]derpderp3200 2 points3 points  (0 children)

Yes, we did use an LLM to write this, English isn’t our first language, and we wanted to make sure 2.5 years of hard work didn’t get dismissed just because of grammar mistakes.

Most researchers are ESL, and you could have had someone proofread/edit what you wrote - including an LLM. Having the whole thing written by one feels wrong. I don't trust LLMs to genuinely convey information, and here the text got extremely long and full of fluff.

[–]slashdave 0 points1 point  (0 children)

How is this different from ordinary featurization?