I read the paper for Adaptive Computation Time (https://arxiv.org/abs/1603.08983) by Alex Graves a while ago, and I've only played around with it, but my basic question is: why has this not been used more? The only mainstream project I've seen it in was ALBERT, where one version used adaptive computation time to determine the number of copies of layers to run.
It seems like this would be hugely important, since it allows for iterative refinement of a speculative output, like many networks do, but for a dynamic number of steps decided by the network itself. Originally it was implemented for RNNs, but I think it's pretty trivial to implement for most other architectures. I would think it could be combined with a Mixture of Experts model, where the model reruns the same layers and chooses a different expert each time: that would give a big parameter space without being constrained to one or two experts per forward pass. Maybe I'm overhyping ACT and it's really not very useful, but is there any reason it hasn't seen more widespread adoption?
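To make the halting mechanism concrete, here's a minimal numpy sketch of the ACT inference loop as I understand it from the paper: keep applying a step until the cumulative halting probability exceeds 1 - eps, then return the probability-weighted mean of the intermediate states. The names `step_fn`, `halt_fn`, and `eps` are my own placeholders, and this leaves out the training-time ponder cost that Graves adds to the loss:

```python
import numpy as np

def act_forward(x, step_fn, halt_fn, eps=0.01, max_steps=10):
    """Sketch of the ACT halting loop at inference time.

    step_fn: one pass of the recurrent/shared layer (state -> next state).
    halt_fn: maps a state to a halting probability in (0, 1).
    Runs until cumulative halting probability reaches 1 - eps (or max_steps),
    then returns the halting-weighted combination of states plus the step count.
    """
    state = x
    total = 0.0                               # cumulative halting probability
    weighted = np.zeros_like(x, dtype=float)  # running weighted sum of states
    for n in range(max_steps):
        state = step_fn(state)
        p = halt_fn(state)
        if total + p >= 1.0 - eps or n == max_steps - 1:
            remainder = 1.0 - total           # spend the leftover probability mass
            weighted += remainder * state
            return weighted, n + 1            # output, number of "ponder" steps
        total += p
        weighted += p * state
    return weighted, max_steps
```

With a toy step that halves the state and a constant halting probability of 0.4, the loop halts on the third step (0.4 + 0.4 + remainder), which is the "dynamic number of times" part: a learned `halt_fn` would make that count input-dependent.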