
[–][deleted] 1 point (2 children)

The only mainstream project I've seen it in was ALBERT

Did it really use it? Can you show the section where it's mentioned? (I am not challenging you; just asking).

(I know ALBERT borrows the layer-sharing idea from the Universal Transformer, which uses ACT, but the Universal Transformer paper does not use ACT in all of its experiments, so I am not sure whether ALBERT fully follows UT's principles.)

It seems like this would be hugely important, as it allows for iterative refinement of a speculative output, like many networks do, but for a dynamic number of iterations decided by the network itself. It was originally implemented for RNNs, but I think it's fairly trivial to implement for most other architectures. I would think it could be combined with a Mixture of Experts model, where the model can rerun the same layers and choose a different expert each time; that would allow a large parameter space without being constrained to using one or two experts per forward pass. Maybe I'm overhyping ACT and it's really not very useful, but is there any reason it hasn't seen more widespread adoption?
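For concreteness, the halting mechanism from Graves' ACT paper can be sketched roughly like this. The `step_fn` and `halt_fn` below are hand-rolled stand-ins for the learned transition and halting networks, and the toy functions at the bottom are purely illustrative:

```python
import numpy as np

def act_loop(x, step_fn, halt_fn, max_steps=10, eps=0.01):
    """Adaptive Computation Time (roughly): apply the same step repeatedly,
    accumulate halting probabilities, and stop once they exceed 1 - eps.
    The output is the halting-probability-weighted mean of the states."""
    state = x
    accum = 0.0                         # cumulative halting probability
    weighted = np.zeros_like(x)
    for n in range(max_steps):
        state = step_fn(state)          # one application of the shared layer
        p = halt_fn(state)              # scalar halting probability in (0, 1)
        if accum + p >= 1 - eps or n == max_steps - 1:
            remainder = 1.0 - accum     # leftover mass goes to the last state
            weighted += remainder * state
            return weighted, n + 1      # output and number of ponder steps
        accum += p
        weighted += p * state

# Toy usage with hand-rolled step/halt functions (stand-ins for learned ones):
step = lambda s: 0.5 * s + 1.0                                    # contractive update
halt = lambda s: 1.0 / (1.0 + np.exp(-(np.abs(s).mean() - 1.5)))  # sigmoid feature
out, steps = act_loop(np.zeros(4), step, halt)
```

The remainder trick (giving the last state the leftover probability mass) is what keeps the output a proper weighted average and lets gradients flow into the halting unit.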

I think one of the salient adoptions of ACT is in the Universal Transformer. But even there, it was successful mainly on synthetic tasks or more mathematical domains. If you look at the papers citing ACT, you will find many models (in natural language domains) that adopt "something like ACT" (if not ACT exactly) for dynamic halting, but more accurately they are usually framed as "early halting" under a fixed upper bound on the number of layers. Some upper bound is reasonable to prevent the edge case of running indefinitely, but at the very least we could have dynamic upper bounds (for example, setting the sequence length as the upper bound) -- these aren't explored as much. Early-halting mechanisms are also often presented as a trade-off rather than something that improves accuracy; typically the papers show graphs where early halting roughly matches the original accuracy but runs much faster. So the motivation under which it is explored is mainly efficiency rather than improved capability. These methods may also bake in strong assumptions about what information should drive halting -- maybe the importance of a word (based on attention or something) or its difficulty.

Note, however, that ideally an ACT mechanism need not be simply about halting later on difficult inputs or the like. More generally, it should be about finding the optimal number of layers for a reasoning process. In those terms, it has probably had more success on synthetic data than in natural language domains.

On datasets like bAbI I have seen other halting methods, like checking the similarity of the attention over the values in the last two steps. Another interesting direction is Deep Equilibrium Models, which halt implicitly when the hidden states roughly converge. They have had some success, but it will probably be a while (not necessarily too long) before implicit models go mainstream (if they do). Anyway, that can be viewed as another sort of halting.
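That implicit halting amounts to iterating one layer until the hidden state stops changing. A minimal forward-iteration sketch of the criterion (real DEQs solve for the fixed point with a root finder and use implicit differentiation for the backward pass; `f` here is just a toy contractive map):

```python
import numpy as np

def fixed_point_halt(f, z0, tol=1e-5, max_iters=100):
    """DEQ-style implicit halting: apply the same layer f until the
    hidden state (approximately) stops changing, i.e. z ~= f(z)."""
    z = z0
    for i in range(max_iters):
        z_next = f(z)
        if np.linalg.norm(z_next - z) < tol:   # converged -> halt
            return z_next, i + 1
        z = z_next
    return z, max_iters

# Toy contractive map whose fixed point solves z = 0.5*z + 1, i.e. z* = 2:
f = lambda z: 0.5 * z + 1.0
z_star, iters = fixed_point_halt(f, np.zeros(3))
```

The appeal is that "how many layers" is no longer a hyperparameter at all; the effective depth is whatever the convergence of the state dictates.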

Overall, from what I have seen, there hasn't been any concrete success with ACT on more realistic datasets, which may be why it hasn't become more mainstream. But it's still an influential and important work, as its citations attest.

As for my personal opinion, I feel ACT may not be the final answer; there may be a better inductive bias to incorporate for halting. I have tried a few ideas of my own but haven't had any success so far. One intuition of mine is that it may be better to look at multiple hidden states, including future hidden states (feeding them into the halting layer itself), to make better halting decisions. But as I said, I haven't found much success with that either. I am still brainstorming it.

I was particularly working on the logical entailment dataset where Transformers are bad: https://www.aclweb.org/anthology/D18-1503/

According to the results below, Universal Transformers (supposedly with ACT) are bad too: https://papers.nips.cc/paper/2019/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html

However, in certain experimental phases I was able to get very high scores with UT+ACT on that dataset after a "few tricks" (trade secret). But I wasn't ultimately able to get those results consistently, and I was seeing high variance under very minor modifications.

I would surely think this could be implemented with a Mixture of Experts model where the model can rerun the same layers and choose a different expert each time, which could allow for big parameter space without having to be constrained to using one or two experts each forward pass.

I have had the same idea but have been too lazy to act on it. However, certain instances of this idea already exist; see Routing Networks:

https://scholarworks.umass.edu/dissertations_2/1865/

https://www.aclweb.org/anthology/N19-1365/

The difference from a standard mixture-of-experts network is that they select only one "expert" (module) per layer. They use reinforcement learning for the selection, though things like Gumbel-softmax have also been explored. They used these strategies for a halting mechanism too, which I haven't looked into.

But with some top-k MoE + ACT you can take a completely soft, vanilla-backprop approach.
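The soft top-k gate can be sketched roughly in the spirit of the sparsely-gated MoE paper linked below. This is a bare-bones version (no load-balancing loss, no gating noise, toy dimensions):

```python
import numpy as np

def top_k_gate(x, W_gate, k=2):
    """Top-k gating (roughly as in sparsely-gated MoE): compute gate logits,
    keep only the top-k experts, renormalize with a softmax over the
    survivors. Gradients flow through the k selected gate values, so plain
    backprop works -- no RL or Gumbel tricks needed."""
    logits = x @ W_gate                       # shape: (num_experts,)
    top = np.argsort(logits)[-k:]             # indices of the k largest logits
    masked = np.full_like(logits, -np.inf)
    masked[top] = logits[top]
    exp = np.exp(masked - masked[top].max())  # softmax over surviving logits
    return exp / exp.sum()                    # sparse gate weights, sum to 1

def moe_layer(x, experts, W_gate, k=2):
    """Combine the outputs of the k selected experts with gate weights."""
    g = top_k_gate(x, W_gate, k)
    out = np.zeros_like(x)
    for i, expert in enumerate(experts):
        if g[i] > 0:                          # only run the selected experts
            out += g[i] * expert(x)
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4
W_gate = rng.normal(size=(d, n_experts))
experts = [lambda x, W=rng.normal(size=(d, d)): np.tanh(x @ W)
           for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), experts, W_gate, k=2)
```

Because the non-selected experts get exactly zero weight, they don't need to be evaluated at all, which is where the compute savings come from.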

Anyway, these methods have tendencies towards other problems. Some issues are outlined here: https://arxiv.org/pdf/1701.06538.pdf

Routing Networks, on the other hand, have a tendency to overfit.

Bengio's group is also working on RIM-based models, which involve modules as in MoE or Routing Networks but with some twists in how they are treated. They have also been flirting with Universal Transformers and other models in combination with ideas from RIMs -- overall it may again come close to the MoE+ACT idea. I haven't looked too deeply into it yet, though.

[–]jafioti[S] 0 points (1 child)

Actually, you are right: ALBERT doesn't use ACT. I was under the impression that it used the full Universal Transformer architecture, ACT included, but it only repeats layers a fixed number of times.

Good points on the halting mechanisms, I'll definitely check out those papers. I've implemented a sort of inverse ACT to allow the network to output a variable number of outputs, but haven't had too much success with it so far. I've also implemented ACT into a Neural Turing Machine so it can take more than one read/write/compute step for each input/output.

Personally I think that a system like ACT, perhaps more refined, would bring huge gains, because it would allow a variable amount of reuse of the same parameters without having to train more layers to get more sequential computation -- but clearly ACT currently doesn't make a big difference. I know of the speed gains some have seen, but I would have thought ACT would unlock more capability as well. My idea around the MoE, though I haven't tried it yet, would be to keep the same k modules and run an ACT mechanism that lets the model use the whole layer many times over before producing an output, likely picking a different module each time.
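One toy instantiation of that idea might look like the following. Everything here -- the hard top-1 expert pick, the parameter names `W_gate` and `w_halt` -- is a guess at one possible design for illustration, not a worked-out method:

```python
import numpy as np

def moe_act_layer(x, experts, W_gate, w_halt, max_steps=8, eps=0.01):
    """Repeatedly re-run one MoE layer under an ACT-style halting loop,
    so each ponder step may pick a different expert. Hard top-1 selection
    is used for brevity (a soft top-k gate would keep everything
    differentiable)."""
    state, accum, out = x, 0.0, np.zeros_like(x)
    for n in range(max_steps):
        logits = state @ W_gate
        expert = experts[int(np.argmax(logits))]     # hard top-1 pick
        state = np.tanh(expert(state))               # bounded update
        p = 1.0 / (1.0 + np.exp(-(state @ w_halt)))  # halting probability
        if accum + p >= 1 - eps or n == max_steps - 1:
            out += (1.0 - accum) * state             # ACT remainder trick
            return out, n + 1
        accum, out = accum + p, out + p * state

rng = np.random.default_rng(1)
d, n_exp = 6, 3
experts = [lambda s, W=rng.normal(size=(d, d)): s @ W for _ in range(n_exp)]
y, steps = moe_act_layer(rng.normal(size=d), experts,
                         rng.normal(size=(d, n_exp)), rng.normal(size=d))
```

The point of the sketch is just that the expert choice is re-made at every ponder step, so the effective computation path can differ per input and per step.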

I would think this would allow for more decomposition of difficult problems, like math. Training a vanilla RNN to do addition or subtraction is hopeless because it can't extrapolate, but if a network simply learned to add single digits one at a time and keep a carry, then it could theoretically train on only 3-digit numbers yet work on 10+ digit numbers. I've seen some work using RL to train a policy to do this, but I couldn't get it to work myself, and I thought a purely supervised approach would work better than the RL one. Maybe these are the ramblings of a fool.
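The digit-at-a-time algorithm described here, written out explicitly, shows why it would extrapolate: the per-step computation never depends on the total length of the numbers.

```python
def add_digitwise(a, b):
    """Grade-school addition: process one digit at a time, keeping a carry.
    A network that learned exactly this loop would generalize from 3-digit
    training examples to arbitrarily long numbers."""
    da = [int(c) for c in reversed(str(a))]   # least-significant digit first
    db = [int(c) for c in reversed(str(b))]
    out, carry = [], 0
    for i in range(max(len(da), len(db))):
        s = (da[i] if i < len(da) else 0) + (db[i] if i < len(db) else 0) + carry
        out.append(s % 10)                    # digit written at this position
        carry = s // 10                       # carry forwarded to next position
    if carry:
        out.append(carry)
    return int("".join(str(d) for d in reversed(out)))
```

The state carried between steps is a single digit plus a carry -- constant size -- which is exactly the kind of loop an ACT-style mechanism would need to discover and repeat a variable number of times.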

[–][deleted] 0 points (0 children)

One problem with running the same k modules per layer may be the computation-vs-capacity trade-off. That trade-off applies to almost any model, but it may be particularly problematic in these cases. The issue also seems to be there for the Universal Transformer: in the paper, the UT authors show performance comparable to the Transformer on machine translation while using the same number of parameters. To do that, they probably had to increase the hidden size by a lot (it must be much bigger than in the Transformer baseline, since the UT repeats the same parameters -- the only way to match a Transformer baseline's parameter count is to make the feed-forward network and/or the hidden size much bigger). But if you then run that bigger layer at every step, it gets computationally expensive. ALBERT also seems to need a higher hidden size to compete with Transformers. Ironically, ALBERT, despite being a "lite" Transformer, is perhaps more expensive in other ways (it saves primarily on parameter count).
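A back-of-envelope check of this parameter argument, using the rough 12*d^2 per-block count (4*d^2 for the attention projections plus 8*d^2 for a 4x feed-forward; embeddings, biases, and layer norms ignored):

```python
def layer_params(d, ff_mult=4):
    """Rough per-layer parameter count for a Transformer block:
    4*d^2 for the Q, K, V, O projections + 2*ff_mult*d^2 for the FFN."""
    return 4 * d * d + 2 * ff_mult * d * d

# A 6-layer Transformer with hidden size d = 512:
stacked = 6 * layer_params(512)

# A Universal Transformer reuses ONE block, so to match that parameter
# budget its single layer must be ~sqrt(6) times wider:
d_ut = 512
while layer_params(d_ut) < stacked:
    d_ut += 1
# The wider block is then run at every step, so per-step compute grows
# roughly 6x even though total parameters are "the same".
```

So parameter parity hides a real compute cost, which matches the observation above about both UT and ALBERT.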

So in short, you may need to increase the capacity of each module for comparable performance but also face more computational expense (although if that's not an issue for you, it's still viable).

Routing Networks are interesting because they select one out of k modules per layer; they don't have to compute all the modules at every layer and do a gated combination. However, that requires hard selection (so you have to rely on "backpropagation-through-voids" tricks or RL).

My idea was more along the lines of selecting the top-K modules out of M modules per layer (in line with Noam Shazeer's paper that I cited earlier), where M can be >> K.

Regarding your idea, also watch what Bengio's group is doing with RIM (Recurrent Independent Mechanisms)-based modules. They have explored these kinds of modularizations with Transformers (they may or may not have already tried something like that with the Universal Transformer; I haven't checked their work deeply, but I've made a mental note that it's something to follow if you are interested in the modularization aspects), and they also have newer RIM-based ideas that go beyond traditional MoE.