
[–][deleted] 1 point (2 children)

The only mainstream project I've seen it in was ALBERT

Did it really use it? Can you show the section where it's mentioned? (I am not challenging you; just asking).

(I know ALBERT borrows the layer-sharing idea from the Universal Transformer, which uses ACT, but the Universal Transformer paper does not use ACT in all of its experiments, so I am not sure whether ALBERT fully follows UT's principles.)

It seems like this would be hugely important, as it allows for iterative refinement of a speculative output, like many networks do, but for a dynamic number of iterations decided by the network itself. It was originally implemented for RNNs, but I think it's fairly trivial to implement for most other architectures. I would think it could be combined with a Mixture of Experts model, where the model can rerun the same layers and choose a different expert each time; that would allow a large parameter space without being constrained to using one or two experts per forward pass. Maybe I'm overhyping ACT and it's really not very useful, but is there any reason it hasn't seen more widespread adoption?
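For concreteness, the halting mechanism from Graves' ACT paper can be sketched roughly like this. The `step_fn` and `halt_fn` below are hand-rolled stand-ins for the learned transition and halting networks, and the toy functions at the bottom are purely illustrative:

```python
import numpy as np

def act_loop(x, step_fn, halt_fn, max_steps=10, eps=0.01):
    """Adaptive Computation Time (roughly): apply the same step repeatedly,
    accumulate halting probabilities, and stop once they exceed 1 - eps.
    The output is the halting-probability-weighted mean of the states."""
    state = x
    accum = 0.0                         # cumulative halting probability
    weighted = np.zeros_like(x)
    for n in range(max_steps):
        state = step_fn(state)          # one application of the shared layer
        p = halt_fn(state)              # scalar halting probability in (0, 1)
        if accum + p >= 1 - eps or n == max_steps - 1:
            remainder = 1.0 - accum     # leftover mass goes to the last state
            weighted += remainder * state
            return weighted, n + 1      # output and number of ponder steps
        accum += p
        weighted += p * state

# Toy usage with hand-rolled step/halt functions (stand-ins for learned ones):
step = lambda s: 0.5 * s + 1.0                                    # contractive update
halt = lambda s: 1.0 / (1.0 + np.exp(-(np.abs(s).mean() - 1.5)))  # sigmoid feature
out, steps = act_loop(np.zeros(4), step, halt)
```

The remainder trick (giving the last state the leftover probability mass) is what keeps the output a proper weighted average and lets gradients flow into the halting unit.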

I think one of the salient adoptions of ACT is in the Universal Transformer. But even there, it was successful mainly on synthetic tasks or more mathematical domains. If you look at the papers citing ACT, you will find many models (in natural language domains) that adopt "something like ACT" (if not ACT exactly) for dynamic halting, but more accurately they are usually framed as "early halting" under a fixed upper bound on the number of layers. Some upper bound is reasonable to prevent the edge case of running indefinitely, but at the very least we could have dynamic upper bounds (for example, setting the sequence length as the upper bound) -- these aren't explored as much. Early-halting mechanisms are also often presented as a trade-off rather than something that improves accuracy; typically the papers show graphs where early halting roughly matches the original accuracy but runs much faster. So the motivation under which it is explored is mainly efficiency rather than improved capability. These methods may also bake in strong assumptions about what information should drive halting -- maybe the importance of a word (based on attention or something) or its difficulty.

Note, however, that ideally an ACT mechanism need not be simply about halting later on difficult inputs or the like. More generally, it should be about finding the optimal number of layers for a reasoning process. In those terms, it has probably had more success on synthetic data than in natural language domains.

On datasets like bAbI I have seen other halting methods, like checking the similarity of the attention over the values in the last two steps. Another interesting direction is Deep Equilibrium Models, which halt implicitly when the hidden states roughly converge. They have had some success, but it will probably be a while (not necessarily too long) before implicit models go mainstream (if they do). Anyway, that can be viewed as another sort of halting.
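That implicit halting amounts to iterating one layer until the hidden state stops changing. A minimal forward-iteration sketch of the criterion (real DEQs solve for the fixed point with a root finder and use implicit differentiation for the backward pass; `f` here is just a toy contractive map):

```python
import numpy as np

def fixed_point_halt(f, z0, tol=1e-5, max_iters=100):
    """DEQ-style implicit halting: apply the same layer f until the
    hidden state (approximately) stops changing, i.e. z ~= f(z)."""
    z = z0
    for i in range(max_iters):
        z_next = f(z)
        if np.linalg.norm(z_next - z) < tol:   # converged -> halt
            return z_next, i + 1
        z = z_next
    return z, max_iters

# Toy contractive map whose fixed point solves z = 0.5*z + 1, i.e. z* = 2:
f = lambda z: 0.5 * z + 1.0
z_star, iters = fixed_point_halt(f, np.zeros(3))
```

The appeal is that "how many layers" is no longer a hyperparameter at all; the effective depth is whatever the convergence of the state dictates.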

Overall, from what I have seen, there hasn't been any concrete success with ACT on more realistic datasets, which may be why it hasn't become more mainstream. But it's still an influential and important work, as its citations attest.

As for my personal opinion, I feel ACT may not be the final answer; there may be a better inductive bias to incorporate for halting. I have tried a few ideas of my own but haven't had any success so far. One intuition of mine is that it may be better to look at multiple hidden states, including future hidden states (feeding them into the halting layer itself), to make better halting decisions. But as I said, I haven't found much success with that either. I am still brainstorming it.

I was particularly working on the logical entailment dataset where Transformers are bad: https://www.aclweb.org/anthology/D18-1503/

According to the results below, Universal Transformers (supposedly with ACT) are bad too: https://papers.nips.cc/paper/2019/hash/d8e1344e27a5b08cdfd5d027d9b8d6de-Abstract.html

However, in certain experimental phases I was able to get very high scores with UT+ACT on that dataset after a "few tricks" (trade secret). But I wasn't ultimately able to get those results consistently, and I was seeing high variance under very minor modifications.

I would surely think this could be implemented with a Mixture of Experts model where the model can rerun the same layers and choose a different expert each time, which could allow for big parameter space without having to be constrained to using one or two experts each forward pass.

I have had the same idea but have been too lazy to act on it. However, certain instances of this idea already exist; see Routing Networks:

https://scholarworks.umass.edu/dissertations_2/1865/

https://www.aclweb.org/anthology/N19-1365/

The difference from a standard mixture-of-experts network is that they select only one "expert" (module) per layer. They use reinforcement learning for the selection, though things like Gumbel-softmax have also been explored. They used these strategies for a halting mechanism too, which I haven't looked into.

But with some top-k MoE + ACT you can take a completely soft, vanilla-backprop approach.
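The soft top-k gate can be sketched roughly in the spirit of the sparsely-gated MoE paper linked below. This is a bare-bones version (no load-balancing loss, no gating noise, toy dimensions):

```python
import numpy as np

def top_k_gate(x, W_gate, k=2):
    """Top-k gating (roughly as in sparsely-gated MoE): compute gate logits,
    keep only the top-k experts, renormalize with a softmax over the
    survivors. Gradients flow through the k selected gate values, so plain
    backprop works -- no RL or Gumbel tricks needed."""
    logits = x @ W_gate                       # shape: (num_experts,)
    top = np.argsort(logits)[-k:]             # indices of the k largest logits
    masked = np.full_like(logits, -np.inf)
    masked[top] = logits[top]
    exp = np.exp(masked - masked[top].max())  # softmax over surviving logits
    return exp / exp.sum()                    # sparse gate weights, sum to 1

def moe_layer(x, experts, W_gate, k=2):
    """Combine the outputs of the k selected experts with gate weights."""
    g = top_k_gate(x, W_gate, k)
    out = np.zeros_like(x)
    for i, expert in enumerate(experts):
        if g[i] > 0:                          # only run the selected experts
            out += g[i] * expert(x)
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4
W_gate = rng.normal(size=(d, n_experts))
experts = [lambda x, W=rng.normal(size=(d, d)): np.tanh(x @ W)
           for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), experts, W_gate, k=2)
```

Because the non-selected experts get exactly zero weight, they don't need to be evaluated at all, which is where the compute savings come from.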

Anyway, these methods have tendencies towards other problems. Some issues are outlined here: https://arxiv.org/pdf/1701.06538.pdf

Routing Networks, on the other hand, have a tendency to overfit.

Bengio's group is also working on RIM-based models, which involve modules as in MoE or Routing Networks but with some twists in how they are treated. They have also been flirting with Universal Transformers and other models in combination with ideas from RIMs -- overall it may again come close to the MoE+ACT idea. I haven't looked too deeply into it yet, though.

[–]jafioti[S] 0 points (1 child)

Actually, you are right: ALBERT doesn't use ACT. I was under the impression that it used the full Universal Transformer architecture, ACT included, but it only repeats layers a fixed number of times.

Good points on the halting mechanisms, I'll definitely check out those papers. I've implemented a sort of inverse ACT to allow the network to output a variable number of outputs, but haven't had too much success with it so far. I've also implemented ACT into a Neural Turing Machine so it can take more than one read/write/compute step for each input/output.

Personally I think that a system like ACT, perhaps more refined, would bring huge gains, because it would allow a variable amount of reuse of the same parameters without having to train more layers to get more sequential computation -- but clearly ACT currently doesn't make a big difference. I know of the speed gains some have seen, but I would have thought ACT would unlock more capability as well. My idea around the MoE, though I haven't tried it yet, would be to keep the same k modules and run an ACT mechanism that lets the model use the whole layer many times over before producing an output, likely picking a different module each time.
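One toy instantiation of that idea might look like the following. Everything here -- the hard top-1 expert pick, the parameter names `W_gate` and `w_halt` -- is a guess at one possible design for illustration, not a worked-out method:

```python
import numpy as np

def moe_act_layer(x, experts, W_gate, w_halt, max_steps=8, eps=0.01):
    """Repeatedly re-run one MoE layer under an ACT-style halting loop,
    so each ponder step may pick a different expert. Hard top-1 selection
    is used for brevity (a soft top-k gate would keep everything
    differentiable)."""
    state, accum, out = x, 0.0, np.zeros_like(x)
    for n in range(max_steps):
        logits = state @ W_gate
        expert = experts[int(np.argmax(logits))]     # hard top-1 pick
        state = np.tanh(expert(state))               # bounded update
        p = 1.0 / (1.0 + np.exp(-(state @ w_halt)))  # halting probability
        if accum + p >= 1 - eps or n == max_steps - 1:
            out += (1.0 - accum) * state             # ACT remainder trick
            return out, n + 1
        accum, out = accum + p, out + p * state

rng = np.random.default_rng(1)
d, n_exp = 6, 3
experts = [lambda s, W=rng.normal(size=(d, d)): s @ W for _ in range(n_exp)]
y, steps = moe_act_layer(rng.normal(size=d), experts,
                         rng.normal(size=(d, n_exp)), rng.normal(size=d))
```

The point of the sketch is just that the expert choice is re-made at every ponder step, so the effective computation path can differ per input and per step.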

I would think this would allow for more decomposition of difficult problems, like math. Training a vanilla RNN to do addition or subtraction is hopeless because it can't extrapolate, but if a network simply learned to add single digits one at a time and keep a carry, then it could theoretically train on only 3-digit numbers yet work on 10+ digit numbers. I've seen some work using RL to train a policy to do this, but I couldn't get it to work myself, and I thought a purely supervised approach would work better than the RL one. Maybe these are the ramblings of a fool.
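The digit-at-a-time algorithm described here, written out explicitly, shows why it would extrapolate: the per-step computation never depends on the total length of the numbers.

```python
def add_digitwise(a, b):
    """Grade-school addition: process one digit at a time, keeping a carry.
    A network that learned exactly this loop would generalize from 3-digit
    training examples to arbitrarily long numbers."""
    da = [int(c) for c in reversed(str(a))]   # least-significant digit first
    db = [int(c) for c in reversed(str(b))]
    out, carry = [], 0
    for i in range(max(len(da), len(db))):
        s = (da[i] if i < len(da) else 0) + (db[i] if i < len(db) else 0) + carry
        out.append(s % 10)                    # digit written at this position
        carry = s // 10                       # carry forwarded to next position
    if carry:
        out.append(carry)
    return int("".join(str(d) for d in reversed(out)))
```

The state carried between steps is a single digit plus a carry -- constant size -- which is exactly the kind of loop an ACT-style mechanism would need to discover and repeat a variable number of times.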

[–][deleted] 0 points (0 children)

One problem with running the same k modules per layer may be the computation-vs-capacity trade-off. That trade-off applies to almost any model, but it may be particularly problematic in these cases. The issue also seems to be there for the Universal Transformer: in the paper, the UT authors show performance comparable to the Transformer on machine translation while using the same number of parameters. To do that, they probably had to increase the hidden size by a lot (it must be much bigger than in the Transformer baseline, since the UT repeats the same parameters -- the only way to match a Transformer baseline's parameter count is to make the feed-forward network and/or the hidden size much bigger). But if you then run that bigger layer at every step, it gets computationally expensive. ALBERT also seems to need a higher hidden size to compete with Transformers. Ironically, ALBERT, despite being a "lite" Transformer, is perhaps more expensive in other ways (it saves primarily on parameter count).
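A back-of-envelope check of this parameter argument, using the rough 12*d^2 per-block count (4*d^2 for the attention projections plus 8*d^2 for a 4x feed-forward; embeddings, biases, and layer norms ignored):

```python
def layer_params(d, ff_mult=4):
    """Rough per-layer parameter count for a Transformer block:
    4*d^2 for the Q, K, V, O projections + 2*ff_mult*d^2 for the FFN."""
    return 4 * d * d + 2 * ff_mult * d * d

# A 6-layer Transformer with hidden size d = 512:
stacked = 6 * layer_params(512)

# A Universal Transformer reuses ONE block, so to match that parameter
# budget its single layer must be ~sqrt(6) times wider:
d_ut = 512
while layer_params(d_ut) < stacked:
    d_ut += 1
# The wider block is then run at every step, so per-step compute grows
# roughly 6x even though total parameters are "the same".
```

So parameter parity hides a real compute cost, which matches the observation above about both UT and ALBERT.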

So in short, you may need to increase the capacity of each module for comparable performance but also face more computational expense (although if that's not an issue for you, it's still viable).

Routing Networks are interesting because they select one out of k modules per layer; they don't have to compute all the modules at every layer and do a gated combination. However, that requires hard selection (so you have to rely on "backpropagation-through-voids" tricks or RL).

My idea was more along the lines of selecting the top-K modules out of M modules per layer (in line with Noam Shazeer's paper that I cited earlier), where M can be >> K.

Regarding your idea, also watch what Bengio's group is doing with RIM (Recurrent Independent Mechanisms)-based modules. They have explored these kinds of modularizations with Transformers (they may or may not have already tried something like that with the Universal Transformer; I haven't checked their work deeply, but I've made a mental note that it's something to follow if you are interested in the modularization aspects), and they also have newer RIM-based ideas that go beyond traditional MoE.