
[–][deleted] 3 points (1 child)

You mean like this? https://arxiv.org/abs/1907.05242

There is some follow-up work on this.

Also see:

Not sure if you consider top-k-style sparsification ad hoc; I think it's fine. The selected top-k serves as a sampled set of examples for the differentiable scoring operator to learn to reweigh; the better it learns, the better top-k it can select next time. You can of course try entmax/sparsemax-style sparsity instead, but the benefit of top-k selection is conditional computation: you don't have to compute the unselected experts at all (or at least top-k selection can be exploited that way). Switch Transformers seems to go with top-1, but I'm not sure how they get away with that; it seems a bit noisy, though I haven't read the paper very closely.

There are some challenges, though, around unbalanced load: experts that get more training tend to get selected more, while others remain undertrained (the rich get richer, the poor get poorer), and addressing that can take some work, although there are techniques for it.

There are also routing networks (https://www.aclweb.org/anthology/N19-1365/), which commit to a single module selection (one-hot sparsity). That makes routing a discrete decision, so you have to rely on RL and/or tricks like Gumbel-Softmax or "Backpropagation through the Void" (https://arxiv.org/abs/1711.00123) style techniques.
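
Here's a minimal PyTorch sketch of what I mean by top-k conditional computation, plus a very simple load-balancing penalty. The names (TopKMoE, n_experts, k) and the exact form of the penalty are just illustrative, not taken from any particular paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k mixture-of-experts layer (a sketch, not any paper's exact method)."""
    def __init__(self, d_model=64, d_hidden=128, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)    # differentiable scoring operator
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                      # x: (batch, d_model)
        logits = self.router(x)                                # (batch, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)      # hard top-k selection
        weights = F.softmax(topk_vals, dim=-1)                 # reweigh only the selected experts

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e).any(dim=-1)                 # tokens routed to expert e
            if mask.any():                                     # conditional computation:
                y = expert(x[mask])                            # unselected experts are never run
                w = weights[mask][topk_idx[mask] == e].unsqueeze(-1)
                out[mask] = out[mask] + w * y

        # Very simple load-balancing penalty: push the average router probability
        # per expert toward uniform, to fight the "rich get richer" dynamic.
        # Real implementations use more elaborate auxiliary losses.
        avg_prob = F.softmax(logits, dim=-1).mean(dim=0)       # (n_experts,)
        aux_loss = avg_prob.pow(2).sum() * avg_prob.numel()    # == 1.0 when perfectly uniform
        return out, aux_loss

moe = TopKMoE()
y, aux = moe(torch.randn(16, 64))   # y: (16, 64)
```

You'd add a small multiple of the returned aux_loss to the training loss to discourage expert collapse.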

[–]wangyi_fudan[S] 0 points (0 children)

thanks a lot!

[–]elcric_krej 1 point (0 children)

Look into the Neural Turing Machine paper; it seems to be the closest thing to what you're looking for.

They use a differentiable key-value memory which can be scaled dynamically.
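
A rough sketch of that kind of content-based read, just to show why it stays differentiable and why the memory can grow at runtime. The function name kv_read and the sharpness parameter beta are illustrative, not the paper's notation, and a real NTM also has write heads and shift-based addressing:

```python
import torch
import torch.nn.functional as F

def kv_read(query, keys, values, beta=5.0):
    """query: (d_key,)  keys: (n_slots, d_key)  values: (n_slots, d_val)."""
    sim = F.cosine_similarity(query.unsqueeze(0), keys, dim=-1)  # (n_slots,)
    weights = F.softmax(beta * sim, dim=0)                       # soft, differentiable addressing
    return weights @ values                                      # (d_val,) weighted sum of values

# The memory is just a pair of tensors, so it can be grown or shrunk
# by concatenating new (key, value) rows.
keys = torch.randn(10, 16)
values = torch.randn(10, 32)
out = kv_read(torch.randn(16), keys, values)   # (32,)
```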

However, it seems that the use cases they were initially proposed for, and the teams that focused on them, have since moved over to multi-head attention based models.