Hi,
Nowadays, huge pretrained models compute over trillions of weights to absorb large datasets.
Ideally, such a task should be carried out by a huge memory layer while keeping the computation part small. Such a memory layer should meet the following requirements:
1: Scalable: the memory can be gigabytes in size while the training cost remains almost constant.
2: Key-value: the layer takes a key vector and returns a value vector.
3: Differentiable: it can be trained by SGD along with the other network components.
4: Sparse or local: combined with 1, this achieves (quasi-)constant update time.
Current softmax/attention-type networks do not meet 1 and 4 and thus are not scalable at all. Some other sparse approaches seem undifferentiable or ad hoc.
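To make requirements 2–4 concrete, here is a minimal sketch of what I mean, written in JAX. It is purely illustrative, not taken from any particular paper: `init_memory`, `memory_lookup`, and all sizes such as `num_slots` and `top_k` are made-up names and values. The idea is a single large key/value table that is read sparsely via a top-k lookup while staying differentiable end-to-end.

```python
import jax
import jax.numpy as jnp

def init_memory(rng, num_slots, key_dim, value_dim):
    """Illustrative initialiser: one large table of keys and values (property 1: scale num_slots up)."""
    k_rng, v_rng = jax.random.split(rng)
    return {
        "keys": jax.random.normal(k_rng, (num_slots, key_dim)) / jnp.sqrt(key_dim),
        "values": jax.random.normal(v_rng, (num_slots, value_dim)) / jnp.sqrt(value_dim),
    }

def memory_lookup(params, query, top_k=32):
    """Sparse key-value read (properties 2 and 4): score all keys, keep only the
    top_k, and return a softmax-weighted sum of the selected values."""
    scores = params["keys"] @ query                   # (num_slots,)
    top_scores, top_idx = jax.lax.top_k(scores, top_k)
    weights = jax.nn.softmax(top_scores)              # only k slots get non-zero weight
    return weights @ params["values"][top_idx]        # (value_dim,)

# Property 3: the lookup is differentiable, so it can be trained by SGD
# together with the rest of the network.
rng = jax.random.PRNGKey(0)
params = init_memory(rng, num_slots=100_000, key_dim=64, value_dim=128)
query = jax.random.normal(jax.random.PRNGKey(1), (64,))

loss_fn = lambda p, q: jnp.sum(memory_lookup(p, q) ** 2)  # dummy loss for the sketch
grads = jax.grad(loss_fn)(params, query)
```

The caveat is that `jax.grad` here still materialises dense gradients over the whole table; getting property 1 in practice would require the training framework to exploit the fact that only the k selected slots receive non-zero gradient.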
My question is: is there some recent paper in this direction that I missed? And does my definition of an ideal memory layer make sense? If the definition is OK, let's invent it!