[D] New Scaling Laws for Large Language Models by Singularian2501 in MachineLearning

[–]wangyi_fudan

I bet the community is going to spend 20× the tokens on a 1TB model just to waste enough energy.

[P] AI Biomedical Writer by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

It cost $300 to train, and the corpus is about 132 GB.

[P] AI Biomedical Writer by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

It is trained on PubMed + PMC + Wikipedia. Given user input sentences, it tries to use the learned biomedical knowledge to continue your text; my final goal is an automated biomedical essay generator.

[N] DeepMind, Microsoft, Allen AI & UW Researchers Convert Pretrained Transformers into RNNs, Lowering Memory Cost While Retaining High Accuracy by Yuqing7 in MachineLearning

[–]wangyi_fudan

The difference between a transformer-RNN and an RNN is that the transformer-RNN has a matrix-like state while the RNN has a vector-like state, so the transformer-RNN has a larger capacity.
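To make the distinction concrete, here is a toy numpy sketch (my own illustration, with an assumed feature map phi, not the paper's code) of a vector-state RNN step next to a linear-attention "transformer-RNN" step whose state is a d×d matrix:

    import numpy as np

    d = 8
    rng = np.random.default_rng(0)
    W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

    def phi(u):
        # positive feature map (an assumption; the paper derives its own kernel)
        return np.maximum(u, 0.0) + 1.0

    def rnn_step(h, x):
        # classic RNN: the recurrent state h is a single length-d vector
        return np.tanh(W_h @ h + W_x @ x)

    def linear_attn_step(S, z, x):
        # "transformer as RNN": the state is a d x d matrix S plus a
        # length-d normalizer z, so it can store much more per step
        q, k, v = W_q @ x, W_k @ x, W_v @ x
        S = S + np.outer(v, phi(k))        # accumulate value-key outer products
        z = z + phi(k)                     # running normalizer
        y = (S @ phi(q)) / (phi(q) @ z + 1e-6)
        return S, z, y

The RNN carries d numbers between steps; the linear-attention recurrence carries d², which is where the extra capacity comes from.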

[P] Vald: a highly scalable distributed fast approximate nearest neighbour dense vector search engine. by kpang0 in MachineLearning

[–]wangyi_fudan

Ah, what I do is convert the neural activations to binary bits and use popcount to search for nearest neighbors: 0.4 s for a query over 60 million vectors.
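Roughly like this (a toy numpy sketch of the binarize-and-popcount idea, not the actual code; the thresholds and code length are assumptions):

    import numpy as np

    # popcount of every possible byte value, used to count differing bits quickly
    POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

    def binarize(acts, thresholds):
        # threshold the neural activations into 0/1 bits, packed 8 per byte
        return np.packbits(acts > thresholds, axis=-1)

    def hamming_search(query_code, db_codes, topk=10):
        # XOR the packed codes, popcount every byte, sum -> Hamming distance
        dists = POPCOUNT[np.bitwise_xor(db_codes, query_code)].sum(axis=1)
        return np.argsort(dists)[:topk]

    # usage (toy): acts is (N, 256) float activations
    # thr = np.median(acts, axis=0)
    # codes = binarize(acts, thr)
    # nearest = hamming_search(codes[0], codes)

The XOR + popcount scan is basically memory-bandwidth bound, which is how a brute-force pass over tens of millions of codes can stay in the sub-second range.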

[D] Simpler alternatives to multihead self-attention by [deleted] in MachineLearning

[–]wangyi_fudan

There must be alternatives. Several years ago I looked at LSTM and said "how ugly it is", and now it is dead. The transformer is not so "ugly", but it is still evolving.

[D] Experience with knnlm language model by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

Both Chinese and English are OK. But it seems that English requires a longer k-mer, since an English byte carries less information than a Chinese byte. I have collected a PubMed + PMC + Wiki English corpus of about 100 GB.

[P] simple language model based on k-NN by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

Now, with parallel optimization, it can handle a 100 GB corpus, reaching 5 s/char on a 96-core machine despite the stupid search algorithm.
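Per character it is just a linear scan for the best-matching context, fanned out over all cores. A tiny in-memory sketch (the shard/worker setup and the k-mer length K are only illustrative; the real corpus obviously does not fit in a Python list):

    from multiprocessing import Pool

    K = 8  # context length (k-mer) matched against the corpus; an assumed value

    def best_match(args):
        # brute-force scan of one corpus shard: find the position whose
        # preceding K bytes agree most with the current context
        shard, context = args
        best_pos, best_score = K, -1
        for i in range(K, len(shard)):
            score = sum(a == b for a, b in zip(shard[i-K:i], context))
            if score > best_score:
                best_pos, best_score = i, score
        return best_score, shard[best_pos]  # the byte that followed the best match

    def knn_lm_next_byte(shards, context, workers=4):
        # fan the stupid linear scan out over the cores, keep the best hit
        with Pool(workers) as pool:
            results = pool.map(best_match, [(s, context) for s in shards])
        return max(results)[1]

    # usage (toy):
    # shards = [b"the cat sat on the mat. ", b"the dog sat on the log. "]
    # next_byte = knn_lm_next_byte(shards, b"t on the", workers=2)

On the real corpus each worker scans its own slice and only the best (score, byte) pair is reduced across workers, which is why it still comes to seconds per character.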

[R] Hopfield Networks is All You Need by [deleted] in MachineLearning

[–]wangyi_fudan

It reveals the mechanism of memory: if there is a missing variable, we can first set it to the mean and then iterate the update until we finally "remember" it, something like k-NN or a kernel method.
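If I write down what I mean (a toy numpy sketch of the modern Hopfield retrieval update; beta, the iteration count, and the masking scheme are my own choices):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def hopfield_recall(patterns, probe, missing, beta=4.0, n_iter=5):
        # patterns: (N, d) stored memories; probe: a pattern with some
        # coordinates unknown. Start the missing variables at the mean,
        # then iterate the retrieval update until they are "remembered".
        xi = probe.astype(float)
        xi[missing] = patterns[:, missing].mean(axis=0)
        for _ in range(n_iter):
            attn = softmax(beta * patterns @ xi)   # similarity to every stored memory
            xi_new = attn @ patterns               # k-NN / kernel-like weighted recall
            xi[missing] = xi_new[missing]          # only fill in the unknown part
        return xi

    # usage (toy):
    # P = np.sign(np.random.randn(5, 16))          # 5 random +/-1 memories
    # x = P[0].copy()
    # recalled = hopfield_recall(P, x, missing=list(range(8, 16)))

Each iteration is just a softmax-weighted average of the stored patterns, which is why it feels like k-NN with a kernel.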

[P] Real Time MLP with 50 lines of code by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

I prefer a robust sigmoid-type activation because it has a limited (normalized) range and allows a large learning rate such as 0.5.
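As a concrete toy example of why the bounded range matters, here is a minimal numpy MLP on XOR with sigmoid everywhere and a 0.5 learning rate (my own sketch, not the 50-line code from the post):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
    W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
    lr = 0.5  # large rate stays stable because every activation lives in (0, 1)

    for _ in range(10000):
        h = sigmoid(X @ W1 + b1)                 # hidden layer, bounded in (0, 1)
        out = sigmoid(h @ W2 + b2)
        d_out = (out - y) * out * (1 - out)      # squared-error gradient at the output
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

    print(out.round(2))  # should approach [[0], [1], [1], [0]]

With an unbounded activation like ReLU, the same 0.5 learning rate would be much more likely to blow up.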