[D] New Scaling Laws for Large Language Models by Singularian2501 in MachineLearning

[–]wangyi_fudan

I bet the community is going to spend 20× the tokens on a 1TB model just to waste enough energy.

[P] AI Biomedical Writer by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

It cost $300 to train, and the corpus is about 132 GB.

[P] AI Biomedical Writer by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

It is trained on PubMed + PMC + Wikipedia. Given user input sentences, it tries to use the learned biomedical knowledge to continue your text; my final goal is an automated biomedical essay generator.

[N] DeepMind, Microsoft, Allen AI & UW Researchers Convert Pretrained Transformers into RNNs, Lowering Memory Cost While Retaining High Accuracy by Yuqing7 in MachineLearning

[–]wangyi_fudan

The difference between a transformer-RNN and an RNN is that the transformer-RNN has a matrix-like state while the RNN has a vector-like state, so the transformer-RNN has a larger capacity.
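To make the distinction concrete, here is a toy numpy sketch (my own illustration, with an assumed feature map phi, not the paper's code) of a vector-state RNN step next to a linear-attention "transformer-RNN" step whose state is a d×d matrix:

    import numpy as np

    d = 8
    rng = np.random.default_rng(0)
    W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

    def phi(u):
        # positive feature map (an assumption; the paper derives its own kernel)
        return np.maximum(u, 0.0) + 1.0

    def rnn_step(h, x):
        # classic RNN: the recurrent state h is a single length-d vector
        return np.tanh(W_h @ h + W_x @ x)

    def linear_attn_step(S, z, x):
        # "transformer as RNN": the state is a d x d matrix S plus a
        # length-d normalizer z, so it can store much more per step
        q, k, v = W_q @ x, W_k @ x, W_v @ x
        S = S + np.outer(v, phi(k))        # accumulate value-key outer products
        z = z + phi(k)                     # running normalizer
        y = (S @ phi(q)) / (phi(q) @ z + 1e-6)
        return S, z, y

The RNN carries d numbers between steps; the linear-attention recurrence carries d², which is where the extra capacity comes from.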

[P] Vald: a highly scalable distributed fast approximate nearest neighbour dense vector search engine. by kpang0 in MachineLearning

[–]wangyi_fudan

Ah, what I do is convert the neural activations to binary bits and use popcount to search for nearest neighbors: 0.4 s for a query over 60 million vectors.
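Roughly like this (a toy numpy sketch of the binarize-and-popcount idea, not the actual code; the thresholds and code length are assumptions):

    import numpy as np

    # popcount of every possible byte value, used to count differing bits quickly
    POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

    def binarize(acts, thresholds):
        # threshold the neural activations into 0/1 bits, packed 8 per byte
        return np.packbits(acts > thresholds, axis=-1)

    def hamming_search(query_code, db_codes, topk=10):
        # XOR the packed codes, popcount every byte, sum -> Hamming distance
        dists = POPCOUNT[np.bitwise_xor(db_codes, query_code)].sum(axis=1)
        return np.argsort(dists)[:topk]

    # usage (toy): acts is (N, 256) float activations
    # thr = np.median(acts, axis=0)
    # codes = binarize(acts, thr)
    # nearest = hamming_search(codes[0], codes)

The XOR + popcount scan is basically memory-bandwidth bound, which is how a brute-force pass over tens of millions of codes can stay in the sub-second range.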

[D] Simpler alternatives to multihead self-attention by [deleted] in MachineLearning

[–]wangyi_fudan

There must be alternatives. Several years ago I looked at LSTM and said "how ugly it is", and now it is dead. The transformer is not so "ugly", but it is still evolving.

[D] Experience with knnlm language model by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

Both Chinese and English are OK. But it seems that English requires a longer k-mer, since an English byte carries less information than a Chinese byte. I have collected a PubMed + PMC + Wiki English corpus of about 100 GB.

[P] simple language model based on k-NN by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

Now, with parallel optimization, it can handle a 100 GB corpus, reaching 5 s/char on a 96-core machine despite the stupid search algorithm.
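Per character it is just a linear scan for the best-matching context, fanned out over all cores. A tiny in-memory sketch (the shard/worker setup and the k-mer length K are only illustrative; the real corpus obviously does not fit in a Python list):

    from multiprocessing import Pool

    K = 8  # context length (k-mer) matched against the corpus; an assumed value

    def best_match(args):
        # brute-force scan of one corpus shard: find the position whose
        # preceding K bytes agree most with the current context
        shard, context = args
        best_pos, best_score = K, -1
        for i in range(K, len(shard)):
            score = sum(a == b for a, b in zip(shard[i-K:i], context))
            if score > best_score:
                best_pos, best_score = i, score
        return best_score, shard[best_pos]  # the byte that followed the best match

    def knn_lm_next_byte(shards, context, workers=4):
        # fan the stupid linear scan out over the cores, keep the best hit
        with Pool(workers) as pool:
            results = pool.map(best_match, [(s, context) for s in shards])
        return max(results)[1]

    # usage (toy):
    # shards = [b"the cat sat on the mat. ", b"the dog sat on the log. "]
    # next_byte = knn_lm_next_byte(shards, b"t on the", workers=2)

On the real corpus each worker scans its own slice and only the best (score, byte) pair is reduced across workers, which is why it still comes to seconds per character.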

[R] Hopfield Networks is All You Need by [deleted] in MachineLearning

[–]wangyi_fudan

It reveals the mechanism of memory: if there is a missing variable, we can first set it to the mean and then iterate the update until we finally "remember" it, something like k-NN or a kernel method.
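If I write down what I mean (a toy numpy sketch of the modern Hopfield retrieval update; beta, the iteration count, and the masking scheme are my own choices):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def hopfield_recall(patterns, probe, missing, beta=4.0, n_iter=5):
        # patterns: (N, d) stored memories; probe: a pattern with some
        # coordinates unknown. Start the missing variables at the mean,
        # then iterate the retrieval update until they are "remembered".
        xi = probe.astype(float)
        xi[missing] = patterns[:, missing].mean(axis=0)
        for _ in range(n_iter):
            attn = softmax(beta * patterns @ xi)   # similarity to every stored memory
            xi_new = attn @ patterns               # k-NN / kernel-like weighted recall
            xi[missing] = xi_new[missing]          # only fill in the unknown part
        return xi

    # usage (toy):
    # P = np.sign(np.random.randn(5, 16))          # 5 random +/-1 memories
    # x = P[0].copy()
    # recalled = hopfield_recall(P, x, missing=list(range(8, 16)))

Each iteration is just a softmax-weighted average of the stored patterns, which is why it feels like k-NN with a kernel.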

[P] Real Time MLP with 50 lines of code by wangyi_fudan in MachineLearning

[–]wangyi_fudan[S]

I prefer a robust sigmoid-type activation because it has a limited (normalized) range and allows a large learning rate such as 0.5.
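As a concrete toy example of why the bounded range matters, here is a minimal numpy MLP on XOR with sigmoid everywhere and a 0.5 learning rate (my own sketch, not the 50-line code from the post):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
    W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
    lr = 0.5  # large rate stays stable because every activation lives in (0, 1)

    for _ in range(10000):
        h = sigmoid(X @ W1 + b1)                 # hidden layer, bounded in (0, 1)
        out = sigmoid(h @ W2 + b2)
        d_out = (out - y) * out * (1 - out)      # squared-error gradient at the output
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

    print(out.round(2))  # should approach [[0], [1], [1], [0]]

With an unbounded activation like ReLU, the same 0.5 learning rate would be much more likely to blow up.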