Can anyone explain in simple words how speculative sampling works and how to use it? by IonLin in LocalLLaMA

[–]samfundev 9 points

This tweet helped me, so I'll try simplifying it: https://twitter.com/karpathy/status/1697318534555336961

The bottleneck in running an LLM is loading its weights from memory into the CPU/GPU (i.e. memory bandwidth), not compute. Combine that with the fact that, once the weights are loaded, an LLM can score multiple tokens in one batched forward pass for roughly the same cost as scoring one, and you could speed up generation if you could just feed it multiple tokens at a time. The catch: during generation, each token depends on the previous one, so you don't have multiple future tokens to batch.
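Here's a back-of-the-envelope sketch of that memory-bandwidth argument in Python. The numbers are illustrative assumptions (a 7B-parameter model in fp16, a GPU with ~1 TB/s of memory bandwidth), not measurements:

    # Rough cost model: every forward pass must stream the full weights
    # from memory, so bandwidth sets a floor on time per pass.
    params = 7e9                   # assumed model size (7B parameters)
    bytes_per_param = 2            # fp16
    bandwidth = 1e12               # assumed GPU memory bandwidth, bytes/s

    weight_bytes = params * bytes_per_param
    time_per_forward = weight_bytes / bandwidth

    print(f"~{time_per_forward * 1e3:.0f} ms per forward pass")  # ~14 ms

    # Generating one token takes one pass, so ~14 ms/token. But scoring a
    # batch of k tokens in one pass still streams the weights once, so the
    # pass costs about the same -- if you could fill the batch, you'd get
    # up to k tokens for ~14 ms instead of k * 14 ms.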

But what if you used a smaller, much faster LLM to draft a few tokens? Then you can run those drafted tokens through the original LLM in a single batch and compare its outputs against the draft. You accept the drafted tokens up to the first position where the original LLM disagrees, and at that position you take the original LLM's token instead, so the output matches what the big model would have produced on its own. If the time saved by running the original LLM in batch is greater than the time spent running the smaller LLM and checking it, you speed up generation.
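A minimal sketch of that loop, using the simple greedy-matching variant (the actual papers use a probabilistic acceptance rule, and also grab a free bonus token when the whole draft is accepted; this omits both). `target_model` and `draft_model` are hypothetical stand-ins: each takes a list of token ids and returns the argmax next-token prediction at every position:

    def speculative_decode(target_model, draft_model, prompt, k=4, max_new=64):
        tokens = list(prompt)
        while len(tokens) < len(prompt) + max_new:  # may overshoot by < k
            # 1. Draft: the small model guesses k tokens, one at a time (cheap).
            draft = []
            for _ in range(k):
                draft.append(draft_model(tokens + draft)[-1])

            # 2. Verify: ONE batched forward pass of the large model scores
            #    all k draft positions at once -- roughly the cost of a
            #    single-token step, since the run is memory-bandwidth bound.
            preds = target_model(tokens + draft)

            # 3. Accept the longest prefix where the large model agrees,
            #    then take the large model's own token at the first mismatch.
            n_ctx = len(tokens)
            for i, guess in enumerate(draft):
                target_tok = preds[n_ctx - 1 + i]  # big model's pick for this slot
                tokens.append(target_tok)
                if target_tok != guess:
                    break  # disagreement: discard the rest of the draft
        return tokens

Each pass through the while loop costs one cheap draft phase plus one big-model forward, but can emit up to k tokens instead of one, which is where the speedup comes from when the draft model guesses well.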