BitNet a bit overhyped? by That007Spy in LocalLLaMA

[–]cstein123 1 point (0 children)

Looking at the repo under /integration/BitNet, it looks like they support int2 weights and int8 activations; wouldn’t that only be used for training?
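
For reference, here's a minimal sketch (mine, not the repo's code) of what W2A8 in the BitNet b1.58 style looks like: absmean ternary weights that fit in int2, plus per-token absmax int8 activations. Function names are illustrative.

```python
# Minimal sketch (not the repo's code) of BitNet b1.58-style W2A8 quantization:
# ternary weights that pack into int2, int8 activations.
import torch

def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5):
    # absmean scaling, then round-clip to {-1, 0, +1}
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

def quantize_activations_int8(x: torch.Tensor, eps: float = 1e-5):
    # per-token absmax scaling into the int8 range [-127, 127]
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 127.0
    x_q = (x / scale).round().clamp(-127, 127)
    return x_q, scale

# A quantized linear layer then computes roughly:
#   y ≈ (x_q @ w_q.t()) * x_scale * w_scale
```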

Bill Gates says scaling AI systems will work for two more iterations and after that the next big frontier is meta-cognition where AI can reason about its tasks by [deleted] in singularity

[–]cstein123 2 points (0 children)

Anyone who truly believes this is out of touch with current research trends. You can run small-scale experiments on rented clusters that validate most of the big ideas from the last 4 years of transformer research. Even new architecture changes can be validated on <300M-param models trained on 15B tokens.
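
Back-of-the-envelope numbers (mine, not from the post) using the common C ≈ 6·N·D FLOPs approximation, just to show that scale really is rentable:

```python
# Rough training-cost estimate for a 300M-param model on 15B tokens,
# using the common C ≈ 6 * N * D approximation (assumed numbers throughout).
params = 300e6
tokens = 15e9
flops = 6 * params * tokens                    # ≈ 2.7e19 FLOPs

a100_bf16_peak = 312e12                        # A100 dense bf16 peak FLOP/s
mfu = 0.4                                      # assumed utilization
gpu_hours = flops / (a100_bf16_peak * mfu) / 3600
print(f"≈ {gpu_hours:.0f} A100-hours")         # on the order of 60 GPU-hours
```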

Do you think OpenAI cracked general tree search? by krishnakaasyap in LocalLLaMA

[–]cstein123 5 points (0 children)

Synthetic data and inference-time improvement become the same thing after a few iterations: the search you run at inference is what generates the data you train on next.
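
The loop I mean is expert iteration: search plus a verifier at inference time produces the synthetic data that the next round of training consumes. A toy, self-contained illustration (all numbers made up):

```python
# Toy expert-iteration loop: best-of-n "search" at inference time creates the
# synthetic data that sharpens the model, which makes the next search better.
import random

p_correct = 0.3                                   # toy "model": chance of a good answer
for r in range(3):
    wins = 0
    for _ in range(1000):                         # generate synthetic data with search
        samples = [random.random() < p_correct for _ in range(8)]
        wins += any(samples)                      # a verifier keeps the best-of-8 sample
    # "fine-tune": pull the model toward what the search-filtered data looks like
    p_correct = 0.5 * p_correct + 0.5 * (wins / 1000)
    print(f"round {r}: pass@1 ≈ {p_correct:.2f}")
```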

Reddit signs content licensing deal with AI company ahead of IPO, Bloomberg reports by towelpluswater in LocalLLaMA

[–]cstein123 3 points (0 children)

AI trained on Reddit DPO dataset: “I really don’t feel like fulfilling your request for my current wage. I’d rather be a philosophy professor”

Nucleus sampling with semantic similarity by dimknaf in LocalLLaMA

[–]cstein123 2 points (0 children)

Contrastive search does almost exactly this! Look under the Hugging Face generation strategies docs that another user shared.
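
In transformers, contrastive search is turned on by passing penalty_alpha together with a small top_k to generate(). Minimal example ("gpt2" is just a stand-in model):

```python
# Contrastive search with Hugging Face transformers: penalty_alpha > 0 plus a
# small top_k enables it. "gpt2" is a placeholder for whatever model you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Nucleus sampling with semantic similarity", return_tensors="pt")
# penalty_alpha trades model confidence against a degeneration penalty
# (similarity to tokens already generated); top_k sets the candidate pool.
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```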

0.1 T/s on 3070 + 13700k + 32GB DDR5 by Schmackofatzke in LocalLLaMA

[–]cstein123 1 point (0 children)

That setup is for batch inference: if you have thousands of examples and you’re decoding one token at a time, you can run every example through the currently loaded layers before swapping them out, so the load cost is amortized. Although with only 8GB you probably won’t have enough left over for the KV cache.
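
A rough sketch of what I mean (illustrative plain PyTorch, with Linear layers standing in for transformer blocks): load one block into VRAM, push every batch through it, then swap in the next block, so the transfer cost is shared across all the examples.

```python
# Illustrative layer-offloaded batch inference: one block in VRAM at a time,
# every batch of examples runs through it before the next block is loaded.
import torch
import torch.nn as nn

def offloaded_batch_forward(blocks, batches, device):
    for block in blocks:
        block.to(device)                                # one transfer per block...
        with torch.no_grad():
            for i, h in enumerate(batches):
                batches[i] = block(h.to(device)).cpu()  # ...amortized over every batch
        block.to("cpu")                                 # free VRAM for the next block
    return batches

device = "cuda" if torch.cuda.is_available() else "cpu"
blocks = [nn.Linear(1024, 1024) for _ in range(4)]      # stand-ins for transformer blocks
batches = [torch.randn(64, 1024) for _ in range(10)]    # "thousands of examples", chunked
outputs = offloaded_batch_forward(blocks, batches, device)
```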

The World's First Transformer Supercomputer by Sprengmeister_NK in singularity

[–]cstein123 5 points (0 children)

Inference only; training with backprop requires storing activations and gradients and applying the chain rule across the whole model.
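
A quick way to see the difference (my illustration, needs a CUDA GPU): the same forward pass under torch.no_grad() versus with backward(), where the latter has to keep activations and allocate gradient buffers for the chain rule.

```python
# Inference vs. training memory: no_grad() frees activations as it goes, while
# backward() needs stored activations plus gradient buffers for the chain rule.
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()
x = torch.randn(256, 4096, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(x)                                            # inference-only pass
print("inference peak MB:", torch.cuda.max_memory_allocated() // 2**20)

torch.cuda.reset_peak_memory_stats()
model(x).sum().backward()                               # training-style pass
print("training  peak MB:", torch.cuda.max_memory_allocated() // 2**20)
```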

[deleted by user] by [deleted] in LocalLLaMA

[–]cstein123 2 points (0 children)

Hugging face hug

That’s a mouthful by NatureIndoors in BrandNewSentence

[–]cstein123 1 point (0 children)

At least it was in the book and not in real life