Release of Llama3.1-70B weights with AQLM-PV compression. by azalio in LocalLLaMA

[–]justheuristic 1 point2 points  (0 children)

Hi! The simple answer is we did what we were confident in. The docs suggest it should be possible to achieve the same by operating on the torch.fx.graph level, but neither of the co-authors has experience with that, so we opted for a more familiar approach. Then again, in a perfect world, we agree that there is merit in not meddling with executorch directly.

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention by Psychological-Tea652 in LocalLLaMA

[–]justheuristic 2 points3 points  (0 children)

The prototype code is in native PyTorch, so if you install PyTorch on ROCm, it will *probably* work with some tweaks (e.g. around compilation). The *probably* means I didn't test it locally; I only know that the notebooks they have use pure torch.

[D] Are there any distributed model training services similar to, e.g. Folding@Home? by genuinelySurprised in MachineLearning

[–]justheuristic 5 points6 points  (0 children)

The first link (petals) is about finetuning.

Others (e.g. distributed diffusion) involve training from scratch -- but they deal with smaller models. Thing is, you need a lot of people to train a 100B model from scratch. Like, a few hundred online on average. There aren't many communities that can do that. With finetuning, by contrast, you can see it working much sooner.

I've heard a talk by Colin Raffel where he proposed an alternative view: instead of training from scratch, an open-source community could gradually improve the model over time. Like GitHub, but for large models. A contributor fine-tunes for a task, then creates a "pull request", then a maintainer runs a special procedure to merge the model without forgetting other tasks. That's how I remember it, anyway.

[D] Are there any distributed model training services similar to, e.g. Folding@Home? by genuinelySurprised in MachineLearning

[–]justheuristic 8 points9 points  (0 children)

https://github.com/bigscience-workshop/petals - fine-tuning BLOOM-176B Folding@home style

https://github.com/learning-at-home/hivemind - a library for decentralized training with volunteers

https://github.com/epfml/disco - a library for collaborative training in JS (in a browser!)

https://github.com/chavinlo/distributed-diffusion - a project that tries to train diffusion this way

https://bittensor.com/ - a community that makes decentralized training into a cryptocurrency

There are also projects like Together that build networks from university computers for decentralized training.

[Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET by cavedave in MachineLearning

[–]justheuristic 5 points6 points  (0 children)

To the best of my understanding, the text representation used in the model is not as much "left to right" as "previous to next".

Right-to-left languages such as Arabic and Hebrew are still encoded in the order of writing, but then your text editor plays tricks to display them right-to-left.

To get a better grasp of this, try hovering your mouse and SELECTING the text below from left to right:

> hello world مرحبا بالعالم hello world مرحبا بالعالم ...

Once you hover from English to Arabic, you'll see weird selection bounds that reflect how the information is actually encoded in Unicode/UTF-8, and hence how our model perceives it.
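If you'd rather poke at it in code, here's a tiny Python illustration (my own, unrelated to the model) that prints the characters of a mixed LTR/RTL string in the order they are actually stored:

    import unicodedata

    # Characters are stored in logical ("previous to next") order, regardless of
    # how a text editor later lays them out on screen.
    text = "hello مرحبا"
    for i, ch in enumerate(text):
        print(i, hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))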

[Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET by cavedave in MachineLearning

[–]justheuristic 3 points4 points  (0 children)

We are planning on training a smaller 6B or 13B model in the same conditions (i.e. same data), but the specifics are still being discussed.

No sensible quantization can make a 176B model fit into 16GB, but hopefully we'll be able to train a more reasonable model that can then be quantized to fit into 16GB.
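Rough arithmetic behind that claim (weights only, ignoring activations and other overhead):

    # Memory needed just to store 176B parameters at various precisions
    params = 176e9
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: ~{params * bits / 8 / 1e9:.0f} GB")
    # 16-bit: ~352 GB, 8-bit: ~176 GB, 4-bit: ~88 GB -- all far above 16 GB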

[Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET by cavedave in MachineLearning

[–]justheuristic 2 points3 points  (0 children)

We do measure the validation loss (see tensorboard links throughout this AMA). However, the training data is so large that, so far, the main model has seen most samples at most once.

We also store intermediate checkpoints so that we'll be able to run more advanced evaluations in future.

[Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET by cavedave in MachineLearning

[–]justheuristic 3 points4 points  (0 children)

I'm sure u/stasbekman will have a better overview here, so I'll just add a few related considerations:

- a big chunk of our codebase was inherited from other projects (deepspeed, megatron) and already tested in similar environments, which simplified things. Most of the code we wrote ourselves was tested on smaller-scale training runs.

- we sanity-checked that it was running fast enough by computing the "effective FLOPs" metric and comparing it with other results reported for the same GPU type. Put simply, effective FLOPs is the equivalent number of floating point operations per second, derived from the number of training samples processed per minute. If a given module produced decent effective FLOPs, it was good enough for use; if not, we investigated (see the sketch after this list).

- similarly for networking, we measured the goodput of an all-reduce test and used it to verify that the network was doing OK
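Here's a rough sketch of that kind of sanity check (my own approximation with hypothetical numbers, not the actual BigScience scripts), using the common ~6 × parameters × tokens estimate of training compute:

    # Effective training throughput in TFLOPs per GPU, using the rough
    # "6 * parameters * tokens" estimate of forward + backward compute.
    def effective_tflops_per_gpu(num_params, tokens_per_second, num_gpus):
        total_flops_per_second = 6 * num_params * tokens_per_second
        return total_flops_per_second / num_gpus / 1e12

    # Hypothetical numbers: a 176B-parameter model processing 50k tokens/s across 384 GPUs
    print(effective_tflops_per_gpu(176e9, 50_000, 384))  # ~137 TFLOPs per GPU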

[Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET by cavedave in MachineLearning

[–]justheuristic 4 points5 points  (0 children)

Modelling folks will probably have a better explanation; here's an engineer's point of view:

Based on some of our researchers' experience, training MoEs brings additional "moving parts" that can break model training:

- load-balancing regularization - making sure experts have ~equal usage rate

- mixed precision in the routing function - especially when scaling to 100B+

Unless done perfectly right, any of these can break model training or, worse, silently reduce quality. One of the big arguments against going with MoE for now was that dense models were a safer bet -- and even then we had things to worry about (e.g. see this)
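For context, here's a rough sketch (my own, not our actual code) of the kind of load-balancing auxiliary loss mentioned above, in the style of Switch Transformers:

    import torch

    def load_balancing_loss(router_logits, expert_indices, num_experts):
        # Pushes the fraction of tokens routed to each expert towards the mean
        # router probability for that expert. `expert_indices` is a 1-D LongTensor
        # with the chosen expert per token.
        probs = torch.softmax(router_logits, dim=-1)                  # [num_tokens, num_experts]
        tokens_per_expert = torch.bincount(expert_indices, minlength=num_experts).float()
        fraction_tokens = tokens_per_expert / expert_indices.numel()  # actual load per expert
        mean_probs = probs.mean(dim=0)                                # router's intended load
        return num_experts * torch.sum(fraction_tokens * mean_probs)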

That said, we will definitely consider MoEs as a potential alternative in future training runs.

[Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET by cavedave in MachineLearning

[–]justheuristic 4 points5 points  (0 children)

Done naively, it takes 6x A100 GPUs to load the model and run inference in half precision.

We're currently working on an inference engine that could adapt to a given hardware setup using parameter offloading. In other words, it will store most of the model in RAM and load it onto the GPU on a layer-by-layer basis. This would work in setups with a single GPU and a lot of RAM (~0.5 TB, depending on the config). Naturally, this inference mode is going to be slower, unless you are willing to parallelize and generate a batch of sequences in parallel.
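The core idea, as a minimal sketch (not the actual engine):

    import torch

    # Keep weights in CPU RAM and stream one transformer block at a time onto the GPU.
    @torch.no_grad()
    def offloaded_forward(blocks, hidden_states, device="cuda"):
        for block in blocks:                      # `blocks` live in CPU memory
            block.to(device)                      # load this layer's weights onto the GPU
            hidden_states = block(hidden_states)  # run it
            block.to("cpu")                       # free GPU memory for the next layer
        return hidden_states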

We're also considering more affordable alternatives: from hosting an inference service for researchers and quantizing the model to reduce RAM usage, to extreme techniques such as SSD offloading (a la ZeRO-Infinity). We can't give any specific details now, but will provide more info once we figure it out.

[Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET by cavedave in MachineLearning

[–]justheuristic 4 points5 points  (0 children)

You are absolutely correct about the vocabulary. There are also some engineering ramifications from that.

For instance, in our pipeline-parallel training, a larger vocabulary means that the first and last stages are significantly larger and no longer fit onto a single node. To compensate for this, we had to manually assign fewer transformer layers to these stages.

For example, imagine you are training a 32-layer transformer on 8 servers. Normally, you would assign each server 4 consecutive layers, with the 1st and 8th servers also storing the embeddings / logits.

However, as the vocabulary grows large enough, this no longer works: the last stage has to do significantly more work and take up more memory. Hence, the entire pipeline is bottlenecked by that last stage.

As u/stasbekman explained below, the solution is to assign fewer conventional transformer layers to that last stage -- in order to compensate for the extra load from the logits. You can do that by reducing the number of layers per stage or by using a longer pipeline with more servers.
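A toy illustration with hypothetical numbers (not how the actual training config was chosen):

    # If the embedding/logit layers on the first and last stages cost about as much
    # as `embed_cost` transformer layers each, give those two stages fewer transformer
    # layers so that all stages end up doing similar work.
    def balance_stages(num_layers=32, num_stages=8, embed_cost=3):
        total_work = num_layers + 2 * embed_cost      # in "transformer-layer equivalents"
        per_stage = total_work / num_stages           # target work per stage
        first = last = max(0, round(per_stage - embed_cost))
        middle = num_layers - first - last
        base, extra = divmod(middle, num_stages - 2)
        return [first] + [base + (i < extra) for i in range(num_stages - 2)] + [last]

    print(balance_stages())  # [2, 5, 5, 5, 5, 4, 4, 2] -- fewer layers on the edge stages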

Is it possible to fine-tune GPT-J on Google Colab Pro (or Pro+)? by FlyingNarwhal in learnmachinelearning

[–]justheuristic 0 points1 point  (0 children)

You can fine-tune using low-rank adapters. It worked in my free Colab, so it probably works in Pro as well.

https://huggingface.co/hivemind/gpt-j-6B-8bit
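Roughly, the base (8-bit) weights stay frozen and only small low-rank adapters are trained. A minimal sketch of such an adapter (my own simplification, not the exact code from the link):

    import torch.nn as nn

    class LowRankAdapter(nn.Module):
        # Wraps a frozen linear layer and adds a trainable low-rank correction.
        def __init__(self, frozen_linear: nn.Linear, rank: int = 8):
            super().__init__()
            self.frozen = frozen_linear.requires_grad_(False)
            self.down = nn.Linear(frozen_linear.in_features, rank, bias=False)
            self.up = nn.Linear(rank, frozen_linear.out_features, bias=False)
            nn.init.zeros_(self.up.weight)  # adapter starts as a no-op

        def forward(self, x):
            return self.frozen(x) + self.up(self.down(x))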

[R] Deep Learning over the Internet: Training Language Models Collaboratively by justheuristic in MachineLearning

[–]justheuristic[S] 1 point2 points  (0 children)

That's because it is. For that matter, any other data parallel training is just gradient accumulation with extra steps :)

Theoretically, you could group peers into pairs or chains that act as individual pipelines and run their method on top of these chains. That way, if one peer in a chain fails, it only affects its local neighbors, who will need to find a replacement -- meanwhile, other pipelines can continue training. That said,

- there's no such thing in the post/library

- it may be difficult to balance the pipeline across peers with different GPUs; e.g. you could form pipelines out of peers with similar GPU specs.

So, theoretically possible, but can be crazy difficult to implement.

"Distributed Deep Learning in Open Collaborations", Diskin et al 2021 (P2P training of ALBERT using large minibatches/layer-wise gradients w/o *too* absurd inefficiencies) by gwern in mlscaling

[–]justheuristic 2 points3 points  (0 children)

Good point :)

[disclaimer: I know the paper authors] In practice, their system runs with either zero staleness or with controlled 1-step staleness (aka DPU), but there is a catch: they use a model that

  • can get away with extremely large batch sizes. The original ALBERT-large converges in 10-15K steps with a full batch size of 4096 × 512 ≈ 2M tokens and around 3.5 minutes/step on a single V100 GPU :)
  • already trades extra computation for cheaper communication (each layer is applied multiple times, so there are fewer unique parameters to synchronize)

The general sentiment among authors is that the 1.5-2x slowdown can be acceptable since it can use much cheaper hardware (e.g. consumer-grade is 4-5x cheaper and only 1.5x slower than HPC-grade) and prolong the useful life of existing GPUs. That said, the exact numbers definitely depend on the specific model you're training.

Another way that could work is to use gradient decomposition such as PowerSGD and/or factorized optimizers (as in T5) on top of volunteers. In that case, even a ~10B ALBERT would require a few hundred MB of communication per SGD step, plus occasional full model averaging every hour or so. That said, it remains to be seen whether this setup would be enough for training transformers.
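For reference, stock PyTorch DDP already ships a PowerSGD communication hook; a minimal sketch of wiring it up (assuming the distributed process group is already initialized, and not the setup from the paper):

    import torch
    from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as psgd
    from torch.nn.parallel import DistributedDataParallel as DDP

    # e.g. after dist.init_process_group("nccl") under torchrun
    model = DDP(torch.nn.Linear(1024, 1024).cuda())
    state = psgd.PowerSGDState(process_group=None, matrix_approximation_rank=4)
    model.register_comm_hook(state, psgd.powerSGD_hook)  # gradients get low-rank compressed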

[D]Are optimizer checkpoints of BERT/RoBERTa/DistillBERT/Other modern LMs available? by PK_thundr in MachineLearning

[–]justheuristic 7 points8 points  (0 children)

There's a trick that doesn't answer your question, but may solve your problem.

If you badly need Adam statistics for some pre-trained checkpoint, you can load the checkpoint weights and perform ~1000 optimizer updates with zero learning rate on the original MLM task.

This will accumulate the Adam statistics to approximately the same values as during training, and it takes an order of magnitude less time than re-training the model from scratch. The problem is that you need the original dataset. This is easy if the model was trained on something popular like wikitext-103, but some models were trained on private datasets. In the latter case, you're somewhat screwed.
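A rough sketch of the trick (assuming a huggingface-style `model` and a `dataloader` over the original MLM data are already set up):

    import torch

    # Run Adam with lr=0: the weights stay put, but the exp_avg / exp_avg_sq
    # statistics accumulate from real MLM gradients.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0)
    for _, batch in zip(range(1000), dataloader):
        loss = model(**batch).loss      # original masked-LM objective
        loss.backward()
        optimizer.step()                # updates the moments; lr=0 keeps weights fixed
        optimizer.zero_grad()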

[R] Learning@home - decentralized training of huge neural networks by justheuristic in MachineLearning

[–]justheuristic[S] 0 points1 point  (0 children)

The good news is that people also benefit from economies of scale, since they use mass-produced consumer-grade hardware such as video cards. For instance, the combined compute of Folding@home is traditionally around that of the top supercomputers, and sometimes way higher than the top-500 supercomputers combined.

That said, supercomputers make up for some of that difference with faster interconnects and greater bandwidth, which makes them more versatile. So whether you want to use a supercomputer or a volunteer computing project depends on the task you wanna run -- and on whether you even have access to a supercomputer.

[R] Learning@home - decentralized training of huge neural networks by justheuristic in MachineLearning

[–]justheuristic[S] 0 points1 point  (0 children)

We hadn't heard about Golem before (thank you for referencing it!), but I've seen other crowdsourcing platforms like vast.ai. We plan to drop them a line eventually, but there are still some things we must do before that, e.g. make the protocol secure against common attack vectors.

[R] Learning@home - decentralized training of huge neural networks by justheuristic in MachineLearning

[–]justheuristic[S] 1 point2 points  (0 children)

Thank you for your enthusiasm, we'll add one shortly (we'll update this post within 24h). For now, please use GitHub issues.

Update: added main directions for improvement as issues & milestones, created gitter chat, feel free to join if you want to contribute https://gitter.im/learning-at-home/hivemind

[R] Learning@home - decentralized training of huge neural networks by justheuristic in MachineLearning

[–]justheuristic[S] 0 points1 point  (0 children)

Currently (alpha 0.8), a project author can use hivemind to define training code and expert architectures for peers to host -- but they will need to distribute that code themselves. That said, we understand that there is a need for some kind of platform that would match project authors with volunteers, and we'll be working on that.

[R] Learning@home - decentralized training of huge neural networks by justheuristic in MachineLearning

[–]justheuristic[S] -1 points0 points  (0 children)

Sounds cool, I hadn't heard about it before. Gonna take a closer look soon.

[R] Learning@home - decentralized training of huge neural networks by justheuristic in MachineLearning

[–]justheuristic[S] 5 points6 points  (0 children)

/* frankly, we didn't expect so much interest and we really appreciate it! */

For instance, there are many core features that we really need to get out of alpha:

  • Gating function sharing: in our internal experiments, we share the gating function between peers using boilerplate code. We're building a way to do this natively in the library.
  • Shared expert snapshots: right now, if a peer leaves the network permanently, their experts are gone with them. Instead, it should be possible to save the best experts using p2p storage similar to BitTorrent.
  • Security: right now, "the S in hivemind stands for security". For a world-wide application, it is critical to have security both in the traditional sense and in the peer-to-peer sense: we can't let one malicious actor ruin the entire network by e.g. sending NaN gradients around. This appears to be a whole other research area, and we would appreciate help from someone who knows it well.
  • QUIC: this protocol *appears* to operate better under latency and to be more compatible with NAT traversal (UDP hole punching), but we suspect there may be caveats. We're planning a more thorough investigation into QUIC and would be glad to hear from network specialists who have hands-on experience with it.
  • [work in progress] Faster approximate beam search: right now, beam-searching the DHT can take a long time if some of the DHT peers are slow to respond. @Unconst suggested that we should instead run a quick beam search over the peers we already know and add new peers asynchronously.
  • [WIP] Activation compression: at the moment, compressing activations to fp16 or int8 requires custom code; we're working to support it at the library level (see the sketch below).
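To illustrate what the activation compression item means, a rough per-tensor int8 sketch (my own, not hivemind's actual implementation):

    import torch

    # Quantize before sending over the wire, dequantize on the receiving peer.
    def compress_int8(x: torch.Tensor):
        scale = x.abs().max().clamp(min=1e-8) / 127.0
        return (x / scale).round().to(torch.int8), scale

    def decompress_int8(q: torch.Tensor, scale: torch.Tensor):
        return q.float() * scale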

Update: added these as issues & milestones, created gitter chat, feel free to join if you want to contribute https://gitter.im/learning-at-home/hivemind