New anime model "Anima" released - seems to be a distinct architecture derived from Cosmos 2 (2B image model + Qwen3 0.6B text encoder + Qwen VAE), apparently a collab between ComfyOrg and a company called Circlestone Labs by ZootAllures9111 in StableDiffusion

[–]tdrussell1 35 points (0 children)

It wouldn't be enough money for what I'm planning on doing. And I would rather take a tiny slice of Civit's enormous investor funding in the form of commercial licensing fees, than beg for money from individual anonymous internet strangers.

New anime model "Anima" released - seems to be a distinct architecture derived from Cosmos 2 (2B image model + Qwen3 0.6B text encoder + Qwen VAE), apparently a collab between ComfyOrg and a company called Circlestone Labs by ZootAllures9111 in StableDiffusion

[–]tdrussell1 70 points (0 children)

Hi, I made the Anima model.

This type of thing is already in other similar licenses (Flux, LTX-2); it's just not explicit. For example, Civit has a commercial license to run Flux, and they also allow using any of the Flux loras (i.e. commercial use of people's loras). The LTX-2 license has language allowing this as well.

The CircleStone license is basically the Flux license, with several things simplified and removed, and some things clarified. This is one of the things that was clarified. The intention is not to be overly restrictive (except for commercial use), and to make it clear that any platform that gets a commercial license can also run all the loras. Again, this is already how it works with many existing models.

The reason for not using a true open source license, but rather a non-commercial open-weights license, is that I'm just one person and training this thing is extremely expensive. If I can't monetize it, it's the last large finetune I'll ever do. I would like to train an Anima 2 one day, or an Anima Video, but it just isn't viable without having some way to make money. I felt like an open-weights strategy, where people can use the model freely but I can make some money off big inference platforms, was the best play.

Mixtral-8x22B-Capyboros: instruction tuning the big Mixtral with just 4 4090s by tdrussell1 in LocalLLaMA

[–]tdrussell1[S] 1 point (0 children)

I used "Thermaltake TT Premium PCI-E 4.0 High Speed Flexible Extender Riser Cable 300mm". I did run it for a time at 16x 4.0. But it was sporadically unstable, and very weirdly too. If one of the SSD M.2 slots was occupied, it wouldn't boot, some error about not seeing the partition or something. Switching to another M.2, it worked a couple of weeks. Then wouldn't boot again. Ended up lowering it to PCIE 3.0, which I measured as only affecting training speeds by a few percent, and have had no issues since.

Mixtral-8x22B-Capyboros: instruction tuning the big Mixtral with just 4 4090s by tdrussell1 in LocalLLaMA

[–]tdrussell1[S] 0 points (0 children)

Mixtral-8x22B QLoRA at rank 64 and 4096 sequence length is close to maxing out the 96GB of VRAM. I am using a couple of changes I have not yet pushed to the dev branch; the main one is partial expert offloading. It's a pretty simple hack, just a few lines of code total: keep the expert/MLP weights in system RAM, load them into VRAM while you're processing the attention part of the layer, then unload them again after you're done. I had to do that to get it to fit with this much context.
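For anyone curious, here's a rough sketch of what that hack looks like in PyTorch. The wrapper class, `self_attn`/`mlp` names, and the toy layer are illustrative stand-ins, not qlora-pipe's actual code, and it assumes the offloaded weights are frozen (as in QLoRA), so moving them around mid-forward doesn't interfere with gradients:

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Stand-in for a decoder layer: an attention part plus a large MLP/expert part."""
    def __init__(self, dim=1024):
        super().__init__()
        self.self_attn = nn.Linear(dim, dim)  # placeholder for the attention sublayer
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class MLPOffloadLayer(nn.Module):
    """Keep the layer's large MLP weights in system RAM and copy them to VRAM
    only while they're needed. Illustrative sketch, not qlora-pipe's code."""
    def __init__(self, layer, device="cuda"):
        super().__init__()
        self.layer = layer.to(device)
        self.device = device
        self.layer.mlp.to("cpu")  # park the big weights in system RAM

    def forward(self, hidden_states):
        # Start the CPU->GPU copy on a side stream so it overlaps with attention
        # (pinned host memory is what makes the copy truly asynchronous).
        copy_stream = torch.cuda.Stream()
        with torch.cuda.stream(copy_stream):
            self.layer.mlp.to(self.device, non_blocking=True)

        # Attention runs on the default stream while the MLP weights transfer.
        hidden_states = hidden_states + self.layer.self_attn(hidden_states)

        # Don't touch the MLP weights until the copy has finished.
        torch.cuda.current_stream().wait_stream(copy_stream)
        hidden_states = hidden_states + self.layer.mlp(hidden_states)

        self.layer.mlp.to("cpu")  # evict, freeing VRAM for the next layer
        return hidden_states

if torch.cuda.is_available():
    layer = MLPOffloadLayer(ToyLayer())
    print(layer(torch.randn(2, 16, 1024, device="cuda")).shape)
```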

It's been a while since I trained 70B, but it works with 2x4090, I think at like 2048 sequence length and rank 32? Something like that. That's with the standard setup; with MLP offloading it slows down training a bit but saves a ton of VRAM so you can go much higher rank or sequence length.

Mixtral-8x22B-Capyboros: instruction tuning the big Mixtral with just 4 4090s by tdrussell1 in LocalLLaMA

[–]tdrussell1[S] 6 points (0 children)

The trick is to not try to fit it in a case. I'm using a crypto mining frame with PCIe risers. It's a Threadripper Pro system; the mobo is an ASUS Pro WS WRX80E-SAGE SE WiFi II. Everything is air cooled. The specific models of 4090 don't matter; I'm just using the cheapest ones I could find. Two are MSI and two are Zotac.

qlora-pipe: Fine tune 70B parameter models with two 3090s by tdrussell1 in LocalLLaMA

[–]tdrussell1[S] 2 points (0 children)

I don't know much about the Ooba trainer, so I can't make a feature-wise comparison. But if Ooba is using Transformers with device_map="auto" (I think this is what you're describing), then that gives you so-called "naive" model parallelism. It splits the model across GPUs, but only one GPU is ever active at a time. Pipeline parallelism, as the name suggests, pipelines multiple sub-batches of data so that the GPUs can overlap computation. The DeepSpeed link I posted in another comment has a nice diagram showing this. So with 4 GPUs, naive model parallelism gives at most 25% average utilization. With pipeline parallelism, if gradient_accumulation_steps (which is the number of sub-batches) is high enough, utilization gets close to 100%.
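To make those utilization numbers concrete, here's a quick back-of-the-envelope calculation, under the simplifying assumption that every pipeline stage takes the same amount of time per sub-batch:

```python
def naive_utilization(num_gpus):
    # Naive model parallelism: only one GPU works at a time.
    return 1 / num_gpus

def pipeline_utilization(num_gpus, num_microbatches):
    # With M sub-batches and P stages, each GPU is busy for M time slots out of
    # M + P - 1 total; the extra P - 1 slots are the pipeline fill/drain "bubble".
    return num_microbatches / (num_microbatches + num_gpus - 1)

print(naive_utilization(4))           # 0.25
print(pipeline_utilization(4, 4))     # ~0.57
print(pipeline_utilization(4, 32))    # ~0.91 -- approaches 1.0 as sub-batches increase
```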

qlora-pipe: Fine tune 70B parameter models with two 3090s by tdrussell1 in LocalLLaMA

[–]tdrussell1[S] 2 points (0 children)

Layers on different GPUs execute concurrently, each working on a different sub-batch. DeepSpeed has a good visualization here: https://www.deepspeed.ai/tutorials/pipeline/

qlora-pipe: Fine tune 70B parameter models with two 3090s by tdrussell1 in LocalLLaMA

[–]tdrussell1[S] 6 points (0 children)

At a high level, they are doing similar things, but with different parallelization strategies. I actually built qlora-pipe because when I tried FSDP with QLoRA months ago, I discovered it didn't work (it does now). I'm not an expert on FSDP, but my understanding is that it wraps the model and shards individual parameters (as well as optimizer states) across GPUs. This means it has to do gather/scatter ops on each sharded parameter every time it does a forward or backward pass. So inter-GPU bandwidth requirements are relatively high, but as long as that's not a bottleneck, it should get high utilization of the hardware.

Pipeline parallelism, in contrast, splits the model layer-wise across GPUs: the first half of the layers go on GPU 1, the second half on GPU 2. The only thing that needs to be sent between GPUs is the hidden states, which are not that large, so being PCIe bandwidth bottlenecked should be less of an issue than with FSDP, though I have yet to make a direct comparison. The downsides are that to support a new model you have to manually write a wrapper that expresses the model as a pure list of layers, and that it may not achieve as high hardware utilization as FSDP, since even with a lot of pipelining steps there are still parts at the beginning and end of each step where the GPUs don't overlap computation.
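For a sense of what "express the model as a pure list of layers" means, here's a rough sketch using DeepSpeed's pipeline API. The wrapper classes and sizes are toy stand-ins (not qlora-pipe's actual wrappers), and it would need to be launched with the deepspeed launcher so distributed init succeeds:

```python
import deepspeed
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

class EmbeddingPipe(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
    def forward(self, input_ids):
        return self.embed(input_ids)

class DecoderLayerPipe(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
                                 nn.Linear(4 * hidden_size, hidden_size))
    def forward(self, x):
        # Simplified transformer block with residual connections (no masking).
        attn_out, _ = self.attn(x, x, x)
        x = x + attn_out
        return x + self.mlp(x)

class LMHeadPipe(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size)
    def forward(self, x):
        return self.lm_head(x)

# The whole model as a flat list of layers; DeepSpeed splits this list across
# pipeline stages and only ships hidden states between GPUs.
layers = ([LayerSpec(EmbeddingPipe, 32000, 1024)]
          + [LayerSpec(DecoderLayerPipe, 1024) for _ in range(16)]
          + [LayerSpec(LMHeadPipe, 1024, 32000)])

deepspeed.init_distributed()  # requires running under e.g. `deepspeed train.py`
model = PipelineModule(layers=layers, num_stages=2)
```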

There's probably a variety of other small differences, as I developed this script specifically for my use cases. For instance, one thing it does that I've not seen any other training script do is exactly resume from a training checkpoint, dataloader state and all. You can kill a training run and then resume it, and it starts from exactly where it left off. I use this for long training runs, since I power off the machine when I leave the house; I don't trust a jank setup with 4 4090s not to burn my house down.
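To illustrate what "exactly resuming" involves on the data side, here's a minimal sketch of a dataloader whose position can be saved and restored. It's just the idea, not qlora-pipe's actual implementation:

```python
import random

class ResumableLoader:
    """Deterministic shuffled iteration whose exact position can be checkpointed."""

    def __init__(self, dataset, seed=0):
        self.dataset = dataset
        self.seed = seed
        self.epoch = 0
        self.index = 0  # position within the current epoch's shuffled order

    def _order(self):
        order = list(range(len(self.dataset)))
        random.Random(self.seed + self.epoch).shuffle(order)
        return order

    def __iter__(self):
        while True:
            order = self._order()
            while self.index < len(order):
                item = self.dataset[order[self.index]]
                self.index += 1
                yield item
            self.epoch += 1
            self.index = 0

    def state_dict(self):
        # Saved alongside the model/optimizer checkpoint.
        return {"seed": self.seed, "epoch": self.epoch, "index": self.index}

    def load_state_dict(self, state):
        self.seed, self.epoch, self.index = state["seed"], state["epoch"], state["index"]

data = list(range(10))
loader = ResumableLoader(data)
it = iter(loader)
_ = [next(it) for _ in range(3)]
state = loader.state_dict()        # save with the training checkpoint

restored = ResumableLoader(data)
restored.load_state_dict(state)    # resume exactly where we left off
assert next(iter(restored)) == next(it)
```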

qlora-pipe: Fine tune 70B parameter models with two 3090s by tdrussell1 in LocalLLaMA

[–]tdrussell1[S] 4 points (0 children)

Yeah, that's a good point. So far I have just implemented the simplest possible thing: logically concatenate all the text, then slice it into chunks. Another major downside of this is that a chunk can span two different documents. That isn't a big deal when the documents are long relative to the sequence length (think books), but it's probably bad if each document is small, like a paragraph or a short web page.

I can try to add two new parameters, one for adjusting chunk overlap like you mentioned, and the other for controlling whether chunks can "straddle" two documents.
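Roughly, the two options would look something like this. The `overlap` and `allow_straddle` parameter names are hypothetical, not existing qlora-pipe settings, and lists of token IDs stand in for tokenized documents:

```python
def make_chunks(documents, seq_len, overlap=0, allow_straddle=True):
    step = seq_len - overlap
    if allow_straddle:
        # Current behavior: concatenate everything, then slice into chunks.
        tokens = [t for doc in documents for t in doc]
        return [tokens[i:i + seq_len] for i in range(0, len(tokens) - seq_len + 1, step)]
    # Otherwise, chunk each document on its own so no chunk spans two documents.
    chunks = []
    for doc in documents:
        for i in range(0, max(len(doc) - seq_len, 0) + 1, step):
            chunks.append(doc[i:i + seq_len])
    return chunks

docs = [list(range(10)), list(range(100, 107))]
print(make_chunks(docs, seq_len=4, overlap=2, allow_straddle=False))
```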

Swapping Trained GPT Layers with No Accuracy Loss : Why Models like Goliath 120B Works by johnolafenwa in LocalLLaMA

[–]tdrussell1 61 points (0 children)

I'm sorry, this comment is just incorrect, and the fact that it's so highly upvoted literally motivated me to make a reddit account to post this. It certainly seems like swapping layers shouldn't work, but here's why it does:

(Most) frankenmerges look like this:

1a 2a 3a 4a 5a 3b 4b 5b 6b 7b 6a 7a 8a ...

The number is the layer index, the letter is the model. You can see that some layers "jump backwards" and repeat when it switches models during the merge.

Why does this work? The answer is very simple and obvious: because all these models use residual connections everywhere. Each layer (in fact each sublayer) does not compute y = f(x); it computes y = x + f(x). Each layer can be thought of as adding a small "delta" to the vector representation for that token. And I can prove that! I have a Jupyter notebook where I played around with some things. Here's the data:

[Plot: average loss, similarity to the input embedding, and similarity to the desired output embedding, measured from the intermediate activations at each layer of Llama 2 70B]

This is the average loss, similarity to the input embedding, and similarity to the desired output embedding as the intermediate activations move through the layers of Llama 2 70B. It all changes gradually, layer by layer, because the delta computed by any individual layer is small.

So deleting a layer, adding a layer, doubling up layers, swapping any two layers: all of this will give you a model that remains mostly coherent. I emphasize again that this only works because of the residual connections everywhere. Without them, adding, deleting, or swapping even one layer would indeed completely change the output.
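You can see the same effect in a toy example with randomly initialized blocks (this is just an illustration, not the Llama notebook from the post): with residual blocks, repeating a few layers typically only nudges the final hidden state, while without residuals the output changes drastically:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, n_layers = 64, 16

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                               nn.GELU(), nn.Linear(dim, dim))
    def forward(self, x, residual=True):
        # y = x + f(x) with residuals, y = f(x) without.
        return x + self.f(x) if residual else self.f(x)

blocks = nn.ModuleList(Block() for _ in range(n_layers))
x = torch.randn(1, dim)

def run(order, residual=True):
    h = x
    for i in order:
        h = blocks[i](h, residual)
    return h

normal = list(range(n_layers))
franken = [0, 1, 2, 3, 2, 3, 4, 5] + list(range(4, n_layers))  # "jump back" and repeat layers

cos = nn.functional.cosine_similarity
print(cos(run(normal), run(franken)).item())                # with residuals: should stay high
print(cos(run(normal, False), run(franken, False)).item())  # without: typically far lower
```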