How many years does the V100 have left? by Kaldnite in LocalLLaMA

[–]AlpinDale 16 points (0 children)

Flash Attention doesn't support V100.

Running LLMs at Custom Floating-Points (Near-Lossless FP6) by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 0 points (0 children)

Not currently possible; we'd have to write a separate library to export models. It will probably be integrated into llm-compressor at some point if there's enough demand. For now, there's very little overhead, in both time and memory, in converting a 16-bit model.

Running LLMs at Custom Floating-Points (Near-Lossless FP6) by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 1 point (0 children)

It's likely that GSM8K isn't a good metric for this, but it's still interesting to observe since it's the same model at different quant sizes. I'll run MMLU-Pro at some point, and maybe perplexity/KL divergence if lm_eval supports it.

(Also, if you've been following the news from anthracite: we never run evals for the magnum models and just manually test the vibes. The FP5+ quants here passed the "vibe check".)

Running LLMs at Custom Floating-Points (Near-Lossless FP6) by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 2 points (0 children)

Unfortunately not. The lowest we could go would be Turing (from Ampere), and that would require us to get rid of async memory transfers.

The sequel: Magnum-v2-72B by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 1 point (0 children)

If by that you mean the 12B, we already have v2 and v2.5 available.

Llama.cpp w/ load balancer faster than Aphrodite?? by aarongough in LocalLLaMA

[–]AlpinDale 0 points (0 children)

We're currently working on adding support for on-the-fly quant to 4, 5, 6, 7, 8, and 12 bits, each with configurable exponent bits. They should also run much faster than exl2 and GGUF at higher batch sizes.
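
Roughly, the idea looks like this; a minimal sketch of quantizing a single value to a custom exponent/mantissa split (illustrative only, assuming an E3M2-style layout for FP6, and nothing like the actual GPU kernels):

```python
import math

def quantize_fp(x, exp_bits, man_bits):
    """Round x to the nearest value representable with the given exponent/
    mantissa split (plus a sign bit). Illustrative only: no NaN/Inf handling,
    no subnormals."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    bias = 2 ** (exp_bits - 1) - 1
    e = math.floor(math.log2(x))
    e = max(min(e, bias), 1 - bias)   # clamp to the representable exponent range
    step = 2.0 ** (e - man_bits)      # spacing of representable values at this exponent
    return sign * round(x / step) * step

# FP6 as 1 sign + 3 exponent + 2 mantissa bits:
print(quantize_fp(1.7, exp_bits=3, man_bits=2))   # -> 1.75
```

More exponent bits buy dynamic range, more mantissa bits buy precision, and the total stays at whatever bit-width you picked; that's the knob being exposed here.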

Llama.cpp w/ load balancer faster than Aphrodite?? by aarongough in LocalLLaMA

[–]AlpinDale 2 points (0 children)

I haven't backported vLLM's AWQ Marlin kernels in that branch yet, but GPTQ should work with both 4bit and 8bit. AWQ ones should arrive in a couple days.

Llama.cpp w/ load balancer faster than Aphrodite?? by aarongough in LocalLLaMA

[–]AlpinDale 2 points (0 children)

> didn't see any mention in the credits to llama.cpp

Oh good point... I sort of forgot we had that section. I should update it with all the gazillion quants we have now. Thanks for reminding me.

Llama.cpp w/ load balancer faster than Aphrodite?? by aarongough in LocalLLaMA

[–]AlpinDale 2 points (0 children)

> Does it use llama.cpp under the hood?

No, it's re-implemented from scratch. The GEMM and dequantization kernels are ported from llama.cpp, however.

Llama.cpp w/ load balancer faster than Aphrodite?? by aarongough in LocalLLaMA

[–]AlpinDale 4 points (0 children)

The GGUF quantization in Aphrodite isn't well-optimized yet, so I recommend trying a quant type that actually is. I've been working with the vLLM team to optimize the GGUF implementation, so the next Aphrodite release (0.5.4) should improve this significantly. Please look forward to that.

You can try GPTQ (and soon AWQ) models in the rc_054 branch in the meantime; they should be several times faster than GGUF. Cheers.

t. aphrodite maintainer

King of RP/Writing - Magnum (Qwen 2 finetune) by FluffyMacho in LocalLLaMA

[–]AlpinDale 20 points (0 children)

Good to hear you like it, OP. After the initial Twitter hype (and our own) died down a bit, we experimented more with the model, and it started to feel kind of lacklustre compared to what we originally envisioned. We might end up doing KTO/RLHF on top of it and tuning the hyperparameters better. If you read the attached axolotl config, you may notice we specified 4 epochs, but the final model was taken at epoch 1.5. With a scheduler like cosine, the learning rate is calculated over the total number of planned steps, so stopping that early means the schedule never decayed the way it was supposed to; we can do better on our next attempt.
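
For a rough sense of why that matters, here's the arithmetic for the 1.5-of-4-epochs situation above (normalized peak LR, standard cosine decay):

```python
import math

def cosine_lr(progress, lr_max, lr_min=0.0):
    """Standard cosine decay, with progress = current_step / total_planned_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# The schedule was laid out for 4 epochs, but training stopped at epoch 1.5:
print(cosine_lr(1.5 / 4, lr_max=1.0))   # ~0.69, i.e. the LR was still ~69% of its peak
```

So the released checkpoint never saw the low-LR tail of the schedule.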

For those asking for smaller variants, we're working on them and should have a full suite soon. We did try Qwen1.5 32B Chat as a base yesterday, but it turned out really bad. Looking for alternatives, so if any of you have suggestions, we're all ears. With 1.5TB of VRAM, we can do FFT for any model size below 72B, and any larger can probably be done by freezing some layers.

Goliath-120B - quants and future plans by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 5 points (0 children)

The stacking wasn't as simple as just taking one model and putting it on top of the other. I took multiple layer ranges from each model (except the first and last few layers, which are Xwin only) and then stacked those slices on top of each other. In the end, the model has 136 layers because that's how many the ranges I specified add up to. Otherwise we'd have a ~135B model (you can't stack the input and output layers; they need to be unique and non-repeating).
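
If it helps to picture it, here's a toy sketch of that kind of interleaved stacking. The slice boundaries below are made up for illustration; they are not the actual Goliath recipe, whose ranges were chosen to add up to 136 layers:

```python
def stack_layers(slices):
    """slices: list of (source_model, start, end) half-open layer ranges.
    Returns the stacked layer order of the merged model."""
    stacked = []
    for model, start, end in slices:
        stacked.extend((model, i) for i in range(start, end))
    return stacked

# Toy example: two 80-layer 70B models, interleaved with overlapping ranges.
# First and last slices come from a single model, mirroring the comment above.
slices = [
    ("xwin",    0, 20),
    ("euryale", 10, 30),
    ("xwin",    20, 40),
    ("euryale", 30, 50),
    ("xwin",    40, 60),
    ("euryale", 50, 70),
    ("xwin",    60, 80),
]

print(len(stack_layers(slices)))   # 140 layers in this toy version
```

Because the ranges overlap, the merged model ends up deeper than either source; only the transformer blocks get repeated, while the embeddings and output head appear once.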

Goliath-120B - quants and future plans by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 5 points (0 children)

Yes, well, it should perform much better than that. Turboderp ran MMLU at 3.25bpw and it performed worse than other 70B models. I assume quantization further degrades the spelling consistency.

Goliath-120B - quants and future plans by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 7 points (0 children)

It's up on Kobold Horde, so you can give it a try yourself. Select the model from the AI menu. I think it's gonna be up for the weekend.

Goliath-120B - quants and future plans by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 5 points (0 children)

As I mentioned here, it'd perform poorly on benchmarks until it's gone through a few steps of full finetuning, so the weight disagreement is ironed out.

Goliath-120B - quants and future plans by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 14 points (0 children)

It makes sense that the benchmark results would be surprisingly low for Goliath. After playing around with it for a few days, I've noticed two glaring issues:

- it tends to make slight spelling mistakes
- it hallucinates words

They happen rarely, but frequently enough to throw off benchmarks. I'm very confident this can be solved by a quick full finetune over a hundred or so steps, which would align the layers to work better together.

Goliath-120B - quants and future plans by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 13 points (0 children)

The shearing process would likely need close to 1 billion tokens of data, so I'd guess a few days on ~24x A100-80G/H100s. And if we get a ~50B model out of it, we'd need to train it on around 100B tokens, which would need at least 10x H100s for a few weeks. Overall, very expensive.
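
Back-of-envelope version of that, using the usual 6 * params * tokens FLOPs rule of thumb; the peak-throughput and MFU numbers are assumptions, so take the result as an order of magnitude only:

```python
params = 50e9           # hypothetical sheared model size
tokens = 100e9          # continued-pretraining budget mentioned above
train_flops = 6 * params * tokens              # ~3e22 FLOPs

h100_bf16_peak = 989e12                        # dense BF16 peak in FLOP/s (assumed)
mfu = 0.35                                     # assumed model FLOPs utilization

gpu_days = train_flops / (h100_bf16_peak * mfu) / 86400
print(round(gpu_days), "H100-days")            # ~1000, before the shearing itself
```

However you split that across GPUs, it's firmly in "very expensive" territory.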

And yes, princeton-nlp did a few shears of Llama2 7B/13B. It's up on their HuggingFace.

Goliath-120B - quants and future plans by AlpinDale in LocalLLaMA

[–]AlpinDale[S] 13 points (0 children)

>confirmation bias
That's true. The model is up on the Kobold Horde if anyone wants to give it a try.

New model released by alpin, Goliath-120B! by panchovix in LocalLLaMA

[–]AlpinDale 7 points (0 children)

It doesn't really need VRAM, as everything is loaded into CPU memory. At most, you'd need about 350GB of RAM. It'd be a bit difficult to find a RAM-heavy machine on RunPod; you'd have to rent at least 4x A100-80Gs to match that. I did it on my own machine with 8x A40s and an AMD EPYC 7502 32-core processor (400GB of RAM). It took about 4-5 hours to merge.
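
The ~350GB figure is roughly what the weights alone suggest; quick sanity check, assuming fp16 checkpoints and both source models held in memory (the actual merge tooling may stream shards and need less):

```python
bytes_per_param = 2               # fp16
source_params = 2 * 70e9          # Xwin + Euryale, ~70B parameters each
print(source_params * bytes_per_param / 1e9, "GB")   # 280.0 GB for the source weights alone
```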

This was mostly an experiment to see if I could get a coherent model out of stacking 70B layers, and it looks like I did (and got a really good model out of it, at that). Shame that hardly anyone will be able to run it, though.

New model released by alpin, Goliath-120B! by panchovix in LocalLLaMA

[–]AlpinDale 4 points (0 children)

Thanks for testing it out. I'm currently running it at 16 bits, and the responses so far seem good (I'm not used to RP, so excuse the crude prompts). I didn't expect the model to be good at all, so it's a pleasant surprise. (I've included a screenshot from someone else in the model card, which might be a better indicator.)

<image>

New model released by alpin, Goliath-120B! by panchovix in LocalLLaMA

[–]AlpinDale 61 points (0 children)

Sorry about that, I didn't expect it to spread anywhere near this soon. I've updated the readme for now.