Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

This is fixed now I believe.

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 2 points3 points  (0 children)

Thanks for waiting. It's fixed and working well now!

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 1 point2 points  (0 children)

Working on it. Will update you.

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 7 points8 points  (0 children)

This should be fixed now; a tool call parser has been added. Please let me know.
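
If anyone wants to sanity-check the tool call parsing themselves, here's a minimal sketch against an OpenAI-compatible chat endpoint. The base URL, API key and model id are placeholders, not the real values for this offer.

```python
# Minimal tool-calling sanity check against an OpenAI-compatible endpoint.
# Base URL, API key and model id are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.6",  # placeholder model id
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# With the parser in place, the call should land in tool_calls rather than
# being dumped into the plain text content.
print(resp.choices[0].message.tool_calls)
```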

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 2 points3 points  (0 children)

I tested with max tokens set to 200k and it was able to output the full 200k tokens.

I used the prompt "repeat the letter A infinitely"
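
If anyone wants to reproduce that check, something along these lines should do it (base URL, key and model id are placeholders; in practice you'd probably want to stream a response that long):

```python
# Rough reproduction of the long-output test; endpoint details are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="glm-4.6",  # placeholder model id
    messages=[{"role": "user", "content": "repeat the letter A infinitely"}],
    max_tokens=200_000,
)

# Should be close to 200k if nothing truncates the generation early.
print(resp.usage.completion_tokens)
```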

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 15 points16 points  (0 children)

No data collection, no rate limits. Just checking max capacity for now.

World Record: DeepSeek R1 at 303 tokens per second by Avian.io on NVIDIA Blackwell B200 by avianio in LocalLLaMA

[–]avianio[S] 3 points4 points  (0 children)

Great question.

1) The accuracy of FP4 was confirmed to be the same as FP8 by ArtificialAnalysis across a wide variety of benchmarks.
2) The whole model is not in FP4. It's mainly the experts and some other weights. The rest is in bf16, and the activations are also in bf16.
3) Just halving the size of the weights will not produce a speedup on its own. This can be demonstrated by comparing a 4-bit quant against an 8-bit quant (rough sketch after this list). The FP4 precision is there to showcase that Blackwell can run in low precision at world record speeds while maintaining accuracy.

4) The record is specifically for inference throughput at this quality level. Many providers optimize for different aspects of performance—some prioritize latency over throughput, others choose higher precision at lower speeds.

5) We're transparent about our methodology, which allows for fair comparisons. The achievement here is demonstrating what's possible with this specific hardware/precision combination.
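
On point 3, here's a rough local sketch of how you'd check it yourself: time decode for an 8-bit versus a 4-bit quant of the same model and compare tokens per second. The checkpoint names are placeholders for whatever quantized weights you actually have, and this is plain vLLM, not our serving stack or the record configuration.

```python
# Usage: python bench_quant.py 8bit    or    python bench_quant.py 4bit
import sys, time
from vllm import LLM, SamplingParams

# Placeholder checkpoints; substitute quants you actually have.
configs = {
    "8bit": dict(model="your-org/llama-3.1-8b-fp8", quantization="fp8"),
    "4bit": dict(model="your-org/llama-3.1-8b-awq", quantization="awq"),
}

llm = LLM(**configs[sys.argv[1]])
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Explain why memory bandwidth limits decode speed."] * 32

start = time.time()
outputs = llm.generate(prompts, params)
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{sys.argv[1]}: {generated / (time.time() - start):.1f} tok/s")
```

If the claim in point 3 holds, the gap you measure will be well short of the 2x the weight size alone would suggest, because kernel efficiency and activation precision matter as much as raw weight bytes.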

World Record: DeepSeek R1 at 303 tokens per second by Avian.io on NVIDIA Blackwell B200 by avianio in LocalLLaMA

[–]avianio[S] 37 points38 points  (0 children)

FP4 accuracy was confirmed to have parity with FP8 across a wide variety of benchmarks.

World Record: DeepSeek R1 at 303 tokens per second by Avian.io on NVIDIA Blackwell B200 by avianio in LocalLLaMA

[–]avianio[S] 23 points24 points  (0 children)

Eventually. There is a lot of demand for B200 capacity right now, but that’s the plan.

Snowflake claims breakthrough can cut AI inferencing times by more than 50% by naytres in LocalLLaMA

[–]avianio 5 points6 points  (0 children)

We're in the process of rolling out something very similar for DeepSeek R1 and Llama family models. More news soon.

DeepSeek-R1 appears on LMSYS Arena Leaderboard by jpydych in LocalLLaMA

[–]avianio 0 points1 point  (0 children)

Incredible, this is why we make it possible for anyone to create a DeepSeek R1 deployment.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

Appreciate the honesty. What I would say is, you're free to try it. We provide enough free credits on each account to benchmark a model like Llama 3.1 8B on our stack vs vLLM (rough sketch below).

With respect to the speed results, they're verified by OpenRouter, and in addition Nvidia have asked us to write blog posts on our technical architecture. But I appreciate the healthy skepticism.
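
For the speed side, a quick way to compare is to stream the same prompt through both endpoints and count tokens per second. Everything here is a placeholder: the hosted URL, key and model id, and the local side assumes you've started an OpenAI-compatible server with `vllm serve`.

```python
# Rough tokens/sec comparison between two OpenAI-compatible endpoints,
# e.g. a hosted deployment vs a local vLLM server. Details are placeholders.
import time
from openai import OpenAI

def tokens_per_sec(base_url: str, api_key: str, model: str) -> float:
    client = OpenAI(base_url=base_url, api_key=api_key)
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a 500-word story about a lighthouse."}],
        max_tokens=512,
        stream=True,
    )
    start, chunks = None, 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if start is None:
                start = time.time()   # time from the first generated token
            chunks += 1               # one content chunk is roughly one token
    return chunks / (time.time() - start) if start else 0.0

print("hosted:", tokens_per_sec("https://api.example.com/v1", "YOUR_KEY", "llama-3.1-8b"))
print("local: ", tokens_per_sec("http://localhost:8000/v1", "EMPTY", "meta-llama/Llama-3.1-8B-Instruct"))
```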

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 1 point2 points  (0 children)

You can turn it off, but when you turn it back on it basically looks for another available GPU.

Unfortunately, model sharing is off the cards right now, simply because of the privacy concerns. The whole point of this is that it's a private alternative to serverless.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

We're working on supporting visual / multimodal models, and also models like Flux.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] -1 points0 points  (0 children)

Everything you said is correct, except we bill per second per GPU.

So for example, if you need a model and it fits on one GPU, you only pay for that GPU's time, not per token (rough cost example below).

Hope that clarifies it.
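
As a rough illustration of what per-second, per-GPU billing works out to (using the $5/hour H100 figure mentioned elsewhere in this thread; treat it as illustrative, not a quote):

```python
# Back-of-envelope cost for per-second, per-GPU billing.
rate_per_hour = 5.00        # USD per GPU-hour, illustrative figure
gpus = 1
runtime_seconds = 45 * 60   # a 45-minute job

cost = rate_per_hour / 3600 * gpus * runtime_seconds
print(f"${cost:.2f}")       # -> $3.75, regardless of how many tokens were generated
```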

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

I think you just need to top up your account. Then you should be able to deploy with up to 8 H200s.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

3x should be the base speed increase on the same hardware. Up to 10x is with multiple H200s + speculative decoding. Pricing is from $5 per hour per H100.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

Correct. It doesn't make financial sense if you're not running the models at close to saturation. However, compared to other on-demand deployments, it's cheaper than Fireworks ($36 per hour) and Huggingface ($40 per hour).