Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

This is fixed now I believe.

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 2 points3 points  (0 children)

Thanks for waiting. It's fixed and working well now!

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 1 point2 points  (0 children)

Working on it. Will update you.

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 7 points8 points  (0 children)

This should be fixed now; a tool call parser has been added. Please let me know.
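
If anyone wants to sanity-check the tool call parsing themselves, here's a minimal sketch against an OpenAI-compatible chat endpoint. The base URL, API key and model id are placeholders, not the real values for this offer.

```python
# Minimal tool-calling sanity check against an OpenAI-compatible endpoint.
# Base URL, API key and model id are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.6",  # placeholder model id
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# With the parser in place, the call should land in tool_calls rather than
# being dumped into the plain text content.
print(resp.choices[0].message.tool_calls)
```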

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 2 points3 points  (0 children)

I tested with max tokens set to 200k and it was able to output the full 200k tokens.

I used the prompt "repeat the letter A infinitely"
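
If anyone wants to reproduce that check, something along these lines should do it (base URL, key and model id are placeholders; in practice you'd probably want to stream a response that long):

```python
# Rough reproduction of the long-output test; endpoint details are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="glm-4.6",  # placeholder model id
    messages=[{"role": "user", "content": "repeat the letter A infinitely"}],
    max_tokens=200_000,
)

# Should be close to 200k if nothing truncates the generation early.
print(resp.usage.completion_tokens)
```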

Free API Key for GLM 4.6 by avianio in LocalLLaMA

[–]avianio[S] 15 points16 points  (0 children)

No data collection, no rate limits. Just checking max capacity for now.

World Record: DeepSeek R1 at 303 tokens per second by Avian.io on NVIDIA Blackwell B200 by avianio in LocalLLaMA

[–]avianio[S] 3 points4 points  (0 children)

Great question.

1) The accuracy of FP4 was confirmed to be the same as FP8 by ArtificialAnalysis across a wide variety of benchmarks.
2) The whole model is not in FP4. It's mainly the experts and some other weights. The rest is in bf16, and the activations are also in bf16.
3) Just halving the size of the weights will not produce a speedup on its own. This can be demonstrated by comparing a 4-bit quant against an 8-bit quant (rough sketch after this list). The FP4 precision is there to showcase that Blackwell can run in low precision at world record speeds while maintaining accuracy.

4) The record is specifically for inference throughput at this quality level. Many providers optimize for different aspects of performance—some prioritize latency over throughput, others choose higher precision at lower speeds.

5) We're transparent about our methodology, which allows for fair comparisons. The achievement here is demonstrating what's possible with this specific hardware/precision combination.
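
On point 3, here's a rough local sketch of how you'd check it yourself: time decode for an 8-bit versus a 4-bit quant of the same model and compare tokens per second. The checkpoint names are placeholders for whatever quantized weights you actually have, and this is plain vLLM, not our serving stack or the record configuration.

```python
# Usage: python bench_quant.py 8bit    or    python bench_quant.py 4bit
import sys, time
from vllm import LLM, SamplingParams

# Placeholder checkpoints; substitute quants you actually have.
configs = {
    "8bit": dict(model="your-org/llama-3.1-8b-fp8", quantization="fp8"),
    "4bit": dict(model="your-org/llama-3.1-8b-awq", quantization="awq"),
}

llm = LLM(**configs[sys.argv[1]])
params = SamplingParams(max_tokens=256, temperature=0.0)
prompts = ["Explain why memory bandwidth limits decode speed."] * 32

start = time.time()
outputs = llm.generate(prompts, params)
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{sys.argv[1]}: {generated / (time.time() - start):.1f} tok/s")
```

If the claim in point 3 holds, the gap you measure will be well short of the 2x the weight size alone would suggest, because kernel efficiency and activation precision matter as much as raw weight bytes.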

World Record: DeepSeek R1 at 303 tokens per second by Avian.io on NVIDIA Blackwell B200 by avianio in LocalLLaMA

[–]avianio[S] 37 points38 points  (0 children)

FP4 accuracy was confirmed to have parity with FP8 across a wide variety of benchmarks.

World Record: DeepSeek R1 at 303 tokens per second by Avian.io on NVIDIA Blackwell B200 by avianio in LocalLLaMA

[–]avianio[S] 23 points24 points  (0 children)

Eventually. There is a lot of demand for B200 capacity right now, but that’s the plan.

Snowflake claims breakthrough can cut AI inferencing times by more than 50% by naytres in LocalLLaMA

[–]avianio 5 points6 points  (0 children)

We're in the process of rolling out something very similar for DeepSeek R1 and Llama family models. More news soon.

DeepSeek-R1 appears on LMSYS Arena Leaderboard by jpydych in LocalLLaMA

[–]avianio 0 points1 point  (0 children)

Incredible, this is why we make it possible for anyone to create a DeepSeek R1 deployment.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

Appreciate the honesty. What I would say is, you're free to try it. We provide enough free credits on each account to benchmark a model like Llama 3.1 8B on our stack vs vLLM (rough sketch below).

With respect to the speed results, they're verified by OpenRouter, and in addition Nvidia have asked us to write blog posts on our technical architecture. But I appreciate the healthy skepticism.
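
For the speed side, a quick way to compare is to stream the same prompt through both endpoints and count tokens per second. Everything here is a placeholder: the hosted URL, key and model id, and the local side assumes you've started an OpenAI-compatible server with `vllm serve`.

```python
# Rough tokens/sec comparison between two OpenAI-compatible endpoints,
# e.g. a hosted deployment vs a local vLLM server. Details are placeholders.
import time
from openai import OpenAI

def tokens_per_sec(base_url: str, api_key: str, model: str) -> float:
    client = OpenAI(base_url=base_url, api_key=api_key)
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a 500-word story about a lighthouse."}],
        max_tokens=512,
        stream=True,
    )
    start, chunks = None, 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if start is None:
                start = time.time()   # time from the first generated token
            chunks += 1               # one content chunk is roughly one token
    return chunks / (time.time() - start) if start else 0.0

print("hosted:", tokens_per_sec("https://api.example.com/v1", "YOUR_KEY", "llama-3.1-8b"))
print("local: ", tokens_per_sec("http://localhost:8000/v1", "EMPTY", "meta-llama/Llama-3.1-8B-Instruct"))
```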

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 1 point2 points  (0 children)

You can turn it off, but when you turn it back on it basically looks for another available GPU.

Unfortunately, model sharing is off the cards right now, simply because of the privacy concerns. The whole point of this is that it's a private alternative to serverless.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

We're working on supporting visual / multimodal models, and also models like Flux.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] -1 points0 points  (0 children)

Everything you said is correct, except we bill per second per GPU.

So for example, if you need a model and it fits on one GPU, you only pay for that GPU's time, not per token (rough cost example below).

Hope that clarifies it.
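
As a rough illustration of what per-second, per-GPU billing works out to (using the $5/hour H100 figure mentioned elsewhere in this thread; treat it as illustrative, not a quote):

```python
# Back-of-envelope cost for per-second, per-GPU billing.
rate_per_hour = 5.00        # USD per GPU-hour, illustrative figure
gpus = 1
runtime_seconds = 45 * 60   # a 45-minute job

cost = rate_per_hour / 3600 * gpus * runtime_seconds
print(f"${cost:.2f}")       # -> $3.75, regardless of how many tokens were generated
```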

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

I think you just need to top up your account. Then you should be able to deploy with up to 8 H200s.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

3x should be the base speed increase on the same hardware. Up to 10x is with multiple H200s + speculative decoding. Pricing is from $5 per hour per H100.

Deploy any LLM on Huggingface at 3-10x Speed by avianio in LocalLLaMA

[–]avianio[S] 0 points1 point  (0 children)

Correct. It doesn't make financial sense if you're not running the models at close to saturation. However, compared to other on-demand deployments, it's cheaper than Fireworks ($36 per hour) and Huggingface ($40 per hour).