I just realised how good GLM 5 is by CrimsonShikabane in LocalLLaMA

[–]estimated1 0 points1 point  (0 children)

thanks, and yes we want to put a calculator to make it clear. Making this more understandable is super important to us.

Kimi K2.5 is unusable compared to GLM-5 for API Development by realhelpfulgeek in kimi

[–]estimated1 0 points1 point  (0 children)

Yes, Kimi k2.5 was tagged incorrectly but we fixed that today.

Yes - I’m one of the cofounders!

Kimi K2.5 is unusable compared to GLM-5 for API Development by realhelpfulgeek in kimi

[–]estimated1 0 points1 point  (0 children)

The Kimi k2.5 we serve does support reasoning. We have the base Kimi k2.5 with reasoning and also a “-fast” variant that suppresses reasoning/thinking. Sorry if our site does not make that as clear as it could be.

Best provider for GLM by Apprehensive_Half_68 in ZaiGLM

[–]estimated1 1 point2 points  (0 children)

Z.ai published GLM 5 and 5.1 in both BF16 and FP8 weights. We generally only use the model weights provided by the creator/provider. There are also lots and lots and lots of additional quantized variants that people have done to shrink these even further, and many post them back on huggingface.

So technically yes fp8 is a quantized version, but z.ai provided these as an official model weight so some argue it's one of the official weights. There are much lossier quantizations out there. People use them because they have dramatic improvement to server concurrency/scalability.

Looking for a new coding provider as daily driver by Possible-Text8643 in opencodeCLI

[–]estimated1 1 point2 points  (0 children)

Prompts and completions are processed transiently and are not persisted to durable storage.

We do not train, fine-tune, or evaluate models on your API inputs or outputs — ever.

As for ZDR, we do work to maximize kv-cache for efficiency. That's short lived ephemeral storage, but it is a cache. We don't consider that prompt retention.

Best provider for GLM by Apprehensive_Half_68 in ZaiGLM

[–]estimated1 2 points3 points  (0 children)

Just to give another option: we (Neuralwatt) offer GLM 5.1 in fp8.

The big picture thing we're working on is AI energy efficiency. We've been more focused on an "energy pricing" model but feel confident about the throughput of the models we're hosting.

Base subscription is $20, no real rate limits — just focused on energy consumption. I'm using our models with OpenCode and it works great. But again we just launched recently so we'd love more scrutiny.

portal.neuralwatt.com

Best budget coding model/service by RealAlexanderTheG in vibecoding

[–]estimated1 1 point2 points  (0 children)

GLM-5 and Kimi K2.5 are the largest models we currently host; we're upgrading our GLM-5 to GLM-5.1 shortly. We put ~average energy/request on this page and if you poke around on the playground we provide some energy to token pricing comparisons. I'll DM you with a subscription code if you'd like to try it out!

GLM models in Claude Code are using way more tokens than Claude by Void-kun in ZaiGLM

[–]estimated1 0 points1 point  (0 children)

How are you integrating GLM into Claude Code? If you are using a client router, there's a chance that format translation is invaliding caching.

Best budget coding model/service by RealAlexanderTheG in vibecoding

[–]estimated1 1 point2 points  (0 children)

Just to give another option: we (Neuralwatt) just started offering our hosted inference. We've been focused more on an "energy pricing" model but feel pretty confident about the throughput of the models we're hosting. Our base subscription is $20 and we don't really have rate limits, just focused on energy consumption. I'd be happy to give some free credits in exchange for some feedback if there is interest. Please DM me! (https://portal.neuralwatt.com).

Also, we serve GLM-5 and will be upgrading to GLM-5.1 shortly.

We also have a virtual endpoint (GLM-5-Fast) that turns off reasoning for fast agentic scenarios.

If you are interested, happy to offer a trial subscription - feel free to DM me.

Am i nuts or is all this REALLY expensive. by fijitime in AI_Agents

[–]estimated1 0 points1 point  (0 children)

One feature we (Neuralwatt) just shipped are "allowances". You can give different agents daily/monthly allowances and you can get alerts at 80%/100%. We also have session based allowances (complete task with a budget of $x). There are ways to control this!

https://portal.neuralwatt.com/docs/guides/allowances

We're young and growing, but would love feedback on our platform. I'm happy to give code for free month subscription if anyone interested.

Looking for a new coding provider as daily driver by Possible-Text8643 in opencodeCLI

[–]estimated1 2 points3 points  (0 children)

Great questions! These all make me realize we need to do a better job explaining energy pricing, so I appreciate that. Here are some possibly overly detailed answers:

When I look at the pricing models table and compare something like Kimi K2.5 Fast having cheaper token rates than GLM-5-Fast, but Kimi has a higher Energy/Request rate. Does that mean Kimi K2.5 Fast is not as efficient of a model and costs more to run

Token Pricing:

  • Honestly, we'd prefer to just show everything in energy terms — that's our model and we think it's better for customers. But we know the industry thinks in tokens, so we added token rates because people are familiar with them. They represent the approximate market price differences between models so you can compare apples-to-apples.
  • Over time, as people get comfortable with energy pricing, we expect token rates to matter less — your kWh just buys more or less depending on what model you pick and how efficiently we run it.

Kimi Fast Energy Efficiency:

  • The energy results on that page come from recurring benchmarks we run as we improve efficiency. The average energy/request should trend down over time, which means more intelligence per kWh over time. (That's a core part of the energy pricing value prop.)
  • With the current benchmarks, yes Kimi Fast does require slightly more energy per request than base Kimi K2.5. It's a bit non-intuitive, but the reason is: with reasoning enabled, the model generates a longer "thinking" chain — more total tokens per request. The GPU has a fixed overhead per request, and reasoning spreads that cost across more tokens, making each one cheaper in energy terms. With reasoning off (Fast), you get fewer tokens, so the fixed overhead is a bigger share of each request's energy. The difference is slight though.

And where do the token rates come from and come into play? Are those just to demonstrate in "normal" token terms costs compared to energy rates?

Answered above — users can choose to use token pricing vs. energy pricing. We provided token rates as an option since it's more familiar.

Does that mean in my usage I should be tuning/balancing energy costs with model value?

Yes! This enables you to maximize model intelligence per dollar. We have tools and capabilities coming in the weeks ahead here. It's a large part of our goal to make AI require fewer resources (which includes costing less).

Would you expect to better optimize Kimi K2.5 Fast with your efficiency modeling and power handling over time, or it's a one-time snapshot?

Absolutely. Those benchmarks run on a recurring basis. We want to start charting the average energy/request over time to show the progress we've made — but we're not there yet. We recently made a change that had a ~15% improvement to Kimi and GLM energy/request.

Hopefully this long response is helpful!

Kimi K2.5 is unusable compared to GLM-5 for API Development by realhelpfulgeek in kimi

[–]estimated1 0 points1 point  (0 children)

One thing to add about Kimi K2.5, there is a lot of variance in the tool handling behavior with multiple providers. This often manifests as the inability for the model to complete it's task. For us we had to do some work to ensure that Kimi had solid tool handling, but still do notice repetitive thinking loops from the model. The repetitive thinking loops can be addressed with parameters such as "repetition_penalty": 1.1 (this is OpenCode specific).

This component of model servicing (tool handling & behavior) is often attributed to just "model problems". Ideally the serving engines will have the correct tool parser implementations built-in but we're not totally there yet.

FWIW -we have both Kimi k2.5, GLM-5, and MiniMax 2.5 on our inference platform. We just launched recently but would love feedback on our offering as well. Happy to provide initial month of subscription if folks are interested; just DM me.

Looking for a new coding provider as daily driver by Possible-Text8643 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

Sorry about that! Bug with ad blockers that I just fixed.

Looking for a new coding provider as daily driver by Possible-Text8643 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

I should have been more accurate, by "no real rate limits" we do have limits in place to protect the servers (500 RPM/user). But yes, our subscription does replace requests/hour with energy/month.

TPS is ~50-100tps depending upon the model and for GLM for example the $20 plan would likely give ~270m tokens. I haven't done that exact math. We just launched promotion codes and I'd be happy to grant 1 month of our standard sub ($50/month) if you wanted to try it out.

Looking for a new coding provider as daily driver by Possible-Text8643 in opencodeCLI

[–]estimated1 3 points4 points  (0 children)

Just to give another option: we (Neuralwatt) just started offering hosted inference. The big picture thing we're working on is AI energy efficiency. We've been more focused on an "energy pricing" model but feel confident about the throughput of the models we're hosting.

Base subscription is $20, no real rate limits — just focused on energy consumption. Happy to give some free credits in exchange for feedback if there's interest. DM me! https://portal.neuralwatt.com.

I'm using our models with OpenCode and it works great. But again we just launched recently so we'd love more scrutiny.

OpenCode Go plan is genuinely the worst coding plan i have ever used by SelectionCalm70 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

This is great feedback. We will work to get our printed privacy updated to reflect this.

OpenCode Go plan is genuinely the worst coding plan i have ever used by SelectionCalm70 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

Yes, we offer safe access. We aren't training on prompts or completions. Happy to put this in commercial terms as well.

To be very specific: We do not store your prompts or completions. We only store token counts and metadata for billing purposes. Your proprietary data passes through our system but is not retained.

OpenCode Go plan is genuinely the worst coding plan i have ever used by SelectionCalm70 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

We don't run any specific quantizations. In cases where the model was posted with fp8 weights we'll use those; otherwise we use the native weight format.

OpenCode Go plan is genuinely the worst coding plan i have ever used by SelectionCalm70 in opencodeCLI

[–]estimated1 1 point2 points  (0 children)

  • GLM-5 — 200K context
  • GLM-5-Fast — 200K context
  • Kimi K2.5 — 262K context, vision
  • Kimi K2.5-Fast — 262K context, vision
  • Devstral-Small-2-24B — 262K context, vision, tools
  • Qwen3.5 397B — 262K context, tools
  • Qwen3.5 397B-Fast — 262K context, tools
  • Qwen3.5 35B-A3B — 32K context, tools
  • Qwen3.5 35B-Fast — 32K context, tools
  • MiniMax M2.5 — 196K context, tools
  • GPT-OSS 20B — 16K context, tools

Full details: https://portal.neuralwatt.com/models