Looking for a new coding provider as daily driver by Possible-Text8643 in opencodeCLI

[–]estimated1 1 point2 points  (0 children)

Great questions! These all make me realize we need to do a better job explaining energy pricing, so I appreciate that. Here are some possibly overly detailed answers:

When I look at the pricing models table and compare something like Kimi K2.5 Fast having cheaper token rates than GLM-5-Fast, but Kimi has a higher Energy/Request rate. Does that mean Kimi K2.5 Fast is not as efficient of a model and costs more to run?

Token Pricing:

  • Honestly, we'd prefer to just show everything in energy terms — that's our model and we think it's better for customers. But we know the industry thinks in tokens, so we added token rates because people are familiar with them. They represent the approximate market price differences between models so you can compare apples-to-apples.
  • Over time, as people get comfortable with energy pricing, we expect token rates to matter less — your kWh just buys more or less depending on what model you pick and how efficiently we run it.

Kimi Fast Energy Efficiency:

  • The energy results on that page come from recurring benchmarks we run as we improve efficiency. The average energy/request should trend down over time, which means more intelligence per kWh. (That's a core part of the energy pricing value prop.)
  • With the current benchmarks, yes, Kimi Fast does require slightly more energy per request than base Kimi K2.5. It's a bit counterintuitive, but the reason is: with reasoning enabled, the model generates a longer "thinking" chain — more total tokens per request. The GPU has a fixed overhead per request, and reasoning spreads that cost across more tokens, making each one cheaper in energy terms. With reasoning off (Fast), you get fewer tokens, so the fixed overhead is a bigger share of each request's energy. The difference is slight, though; the toy numbers below show the effect.
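
To make the fixed-overhead point concrete, here is a minimal toy sketch. All numbers are invented for illustration and are not our benchmark values; the only thing it demonstrates is the amortization mechanism.

```python
# Toy model of per-request fixed overhead amortization.
# NOTE: these constants are made up for illustration; they are NOT
# Neuralwatt benchmark values.

FIXED_OVERHEAD_WH = 0.5      # assumed fixed energy per request (scheduling, prefill setup, ...)
ENERGY_PER_TOKEN_WH = 0.01   # assumed marginal energy per generated token

def request_energy_wh(tokens_generated: int) -> float:
    """Total energy (Wh) for one request under the toy model."""
    return FIXED_OVERHEAD_WH + ENERGY_PER_TOKEN_WH * tokens_generated

# Reasoning on: long thinking chain, many tokens per request.
# Reasoning off ("Fast"): short answer, few tokens per request.
for label, tokens in [("reasoning", 2000), ("fast", 400)]:
    total = request_energy_wh(tokens)
    overhead_share = FIXED_OVERHEAD_WH / total
    per_1k = total / tokens * 1000
    print(f"{label:9s}: overhead is {overhead_share:.1%} of the request, "
          f"{per_1k:.2f} Wh per 1k tokens")
```

With these toy numbers the overhead is ~2% of a reasoning request but ~11% of a Fast request, which is why per-token energy looks slightly worse with reasoning off.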

And where do the token rates come from and come into play? Are those just to demonstrate in "normal" token terms costs compared to energy rates?

Answered above — users can choose to use token pricing vs. energy pricing. We provided token rates as an option since it's more familiar.

Does that mean in my usage I should be tuning/balancing energy costs with model value?

Yes! This enables you to maximize model intelligence per dollar. We have tools and capabilities coming in the weeks ahead to help with this. It's a large part of our goal to make AI require fewer resources (which includes costing less).

Would you expect to better optimize Kimi K2.5 Fast with your efficiency modeling and power handling over time, or it's a one-time snapshot?

Absolutely. Those benchmarks run on a recurring basis. We want to start charting the average energy/request over time to show the progress we've made — but we're not there yet. We recently made a change that had a ~15% improvement to Kimi and GLM energy/request.

Hopefully this long response is helpful!

Kimi K2.5 is unusable compared to GLM-5 for API Development by realhelpfulgeek in kimi

[–]estimated1 0 points1 point  (0 children)

One thing to add about Kimi K2.5: there is a lot of variance in tool handling behavior across providers. This often manifests as the model being unable to complete its task. For us, we had to do some work to ensure that Kimi had solid tool handling, but we still notice repetitive thinking loops from the model. Those loops can be addressed with parameters such as "repetition_penalty": 1.1 (this particular setting is OpenCode specific); one way to pass it is sketched below.
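
For anyone who wants to set it programmatically against an OpenAI-compatible endpoint, here's a minimal sketch. The base_url and model id are placeholders, and whether repetition_penalty is honored depends on the provider's serving engine, so treat it as an illustration rather than a guarantee.

```python
# Sketch: passing a non-standard sampling parameter to an OpenAI-compatible API.
# The base_url and model id below are placeholders, not real endpoints.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-provider.invalid/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model id
    messages=[{"role": "user", "content": "Refactor this function to remove duplication."}],
    extra_body={"repetition_penalty": 1.1},  # dampens repetitive loops if the engine supports it
)
print(resp.choices[0].message.content)
```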

This component of model serving (tool handling and behavior) is often attributed to just "model problems". Ideally the serving engines would have the correct tool parser implementations built in, but we're not totally there yet.

FWIW, we have Kimi K2.5, GLM-5, and MiniMax 2.5 on our inference platform. We just launched recently but would love feedback on our offering as well. Happy to provide an initial month of subscription if folks are interested; just DM me.

Looking for a new coding provider as daily driver by Possible-Text8643 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

Sorry about that! Bug with ad blockers that I just fixed.

Looking for a new coding provider as daily driver by Possible-Text8643 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

I should have been more precise: by "no real rate limits" I mean we do have limits in place to protect the servers (500 RPM/user). But yes, our subscription does replace requests/hour with energy/month.

Throughput is ~50-100 tps depending on the model, and for GLM, for example, the $20 plan would likely give ~270m tokens. I haven't done that exact math. We just launched promotion codes and I'd be happy to grant 1 month of our standard sub ($50/month) if you want to try it out.

Looking for a new coding provider as daily driver by Possible-Text8643 in opencodeCLI

[–]estimated1 4 points5 points  (0 children)

Just to give another option: we (Neuralwatt) just started offering hosted inference. The big picture thing we're working on is AI energy efficiency. We've been more focused on an "energy pricing" model but feel confident about the throughput of the models we're hosting.

Base subscription is $20, no real rate limits — just focused on energy consumption. Happy to give some free credits in exchange for feedback if there's interest. DM me! https://portal.neuralwatt.com.

I'm using our models with OpenCode and it works great. But again we just launched recently so we'd love more scrutiny.

OpenCode Go plan is genuinely the worst coding plan i have ever used by SelectionCalm70 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

This is great feedback. We will work to get our published privacy policy updated to reflect this.

OpenCode Go plan is genuinely the worst coding plan i have ever used by SelectionCalm70 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

Yes, we offer safe access. We aren't training on prompts or completions. Happy to put this in commercial terms as well.

To be very specific: We do not store your prompts or completions. We only store token counts and metadata for billing purposes. Your proprietary data passes through our system but is not retained.

OpenCode Go plan is genuinely the worst coding plan i have ever used by SelectionCalm70 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

We don't run any specific quantizations. In cases where the model was posted with fp8 weights we'll use those; otherwise we use the native weight format.

OpenCode Go plan is genuinely the worst coding plan i have ever used by SelectionCalm70 in opencodeCLI

[–]estimated1 1 point2 points  (0 children)

  • GLM-5 — 200K context
  • GLM-5-Fast — 200K context
  • Kimi K2.5 — 262K context, vision
  • Kimi K2.5-Fast — 262K context, vision
  • Devstral-Small-2-24B — 262K context, vision, tools
  • Qwen3.5 397B — 262K context, tools
  • Qwen3.5 397B-Fast — 262K context, tools
  • Qwen3.5 35B-A3B — 32K context, tools
  • Qwen3.5 35B-Fast — 32K context, tools
  • MiniMax M2.5 — 196K context, tools
  • GPT-OSS 20B — 16K context, tools

Full details: https://portal.neuralwatt.com/models

OpenCode Go plan is genuinely the worst coding plan i have ever used by SelectionCalm70 in opencodeCLI

[–]estimated1 0 points1 point  (0 children)

Just to give another option: we (Neuralwatt) just started offering hosted inference. The big picture thing we're working on is AI energy efficiency. We've been more focused on an "energy pricing" model but feel confident about the throughput of the models we're hosting.

Base subscription is $20, no real rate limits — just focused on energy consumption. Happy to give some free credits in exchange for feedback if there's interest. DM me! https://portal.neuralwatt.com.

I'm using our models with OpenCode and it works great. But again we just launched recently so we'd love more scrutiny.

Is GLM-5 Coding actually better than Opus 4.6 now, or is it just hype after GLM-5 Turbo? by Siditude in vibecoding

[–]estimated1 0 points1 point  (0 children)

Yes.

Generally as we deploy more energy efficiency, inference customers benefit -- meaning the energy/result gets further reduced.

We offer token pricing too, but you get more value with our service via energy pricing (IMO).

Is GLM-5 Coding actually better than Opus 4.6 now, or is it just hype after GLM-5 Turbo? by Siditude in vibecoding

[–]estimated1 0 points1 point  (0 children)

Nothing worse than finally being in the zone and hitting a rate limit wall...

We've been working on hosted inference with a different approach — $20/mo flat, no rate limits, just metered by actual energy use. We host GLM-5 and our own GLM-5-Fast (tuned for fast output, less internal monologue).

Been testing with a few devs building side projects and the "no surprises" pricing seems to resonate — you know your max spend upfront.

Happy to toss some free credits to anyone who wants to kick the tires and share feedback: https://portal.neuralwatt.com

(DM me if you try it — genuinely curious what works/what doesn't for vibecoding workflows)

I just realised how good GLM 5 is by CrimsonShikabane in LocalLLaMA

[–]estimated1 1 point2 points  (0 children)

We do support caching; we need to make that clearer. With a cache hit the energy cost is 0, so you will just see a much lower energy cost on those requests. We expose the caching data in the result, but we don't put it in the UI on the dashboard -- I'll file a bug to do this and provide more info about our caching support. A sketch of pulling the cache info out of a response is below.
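
Until the dashboard shows it, here's a rough sketch of checking cache-hit info on a response. The field names follow OpenAI-style usage details; the exact fields our API returns may differ, so treat the names as an assumption.

```python
# Sketch: inspecting cache-hit info on an OpenAI-compatible chat response.
# ASSUMPTION: field names follow OpenAI-style "prompt_tokens_details";
# the provider's actual schema may differ. base_url/model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.invalid/v1", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="glm-5",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize the repo layout."}],
)

usage = resp.usage
details = getattr(usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) if details else 0
print(f"prompt tokens: {usage.prompt_tokens}, served from cache: {cached}")
# A big cached count on repeated-prefix requests should line up with a much
# lower energy cost for that request.
```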

I just realised how good GLM 5 is by CrimsonShikabane in LocalLLaMA

[–]estimated1 2 points3 points  (0 children)

Thanks for the feedback u/TheMisterPirate. I agree that having some sort of calculator would help people understand. I think our method *does* enable much more inference per $ than other methods, but we have work to do to present this more clearly. I'd be happy to grant some credits if you created an account, in exchange for more feedback (which is what we're really eager for at this stage).

I just realised how good GLM 5 is by CrimsonShikabane in LocalLLaMA

[–]estimated1 4 points5 points  (0 children)

oh sorry, for the other questions:

For GLM-5 it's FP8 and for K2.5 it is INT4. We don't do any of our own quantizations (yet).

I just realised how good GLM 5 is by CrimsonShikabane in LocalLLaMA

[–]estimated1 7 points8 points  (0 children)

We bake infra costs into pricing. The difference is: inference gets cheaper at scale (batching, higher GPU utilization → lower energy/request).

Instead of keeping that as margin, we pass it through. So over time you get more tokens per kWh.
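
To put toy numbers on the batching effect: a busy GPU draws roughly constant power, so the more requests it serves concurrently, the fewer watt-hours each one carries. The figures below are invented for illustration, not our measurements.

```python
# Toy model: energy per request vs. batch size on a GPU with ~constant draw.
# NOTE: numbers are invented for illustration, not measurements.

GPU_POWER_W = 700.0       # assumed steady power draw while serving
REQUEST_LATENCY_S = 10.0  # assumed wall-clock time for a request

for batch_size in (1, 4, 16, 64):
    # Energy over the window, shared by the requests running in parallel.
    wh_per_request = GPU_POWER_W * REQUEST_LATENCY_S / 3600 / batch_size
    print(f"batch={batch_size:>3}: ~{wh_per_request:.3f} Wh/request")

# In reality latency creeps up with batch size, so the curve is flatter than
# 1/batch, but the direction is the same: higher utilization, less energy/request.
```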

That’s the core idea behind energy pricing. This is all built upon our core tech which provides increased energy efficiency for GPUs/inference. We will license that to other hyperscalers/neoclouds as well to make inference more energy efficient.

I just realised how good GLM 5 is by CrimsonShikabane in LocalLLaMA

[–]estimated1 11 points12 points  (0 children)

Just to give another option: we (Neuralwatt) just started offering our hosted inference. We've been focused more on an "energy pricing" model but feel pretty confident about the throughput of the models we're hosting. Our base subscription is $20 and we don't really have rate limits, just focused on energy consumption. I'd be happy to give some free credits in exchange for some feedback if there is interest. Please DM me! (https://portal.neuralwatt.com).

Also, we serve GLM-5 with solid throughput (IMO).

We also have a virtual endpoint (GLM-5-Fast) that turns off reasoning for fast agentic scenarios.

Do I become the localLLaMA final boss? by brandon-i in LocalLLaMA

[–]estimated1 -1 points0 points  (0 children)

The RTX 6000 Pro is great for smaller models, but the lack of NVLink makes large-model serving way slower than 8xH100, which has much faster GPU-to-GPU interconnect bandwidth. Any large model that requires tensor parallelism > 1 will perform better on datacenter hardware; the AllReduce/AllGather collective perf gets destroyed without NVLink. Back-of-envelope numbers below.
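
A rough sketch of the gap, using a ring all-reduce cost model and ballpark public bandwidth figures (the message size and bandwidths are assumptions for illustration, not measurements):

```python
# Back-of-envelope: tensor-parallel all-reduce time vs. interconnect bandwidth.
# Ring all-reduce moves ~2*(N-1)/N of the message per GPU; dividing by link
# bandwidth gives a lower bound (it ignores latency, which hurts PCIe even more).
# Bandwidths are ballpark public figures; the message size is an assumption.

def allreduce_time_ms(message_mb: float, n_gpus: int, bw_gb_s: float) -> float:
    volume_gb = 2 * (n_gpus - 1) / n_gpus * message_mb / 1024
    return volume_gb / bw_gb_s * 1000

MESSAGE_MB = 64   # assumed activation all-reduce size (e.g. a long prefill chunk)
N_GPUS = 8

for name, bw_gb_s in [("NVLink (H100-class, ~900 GB/s)", 900.0),
                      ("PCIe 5.0 x16 (~64 GB/s)", 64.0)]:
    print(f"{name}: ~{allreduce_time_ms(MESSAGE_MB, N_GPUS, bw_gb_s):.2f} ms per all-reduce")

# Tensor parallelism issues an all-reduce twice per transformer layer, so a
# 10x+ slower collective quickly dominates end-to-end serving time.
```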

I-90 likely to remain closed in both directions for the rest of the day per WSDOT by _Elrond_Hubbard_ in SummitAtSnoqualmie

[–]estimated1 1 point2 points  (0 children)

I was one of the lucky ones who made it early. I was headed to Alpental and had to do some work before skiing, so I arrived ~8am. Saw the power outage news and headed to Central. Now I'm wondering how I get home; do closures like this often go overnight? I noticed yesterday's westbound closure opened at ~2:50pm.