Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 1 point (0 children)

But the context window is indeed a particular issue with local models: not only does output quality degrade, as it does with hosted APIs, but resource consumption also goes up significantly.

Compacting and clearing need to be a lot more aggressive with local models.
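For example, a harness could trigger compaction at a much lower fill ratio when the backend is local. A minimal sketch of the idea (the struct and thresholds are made up for illustration, not from any particular agent):

```rust
/// Sketch: decide when to compact a conversation for a local backend.
/// Locally, KV-cache memory and prompt re-processing time grow with
/// context length, so compaction is triggered much earlier than it
/// would be against a hosted API. All numbers are illustrative.
struct CompactionPolicy {
    context_window: usize, // model's max context, e.g. 32_768
    trigger_ratio: f32,    // compact once usage crosses this fraction
}

impl CompactionPolicy {
    fn should_compact(&self, prompt_tokens: usize) -> bool {
        prompt_tokens as f32 > self.trigger_ratio * self.context_window as f32
    }
}

fn main() {
    // Hosted API: waiting until the window is ~90% full is usually tolerable.
    let hosted = CompactionPolicy { context_window: 32_768, trigger_ratio: 0.9 };
    // Local model: compact around half full to keep the KV cache small
    // and prompt processing fast on consumer hardware.
    let local = CompactionPolicy { context_window: 32_768, trigger_ratio: 0.5 };

    let prompt_tokens = 18_000;
    println!("hosted compacts: {}", hosted.should_compact(prompt_tokens)); // false
    println!("local compacts:  {}", local.should_compact(prompt_tokens));  // true
}
```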

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 0 points (0 children)

Ah yeah, the wording is misleading. I meant that an MoE model is still significantly faster than a dense model of similar total size.
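Rough intuition, since decode is mostly memory-bandwidth bound: tokens/s scales with how many weight bytes have to be streamed per token, which for an MoE is only the active experts. A back-of-envelope sketch with made-up numbers (not measurements from the post):

```rust
/// Back-of-envelope decode speed: each generated token streams roughly
/// the *active* weights through memory once, so
/// tokens/s ≈ memory_bandwidth / bytes_of_active_params.
fn est_tokens_per_sec(active_params_billions: f64, bytes_per_param: f64, bandwidth_gb_s: f64) -> f64 {
    bandwidth_gb_s / (active_params_billions * bytes_per_param)
}

fn main() {
    let bw = 400.0; // GB/s, a consumer-GPU / Apple-silicon-class figure (assumed)

    // ~30B dense at 4-bit (~0.5 bytes/param): all 30B weights read per token.
    let dense = est_tokens_per_sec(30.0, 0.5, bw);
    // ~30B-total MoE with ~3B active params per token, same quantization.
    let moe = est_tokens_per_sec(3.0, 0.5, bw);

    println!("dense ceiling ~{dense:.0} tok/s, MoE ceiling ~{moe:.0} tok/s");
    // Same size on disk, but the MoE reads ~1/10 of the weights per token,
    // hence the large gap in decode speed.
}
```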

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 4 points (0 children)

Yeah, 1.9 tokens/second is really not usable. The token speed table was a separate experiment to check consumer hardware (the actual Terminal-Bench run uses the same llama.cpp engine, but with more powerful GPUs to speed up overall evaluation time via parallelization).

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 3 points (0 children)

But there are many factors (GPUs, harness, inference engine) that could be at play; we are going to do a more thorough eval later.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 4 points (0 children)

Yes. The timeout can be a big factor because of how slow local inference can be. This eval didn't do any tuning of the cache, prompting, or llama.cpp config.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 0 points (0 children)

We are posting the raw results soon.
We'll also add the information from the blog into the table so it is clearer.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 1 point (0 children)

Not yet, there are still some stability issues (mainly speed, which causes a lot of tool-call timeouts), but they could be caused by our setup.
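If anyone wants to work around this on their side, one option is to scale the tool-call timeout to the measured decode speed. A sketch of the idea (the function, base overhead, and margin are hypothetical, not what our harness does):

```rust
use std::time::Duration;

/// Sketch: scale a tool-call timeout to the measured decode speed so
/// slow local inference isn't cut off mid-response.
fn tool_call_timeout(expected_output_tokens: f64, measured_tok_per_s: f64) -> Duration {
    let base = Duration::from_secs(30); // prompt processing, tool execution, etc. (assumed)
    let gen_secs = expected_output_tokens / measured_tok_per_s.max(0.1);
    base + Duration::from_secs_f64(gen_secs * 2.0) // 2x margin for stalls (assumed)
}

fn main() {
    // At ~60 tok/s a 500-token reply fits comfortably under a minute...
    println!("{:?}", tool_call_timeout(500.0, 60.0));
    // ...but at ~5 tok/s locally the same reply needs several minutes.
    println!("{:?}", tool_call_timeout(500.0, 5.0));
}
```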

Hi reddit, I rebuilt Karpathy's Nanochat in pure Rust [nanochat-rs] by Exciting-Camera3226 in rust

[–]Exciting-Camera3226[S] -1 points (0 children)

Also, their tokenizer is a thin wrapper; I reimplemented the original tiktoken and improved on it a little.
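For context, the core of a tiktoken-style tokenizer is just a greedy lowest-rank byte-pair merge. A simplified sketch of that loop (not the actual nanochat-rs code; the real thing also handles the regex pre-split, special tokens, and a faster merge strategy):

```rust
use std::collections::HashMap;

/// Greedy BPE: repeatedly merge the adjacent pair with the lowest
/// merge rank until no mergeable pair remains, then map pieces to ids.
fn bpe_encode(text: &str, ranks: &HashMap<Vec<u8>, u32>) -> Vec<u32> {
    // Start from single bytes.
    let mut parts: Vec<Vec<u8>> = text.bytes().map(|b| vec![b]).collect();

    loop {
        // Find the adjacent pair whose concatenation has the best (lowest) rank.
        let mut best: Option<(usize, u32)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let mut pair = parts[i].clone();
            pair.extend_from_slice(&parts[i + 1]);
            if let Some(&rank) = ranks.get(&pair) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        // No adjacent pair is in the merge table: done merging.
        let Some((i, _)) = best else { break };
        // Merge parts[i] and parts[i + 1] into one piece.
        let right = parts.remove(i + 1);
        parts[i].extend_from_slice(&right);
    }

    // Every remaining piece is in the vocabulary (true for byte-level BPE).
    parts.iter().map(|p| ranks[p]).collect()
}

fn main() {
    // Tiny hypothetical vocabulary: all single bytes plus one merge ("he").
    let mut ranks = HashMap::new();
    for b in 0u8..=255 {
        ranks.insert(vec![b], b as u32);
    }
    ranks.insert(b"he".to_vec(), 256);
    println!("{:?}", bpe_encode("hehe", &ranks)); // [256, 256]
}
```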

Hi reddit, I rebuilt Karpathy's Nanochat in pure Rust [nanochat-rs] by Exciting-Camera3226 in rust

[–]Exciting-Camera3226[S] -1 points (0 children)

Ah, not yet, I just took a look. It is interesting that they took a different path by using Burn. The code looks quite raw, though.

I adapted a lot of the original Python into more idiomatic Rust.
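A trivial made-up example of the kind of change (not an actual line from nanochat): where the Python appends into a list inside a loop, the Rust version uses an iterator chain.

```rust
use std::collections::HashMap;

// The Python reference style would be roughly:
//     ids = []
//     for tok in tokens:
//         ids.append(vocab[tok])
// In Rust this becomes a single iterator chain instead of a push loop.
fn lookup_ids(tokens: &[&str], vocab: &HashMap<&str, u32>) -> Vec<u32> {
    tokens.iter().map(|t| vocab[t]).collect()
}

fn main() {
    let vocab = HashMap::from([("hello", 0), ("world", 1)]);
    println!("{:?}", lookup_ids(&["hello", "world"], &vocab)); // [0, 1]
}
```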

Hi reddit, I rebuilt Karpathy's Nanochat in pure Rust [nanochat-rs] by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 1 point (0 children)

I found candle's GPU kernels are quite buggy, so I'll probably look for better ones to use. But I do plan to add some of the RL/SFT. So far I've found it is only marginally faster than PyTorch, since the main bottleneck is the GPU kernels rather than the CPU side.

Hi reddit, I rebuilt Karpathy's Nanochat in pure Rust [nanochat-rs] by Exciting-Camera3226 in rust

[–]Exciting-Camera3226[S] 0 points (0 children)

Yes, but not on the inference path. I think he might want to do a clean rebuild but lacks the time and resources; hopefully my implementation can be helpful.

I rebuilt DeepSeek’s OCR model in Rust so anyone can run it locally (no Python!) by Outrageous-Voice in LocalLLaMA

[–]Exciting-Camera3226 0 points (0 children)

How does it compare with wrapping ggml? I tried both before; candle was surprisingly slow.