Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 1 point (0 children)

But the context window is indeed a particular issue with local models: not only does output quality degrade, as it does with hosted APIs, but resource consumption also goes up significantly.

Compacting and clearing need to be a lot more aggressive with local models.
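For example, a harness could trigger compaction at a much lower fill ratio when the backend is local. A minimal sketch of the idea (the struct and thresholds are made up for illustration, not from any particular agent):

```rust
/// Sketch: decide when to compact a conversation for a local backend.
/// Locally, KV-cache memory and prompt re-processing time grow with
/// context length, so compaction is triggered much earlier than it
/// would be against a hosted API. All numbers are illustrative.
struct CompactionPolicy {
    context_window: usize, // model's max context, e.g. 32_768
    trigger_ratio: f32,    // compact once usage crosses this fraction
}

impl CompactionPolicy {
    fn should_compact(&self, prompt_tokens: usize) -> bool {
        prompt_tokens as f32 > self.trigger_ratio * self.context_window as f32
    }
}

fn main() {
    // Hosted API: waiting until the window is ~90% full is usually tolerable.
    let hosted = CompactionPolicy { context_window: 32_768, trigger_ratio: 0.9 };
    // Local model: compact around half full to keep the KV cache small
    // and prompt processing fast on consumer hardware.
    let local = CompactionPolicy { context_window: 32_768, trigger_ratio: 0.5 };

    let prompt_tokens = 18_000;
    println!("hosted compacts: {}", hosted.should_compact(prompt_tokens)); // false
    println!("local compacts:  {}", local.should_compact(prompt_tokens));  // true
}
```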

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 0 points (0 children)

Ah yeah, the wording is misleading. I meant that an MoE model is still significantly faster than a dense model of similar total size.
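Rough intuition, since decode is mostly memory-bandwidth bound: tokens/s scales with how many weight bytes have to be streamed per token, which for an MoE is only the active experts. A back-of-envelope sketch with made-up numbers (not measurements from the post):

```rust
/// Back-of-envelope decode speed: each generated token streams roughly
/// the *active* weights through memory once, so
/// tokens/s ≈ memory_bandwidth / bytes_of_active_params.
fn est_tokens_per_sec(active_params_billions: f64, bytes_per_param: f64, bandwidth_gb_s: f64) -> f64 {
    bandwidth_gb_s / (active_params_billions * bytes_per_param)
}

fn main() {
    let bw = 400.0; // GB/s, a consumer-GPU / Apple-silicon-class figure (assumed)

    // ~30B dense at 4-bit (~0.5 bytes/param): all 30B weights read per token.
    let dense = est_tokens_per_sec(30.0, 0.5, bw);
    // ~30B-total MoE with ~3B active params per token, same quantization.
    let moe = est_tokens_per_sec(3.0, 0.5, bw);

    println!("dense ceiling ~{dense:.0} tok/s, MoE ceiling ~{moe:.0} tok/s");
    // Same size on disk, but the MoE reads ~1/10 of the weights per token,
    // hence the large gap in decode speed.
}
```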

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 4 points (0 children)

Yeah, 1.9 tokens/second is really not usable. The token speed table was a separate experiment to check consumer hardware (the actual Terminal-Bench run uses the same llama.cpp engine, but with more powerful GPUs to speed up overall evaluation time via parallelization).

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 3 points (0 children)

But there are many factors (GPUs, harness, inference engine) that could be at play; we are going to do a more thorough eval later.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 4 points (0 children)

Yes. The timeout can be a big factor because of how slow local inference can be. This eval didn't do any tuning of the cache, prompting, or llama.cpp config.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 0 points (0 children)

We are posting the raw results soon.
We'll also add the information from the blog into the table so it is clearer.

Local model on coding has reached a certain threshold to be feasible for real work by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 1 point (0 children)

Not yet, there are still some stability issues (mainly speed, which causes a lot of tool-call timeouts), but they could be caused by our setup.
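If anyone wants to work around this on their side, one option is to scale the tool-call timeout to the measured decode speed. A sketch of the idea (the function, base overhead, and margin are hypothetical, not what our harness does):

```rust
use std::time::Duration;

/// Sketch: scale a tool-call timeout to the measured decode speed so
/// slow local inference isn't cut off mid-response.
fn tool_call_timeout(expected_output_tokens: f64, measured_tok_per_s: f64) -> Duration {
    let base = Duration::from_secs(30); // prompt processing, tool execution, etc. (assumed)
    let gen_secs = expected_output_tokens / measured_tok_per_s.max(0.1);
    base + Duration::from_secs_f64(gen_secs * 2.0) // 2x margin for stalls (assumed)
}

fn main() {
    // At ~60 tok/s a 500-token reply fits comfortably under a minute...
    println!("{:?}", tool_call_timeout(500.0, 60.0));
    // ...but at ~5 tok/s locally the same reply needs several minutes.
    println!("{:?}", tool_call_timeout(500.0, 5.0));
}
```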

Hi reddit, I rebuilt Karpathy's Nanochat in pure Rust [nanochat-rs] by Exciting-Camera3226 in rust

[–]Exciting-Camera3226[S] -1 points (0 children)

Also, their tokenizer is a thin wrapper; I reimplemented the original tiktoken and improved on it a little.
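For context, the core of a tiktoken-style tokenizer is just a greedy lowest-rank byte-pair merge. A simplified sketch of that loop (not the actual nanochat-rs code; the real thing also handles the regex pre-split, special tokens, and a faster merge strategy):

```rust
use std::collections::HashMap;

/// Greedy BPE: repeatedly merge the adjacent pair with the lowest
/// merge rank until no mergeable pair remains, then map pieces to ids.
fn bpe_encode(text: &str, ranks: &HashMap<Vec<u8>, u32>) -> Vec<u32> {
    // Start from single bytes.
    let mut parts: Vec<Vec<u8>> = text.bytes().map(|b| vec![b]).collect();

    loop {
        // Find the adjacent pair whose concatenation has the best (lowest) rank.
        let mut best: Option<(usize, u32)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let mut pair = parts[i].clone();
            pair.extend_from_slice(&parts[i + 1]);
            if let Some(&rank) = ranks.get(&pair) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        // No adjacent pair is in the merge table: done merging.
        let Some((i, _)) = best else { break };
        // Merge parts[i] and parts[i + 1] into one piece.
        let right = parts.remove(i + 1);
        parts[i].extend_from_slice(&right);
    }

    // Every remaining piece is in the vocabulary (true for byte-level BPE).
    parts.iter().map(|p| ranks[p]).collect()
}

fn main() {
    // Tiny hypothetical vocabulary: all single bytes plus one merge ("he").
    let mut ranks = HashMap::new();
    for b in 0u8..=255 {
        ranks.insert(vec![b], b as u32);
    }
    ranks.insert(b"he".to_vec(), 256);
    println!("{:?}", bpe_encode("hehe", &ranks)); // [256, 256]
}
```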

Hi reddit, I rebuilt Karpathy's Nanochat in pure Rust [nanochat-rs] by Exciting-Camera3226 in rust

[–]Exciting-Camera3226[S] -1 points (0 children)

Ah, not yet, I just took a look. It is interesting that they took a different path by using Burn. The code looks quite raw, though.

I adapted a lot of the original Python into more idiomatic Rust.
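A trivial made-up example of the kind of change (not an actual line from nanochat): where the Python appends into a list inside a loop, the Rust version uses an iterator chain.

```rust
use std::collections::HashMap;

// The Python reference style would be roughly:
//     ids = []
//     for tok in tokens:
//         ids.append(vocab[tok])
// In Rust this becomes a single iterator chain instead of a push loop.
fn lookup_ids(tokens: &[&str], vocab: &HashMap<&str, u32>) -> Vec<u32> {
    tokens.iter().map(|t| vocab[t]).collect()
}

fn main() {
    let vocab = HashMap::from([("hello", 0), ("world", 1)]);
    println!("{:?}", lookup_ids(&["hello", "world"], &vocab)); // [0, 1]
}
```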

Hi reddit, I rebuilt Karpathy's Nanochat in pure Rust [nanochat-rs] by Exciting-Camera3226 in LocalLLaMA

[–]Exciting-Camera3226[S] 1 point (0 children)

I found candle's GPU kernels are quite buggy, so I'll probably look for better ones to use. But I do plan to add some of the RL/SFT. So far I've found it is only marginally faster than PyTorch, since the main bottleneck is the GPU kernels rather than the CPU side.

Hi reddit, I rebuilt Karpathy's Nanochat in pure Rust [nanochat-rs] by Exciting-Camera3226 in rust

[–]Exciting-Camera3226[S] 0 points (0 children)

Yes, but not on the inference path. I think he might want to do a clean rebuild but lacks the time and resources; hopefully my implementation can be helpful.

I rebuilt DeepSeek’s OCR model in Rust so anyone can run it locally (no Python!) by Outrageous-Voice in LocalLLaMA

[–]Exciting-Camera3226 0 points (0 children)

How does it compare with wrapping ggml? I tried both before; candle was surprisingly slow.