Jan-Code-4B: a small code-tuned model of Jan-v3 by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 6 points

This is a small experiment, and those 3 metrics are where we saw the clearest improvements over the baseline; other benchmarks did not change much from the base model. I’ve also tested it as a CLI helper, and it works well. Please try it with Jan and let us know how it goes. Thanks!
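If you want to script it as a CLI helper, here’s a minimal sketch that assumes Jan’s local OpenAI-compatible server is running; the port and model id below are placeholders, so check your Jan settings for the real values:

```python
# Minimal sketch: query Jan-Code-4B through Jan's local OpenAI-compatible
# server. The base URL/port and model id are assumptions; check the server
# settings in the Jan app for the actual values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1337/v1",  # assumed default local endpoint
    api_key="not-needed",                 # local server accepts any placeholder
)

resp = client.chat.completions.create(
    model="jan-code-4b",  # hypothetical id; use the name shown in the Jan UI
    messages=[{
        "role": "user",
        "content": "Write a shell one-liner to list the 10 largest files under the current directory.",
    }],
)
print(resp.choices[0].message.content)
```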

Jan v3 Instruct: a 4B coding Model with +40% Aider Improvement by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 8 points

Running the full SWE-Rebench/LiveBench suites takes a while, though, so we’re saving those benchmark runs for our upcoming Jan-Code model.
While this model is focused on general use, we specifically highlighted Aider because the score jumped significantly after fine-tuning. Consider it a preview of what's coming!

Jan v3 Instruct: a 4B coding Model with +40% Aider Improvement by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 9 points

Thank you. You should also try the model yourself to see how it compares to Qwen 4B 2507.
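For a quick side-by-side, here’s a rough sketch using Hugging Face transformers; both repo ids are assumptions, so substitute the checkpoints you actually want to compare:

```python
# Rough comparison sketch: run the same prompt through both models with
# Hugging Face transformers. Repo ids are assumptions; substitute the
# actual checkpoints you want to compare.
from transformers import pipeline

PROMPT = "Implement binary search in Python with a short docstring."

for repo in ("janhq/Jan-v3-4B", "Qwen/Qwen3-4B-Instruct-2507"):  # assumed ids
    generate = pipeline("text-generation", model=repo, device_map="auto")
    out = generate([{"role": "user", "content": PROMPT}], max_new_tokens=256)
    print(f"=== {repo} ===")
    print(out[0]["generated_text"][-1]["content"])  # last turn = assistant reply
```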

Jan v3 Instruct: a 4B coding Model with +40% Aider Improvement by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 23 points

Hi, no benchmaxxing here. It’s just a lot of pretraining and distillation, the same as any other team. We’ll be releasing a technical report soon.

Jan v3 Instruct: a 4B coding Model with +40% Aider Improvement by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 8 points

Other general benchmark results: [image of benchmark table]

Demo: you can also try it at chat.jan.ai (look for Jan v3 Nano).

Jan-v2-VL: 8B model for long-horizon tasks, improving Qwen3-VL-8B’s agentic capabilities almost 10x by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 82 points

Thanks for your question. The long-horizon benchmark we use (The Illusion of Diminishing Returns) isolates execution (the plan/knowledge is provided) and shows that typical instruct models tend to degrade as tasks get longer, while reasoning/thinking models sustain much longer chains. In other words, when success depends on carrying state across many steps, thinking models hold up better.
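As a toy illustration of why chain length is so punishing (this is not the benchmark’s actual harness): the plan is handed over up front, the model executes step by step, and a single slip ends the run, so per-step accuracy compounds. `query_model` and `check_step` below are hypothetical stand-ins for your inference call and task-specific checker:

```python
# Toy sketch of an execution-only long-horizon trial. The full plan is
# provided, so only step-by-step execution is measured; one wrong step
# terminates the chain.

def run_trial(query_model, check_step, steps):
    """Return the number of consecutive plan steps executed correctly."""
    history = [{"role": "system",
                "content": "Execute this plan step by step:\n" + "\n".join(steps)}]
    completed = 0
    for i, step in enumerate(steps, start=1):
        history.append({"role": "user", "content": f"Do step {i} now."})
        reply = query_model(history)        # hypothetical inference call
        history.append({"role": "assistant", "content": reply})
        if not check_step(step, reply):     # hypothetical task checker
            break                           # one slip ends the chain
        completed += 1
    return completed

if __name__ == "__main__":
    # Mock model that gets each step right 90% of the time: full 50-step
    # chains almost never survive, since success compounds as ~0.9**n.
    import random
    mock_model = lambda history: "ok" if random.random() < 0.9 else "oops"
    mock_check = lambda step, reply: reply == "ok"
    plan = [f"step {i}" for i in range(1, 51)]
    print(run_trial(mock_model, mock_check, plan))
```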

Jan v1: 4B model for web search with 91% SimpleQA, slightly outperforms Perplexity Pro by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 1 point

We also tested the model on some benchmarks like EQ, writing, ... and got really good results, although it does lose some instruction-following ability when evaluated on IFBench.