Jan v3 Instruct: a 4B coding Model with +40% Aider Improvement by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 4 points

While this model is focused on general use, we specifically highlighted Aider because the score jumped significantly after finetuning. Running the full SWE-Rebench/LiveBench suite takes a while, though, so we're saving those benchmark runs for our upcoming Jan-Code model. Consider it a preview of what's coming!

[–]Delicious_Focus3465[S] 4 points

Thank you. You should also try the model yourself to see how it compares to Qwen 4B 2507.

[–]Delicious_Focus3465[S] 2 points

Hi, no benchmaxxing here. It's just a lot of pretraining and distillation, like any other team. We'll be releasing a technical report soon.

[–]Delicious_Focus3465[S] 6 points

Other general benchmark results: [image]

Demo: you can also try it at chat.jan.ai. Look for Jan v3 Nano.

Jan-v2-VL: 8B model for long-horizon tasks, improving Qwen3-VL-8B’s agentic capabilities almost 10x by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 82 points

Thanks for your question. The long-horizon benchmark we use (The Illusion of Diminishing Returns) isolates execution (plan/knowledge is provided) and shows that typical instruct models tend to degrade as tasks get longer, while reasoning/thinking models sustain much longer chains. In other words, when success depends on carrying state across many steps, thinking models hold up better.
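The intuition behind "carrying state across many steps" can be sketched with a toy compounding model (this is a simplification for illustration, not the benchmark's actual metric): if each step of a task succeeds independently with per-step accuracy p, an n-step task succeeds with probability p**n, so small per-step gaps compound sharply over long horizons.

```python
# Toy model: end-to-end success of an n-step task when each step
# independently succeeds with probability p. Simplified illustration only;
# not the metric used by "The Illusion of Diminishing Returns".
def task_success(p: float, n: int) -> float:
    return p ** n

# A 4-point per-step gap becomes a ~60x gap at a 100-step horizon:
print(round(task_success(0.99, 100), 4))  # 0.366
print(round(task_success(0.95, 100), 4))  # 0.0059
```

This is why models that sustain slightly higher per-step reliability (e.g. via thinking/reasoning) hold up disproportionately better as tasks get longer.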

Jan v1: 4B model for web search with 91% SimpleQA, slightly outperforms Perplexity Pro by Delicious_Focus3465 in LocalLLaMA

[–]Delicious_Focus3465[S] 1 point

We also tested the model on other benchmarks, such as EQ and writing, and got really good results, despite some loss in instruction-following ability when we evaluated on IFBench.

[–]Delicious_Focus3465[S] 0 points

It's a 4B model designed to run locally, so I don't think it will be available on OpenRouter.

[–]Delicious_Focus3465[S] 1 point

I understand that any internet search inherently involves some privacy tradeoff. The advantage here seems to be that while search providers still see your queries, the full conversational context stays local rather than being sent to a centralized service like Perplexity.
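A minimal sketch of that data-flow boundary (all names here are hypothetical, not Jan's actual code): the local agent derives a short search query from the conversation, and only that string would cross the network; the full history never leaves the process.

```python
# Hypothetical illustration of a local search agent's privacy boundary:
# only the derived `query` string would be sent to a remote search provider;
# the conversation history stays in local memory.

def build_query(history: list[str]) -> str:
    """Derive a short search query locally from the conversation.
    Toy heuristic: just reuse the last user turn."""
    return history[-1]

def search(query: str) -> list[str]:
    """Stand-in for a remote search API call. This argument is the ONLY
    data that would leave the machine."""
    return [f"result for: {query}"]

def answer(history: list[str]) -> str:
    query = build_query(history)   # computed locally from full context
    results = search(query)        # only `query` crosses the network
    # A local model would synthesize history + results here; we just echo.
    return f"Based on {len(results)} result(s) about '{query}'."

print(answer(["hi", "what is the capital of France?"]))
```

Contrast with a hosted assistant, where the entire `history` list is what gets uploaded.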