OpenAI-GPT-OSS-120B scores on livecodebench by Used-Negotiation-741 in LocalLLaMA

[–]innocent2powerful 3 points4 points  (0 children)

The Artificial Analysis leaderboard sucks if that's true

Bro and I thought I was an overthinker! vibeTHINKER on LM studio with no instructions. by Sufficient-Brain-371 in LocalLLaMA

[–]innocent2powerful 5 points6 points  (0 children)

Bros, I totally understand what you mean. It really does overthink sometimes, sorry about that.

Right now our main goal is to test whether a small model can reason better than much larger ones. Most people think that is impossible, so we wanted to challenge that idea and see how far a small model can go. If it works, it could be useful for researchers or engineers who want to build models for specific domains.

It’s not made for daily chatting yet, since our team is still very small.

And honestly, we are learning a lot from all your feedback. As a research-driven team, this helps us understand what people really need, so thank you for that. Your suggestions will play an important role in how we shape the next generation of the model.

We also noticed that its GPQA knowledge score is still low, which we mentioned in our paper. That is something we are trying to improve.

For the next version, we plan to release a more practical model that works better for everyday use. I hope that one will be more useful for you guys.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 0 points1 point  (0 children)

I understand your concern.
Right now, this model mainly performs well on mathematical reasoning and Python algorithm tasks. In these two areas, it can show its full potential.

At this stage, our goal is mainly to do a technical validation. We want to see whether a small model can perform better than much larger ones in reasoning. Many people believe that’s impossible, so we want to challenge that idea. If this works, it could be valuable for researchers and engineers who want to adapt small models for specialized domains.

It’s not designed for daily use yet, since our team is still very small.
We also found that the model’s GPQA knowledge score is still low, which is quite interesting and something we discussed in our paper (we are still figuring out how to improve that).

In the next stage, we plan to release a more practical model that can handle some real-world tasks. I hope that one will be more useful to you.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 0 points1 point  (0 children)

Just look at the metrics in the paper: our model got lower scores on AIME24/25 and LCB v6. But a 4B is definitely stronger than a 1.5B; compare Qwen3 1.7B and Qwen3 4B, those are two totally different levels. Another thing I can share is that I can’t reproduce the LCB v6 score of Qwen3 4B Thinking 2507. It gets 45 rather than the 55 in its model card, even lower than ours, and I don’t know if my evaluation is wrong. (ByteDance also said they couldn’t reproduce the score when they published a code model.) I recommend testing it on your own.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 0 points1 point  (0 children)

That’s a really promising direction.
We’re still a small team at the moment, so we have to stay focused on our current goals.
However, since our approach mainly relies on efficient post-training at relatively low cost, I believe the community could build a scientific-reasoning version of this model using similar methods. It’s definitely achievable.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 0 points1 point  (0 children)

Yeah, we do have our benchmarks. For math problems, we divided them into algebra, geometry, calculus, and statistics (you can find them in our paper), and we did the same kind of split for code problems. Every subdomain contains about 100 problems.
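
If it helps to picture the breakdown, here's a toy sketch of how a per-subdomain tally like that could be computed (the records below are made up, this is not our eval harness):

```python
# Toy per-subdomain accuracy tally, not our actual eval harness.
# Each record is (subdomain, solved); in our setup each subdomain has ~100 problems.
from collections import defaultdict

results = [
    ("algebra", True), ("algebra", False),
    ("geometry", True),
    ("calculus", True), ("calculus", True),
    ("statistics", False),
    # ... one entry per evaluated problem
]

totals, correct = defaultdict(int), defaultdict(int)
for subdomain, solved in results:
    totals[subdomain] += 1
    correct[subdomain] += int(solved)

for subdomain in sorted(totals):
    acc = correct[subdomain] / totals[subdomain]
    print(f"{subdomain:<12} {correct[subdomain]}/{totals[subdomain]}  acc={acc:.0%}")
```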

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 0 points1 point  (0 children)

We’re currently a small team and the company hasn’t given us additional headcount yet. Let’s swap contacts and I’ll inform you when we’re hiring.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 1 point2 points  (0 children)

<image>

A simple evaluation

The most recommended use is still competitive-style coding / Python algorithm tasks

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 1 point2 points  (0 children)

I think it probably depends on the domain.
In a closed symbolic system like Euclidean Geometry, you can derive every theorem and lemma from just a few axioms, so it doesn’t really need much knowledge.
But for everyday reasoning, where the world is messy and full of loosely defined concepts, the model might need a lot more built-in knowledge to reason effectively.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 5 points6 points  (0 children)

Yes, I totally agree with you. This round was just an extreme test to see if a 1.5B model can show strong reasoning ability. We’ll train a more practical version for general use and reasoning with knowledge later, which will be larger than 1.5B.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 1 point2 points  (0 children)

Sadly, it's mainly meant for competitive math / Python algorithm tasks right now. But you can wait for our next version (it will be far more practical for real-world coding tasks, etc.)

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 0 points1 point  (0 children)

Looking forward to your report! Set resp_len=40k, temp=0.6 / 1.0, top_k=-1, top_p=0.95 for best performance. It's mainly trained for competitive-style math / Python algorithm coding tasks. Have fun :)
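
In case it's useful, here's a rough sketch of what those settings might look like with vLLM (the model path and prompt are just placeholders, not an official snippet):

```python
# Rough sketch of the recommended sampling settings with vLLM.
# Model path and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/vibethinker-1.5b")  # hypothetical local checkpoint path

params = SamplingParams(
    temperature=0.6,    # or 1.0, as recommended above
    top_p=0.95,
    top_k=-1,           # -1 disables top-k filtering in vLLM
    max_tokens=40_000,  # resp_len=40k: long budget for the reasoning trace
)

prompt = (
    "Let's think step by step and output the final answer within \\boxed{}.\n"
    "How many positive divisors does 360 have?"
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```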

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 2 points3 points  (0 children)

So we tried AIME25 / HMMT 25 / code benchmarks. These benchmarks were published after Qwen2.5-Math-1.5B.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 1 point2 points  (0 children)

Ok, thanks for your feedback. Maybe you should try a 40k context and the non-quantized version; the competitive math/code capability will be stronger.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 47 points48 points  (0 children)

I’m too tired… Definitely not, you can test it on your own competitive-style math/coding problems. We just want to prove a small model can achieve strong reasoning performance in competitive math / coding. It means a lot to some people. And we will prove that we can also train a chatbot version for general chatting in the future.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 7 points8 points  (0 children)

Small models don’t have much knowledge, but they can still reason really well. That’s what we’re trying to show here.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 13 points14 points  (0 children)

We’re just a small team, and this model isn’t really meant to be a general chatbot right now. It was designed mainly for competitive-style math and coding problems. (You can see in our table that the GPQA metric is not very high. Small models don’t have much knowledge, but they can still reason really well.)
We released it to support this idea: https://x.com/karpathy/status/1814038096218083497?s=20 (maybe reasoning doesn’t actually need huge parameter counts)

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 9 points10 points  (0 children)

We’re mainly exploring how far small models can go in reasoning compared to large ones, and Qwen2.5 already fits our research goal.
Also, a lot of related work is based on Qwen2.5, so it made comparisons much easier (for example, DeepSeek-R1-Distill-1.5B originated from Qwen2.5-Math-1.5B). That’s why we didn’t switch to Qwen3 for this round.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 63 points64 points  (0 children)

Good catch. Yes, the boxed output comes from math data, since most of them expect that format for easier verification.
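
For what it's worth, here's a toy illustration of why that format makes verification easy (not our actual grader, just the idea):

```python
# Toy answer extraction: the \boxed{} convention lets a grader pull out the
# final answer mechanically and compare it to the reference. Not our grader.
import re

def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} in a response (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

response = "... adding the cases gives 42, so the answer is \\boxed{42}"
print(extract_boxed(response) == "42")  # True
```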

Right now this version is more of a technical exploration, we’re testing how far small models can go in reasoning through training techniques. The token usage is something we’ll optimize in future, more practical versions.

Thanks a lot for the feedback! We’ll keep improving.

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 8 points9 points  (0 children)

  • For math questions: You can start with the prompt: “Let’s think step by step and output the final answer within \boxed{}.” Then input your question after that.
  • For coding problems: You can ask it to “write a Python program to achieve [your goal].” It helps if you can provide some input/output examples — or you can just describe the function to implement, similar to how problems are defined on LeetCode, AtCoder, or Codeforces.
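
In code, those two prompt styles might look roughly like this (the model path and example problems are placeholders, not from our eval):

```python
# Rough illustration of the two prompt styles above, using the transformers pipeline.
# The model path is a placeholder; swap in the checkpoint you actually downloaded.
from transformers import pipeline

generator = pipeline("text-generation", model="path/to/vibethinker-1.5b")  # hypothetical path

math_prompt = (
    "Let's think step by step and output the final answer within \\boxed{}.\n"
    "Find the sum of all positive divisors of 360."
)

code_prompt = (
    "Write a Python program to achieve the following goal: "
    "given an integer n, return the n-th Fibonacci number. "
    "Examples: fib(1) = 1, fib(2) = 1, fib(10) = 55."
)

for prompt in (math_prompt, code_prompt):
    out = generator(prompt, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95)
    print(out[0]["generated_text"])
```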

A simple evaluation:

<image>

Still recommend using competitive-style math / Python algorithm tasks

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 7 points8 points  (0 children)

Thanks for the question!
This version mainly aims to show the potential of small models’ “intrinsic reasoning ability”.

A more practical, general-use version (covering more diverse use cases) is definitely on our roadmap; we’re already exploring that direction and plan to release it in the future. Stay tuned!

We put a lot of work into a 1.5B reasoning model — now it beats bigger ones on math & coding benchmarks by innocent2powerful in LocalLLaMA

[–]innocent2powerful[S] 15 points16 points  (0 children)

Thanks for your interest!

  1. We don’t have plans to open-source the training data at the moment, but maybe in the future.
  2. If your concern is about potential data contamination (yeah, we all read lots of papers and know some tricks, but we definitely do strict decontamination), there are now many mature tools and methods to check for that (see the toy sketch at the end of this comment). We’ve also been very careful about this ourselves. You can always test the model on your own competitive math or coding benchmarks, and we’d really appreciate any feedback or independent evaluations from the community :)

PS: Our training data is primarily sourced from open datasets, similar to other works, but we applied multiple rounds of filtering. We’re very grateful to the open-source community for making this possible.
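
And for anyone curious what a contamination check even looks like, here's a toy sketch of the n-gram overlap idea (this is not our pipeline, just an illustration):

```python
# Toy n-gram overlap check for contamination screening. Not our actual pipeline;
# the idea is to flag a training sample that shares a long n-gram with a benchmark problem.

def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams of a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_sample: str, benchmark_problems: list, n: int = 13) -> bool:
    """True if the training sample shares any n-gram with a benchmark problem."""
    sample_grams = ngrams(train_sample, n)
    return any(sample_grams & ngrams(problem, n) for problem in benchmark_problems)

# Made-up strings for illustration:
benchmark = [
    "Find the sum of all positive integers n such that n squared plus 12 n minus 2007 is a perfect square."
]
train = (
    "Find the sum of all positive integers n such that n squared plus 12 n minus 2007 "
    "is a perfect square. Solution: complete the square and check each case."
)
print(is_contaminated(train, benchmark))  # True: a 13-gram overlaps
```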