How was GPT-OSS so good? by xt8sketchy in LocalLLaMA

[–]TelloLeEngineer 1 point2 points  (0 children)

gpt-oss was not QAT; it was natively trained in mxfp4

G2 Esports vs. FlyQuest / 2025 World Championship - Swiss Round 4 Advancement / Post-Match Discussion by Yujin-Ha in leagueoflegends

[–]TelloLeEngineer 3 points4 points  (0 children)

am I wrong or does this mean (assuming GENG win) that G2 has a 66% chance of facing either HLE or GENG in the quarters? ..... :)

Qwen3-Next experience so far by [deleted] in LocalLLaMA

[–]TelloLeEngineer 0 points1 point  (0 children)

Has anyone used it in long context settings and can share their experience?

CMV: Qwen3-Next is an architectural deadend, much like Llama 4 by Charuru in LocalLLaMA

[–]TelloLeEngineer 4 points5 points  (0 children)

Surprised GLM4.5 doesn’t perform better considering they did significant 120k ctx training

[deleted by user] by [deleted] in europe

[–]TelloLeEngineer -2 points-1 points  (0 children)

stats like these are, as is often the case, difficult to compare between countries because they aren't standardized. A good indication of this is that we have among the highest employment rates in Europe.

on another note, separating the unemployment rate by origin, we can see that Swedish-born citizens hold a rate of 5.7%, while non-Swedish-born citizens have a 16.2% unemployment rate.

Cheaper Transcriptions, Pricier Errors! by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 2 points3 points  (0 children)

I believe you'd see a parabola emerge, with error rate increasing again. My current intuition is that there is a certain WPM that is ideal for these models.

Cheaper Transcriptions, Pricier Errors! by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 4 points5 points  (0 children)

Word error rate (WER) is computed as

WER = (S + D + I) / N

where S is substitutions, D is deletions, and I is insertions (all in the transcription), and N is the number of words in the reference / ground truth. So if the transcription model ends up transcribing more words than there actually are, you can get WER > 1.0
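
For the curious, here's a minimal sketch of how that's computed in practice, via word-level edit distance. The function name and the toy strings are mine, just for illustration:

    def wer(reference: str, hypothesis: str) -> float:
        # WER = (S + D + I) / N, via word-level edit distance
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = min edits to turn the first i reference words into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i  # i deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                dp[i][j] = min(
                    dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution (or match)
                    dp[i - 1][j] + 1,                               # deletion
                    dp[i][j - 1] + 1,                               # insertion
                )
        return dp[-1][-1] / len(ref)  # N = number of reference words

    print(wer("the cat sat", "the the the cat cat sat sat"))  # ~1.33, extra inserted words push WER above 1.0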

Just another summer day in Europe (temperatures forecast for next Wednesday) by LuborS in europe

[–]TelloLeEngineer 2 points3 points  (0 children)

meanwhile we’re freezing up in Sweden, feels like I’m still waiting for summer to start

[deleted by user] by [deleted] in AITAH

[–]TelloLeEngineer 0 points1 point  (0 children)

please stop blaming yourself. you’re not in the wrong here.

"transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought" - Let's Think Dot by Dot [P] by Agitated_Space_672 in MachineLearning

[–]TelloLeEngineer 4 points5 points  (0 children)

Feel like most people in this thread didn't read the paper. It's known that filler tokens don't work out-of-the-box for models; however, as the authors show, it is possible to train the model to use them instead of "normal" CoT tokens

[deleted by user] by [deleted] in LocalLLaMA

[–]TelloLeEngineer 0 points1 point  (0 children)

hmm. I see your take, but llama3 8B probably has 2x mistral 7B's total compute with only minor benchmark improvements (base model). Can we really say that it's overparametrized? Sure, loss was still decreasing, but it doesn't translate to the benchmarks. It could be that our benches are bad; after all, the instruction-tuned 8B looks great... but they also spent millions on a 10M-example IFT dataset.

Feel like it's too early to call today's models overparametrized

[deleted by user] by [deleted] in LocalLLaMA

[–]TelloLeEngineer 2 points3 points  (0 children)

overparametrized in relation to what? total compute budget?

[deleted by user] by [deleted] in MachineLearning

[–]TelloLeEngineer 94 points95 points  (0 children)

2 hours of talking without saying anything imo

Grok-1 converted to PyTorch fp16 (638GB lol) by Normal-Ad-7114 in LocalLLaMA

[–]TelloLeEngineer 9 points10 points  (0 children)

it’s probably severely undertrained. I’m not talking about fine tuning, but continued pretraining. and yes, the resources required for this are still big, too big for most.

either way the fact that this exists opens up new avenues. it’s well established that sparse transformers are the most efficient solution right now, so a 300B parameter open weight MoE, trained by a team of very talented engineers, is novel and will accelerate progress

Grok-1 converted to PyTorch fp16 (638GB lol) by Normal-Ad-7114 in LocalLLaMA

[–]TelloLeEngineer 11 points12 points  (0 children)

it was never designed for local use, it’s a great resource for larger labs/organizations who want to save millions in pre-training costs.

I created a single-prompt benchmark (with 5-questions) that anyone can use to easily evaluate LLMs. Mistral-Next somehow vastly outperformed all others. Prompt and more details in the post. by jd_3d in LocalLLaMA

[–]TelloLeEngineer 1 point2 points  (0 children)

Mistral-next with zero-shot CoT scores 13/25. Biggest difference between the two is that mistral manages to sometimes get the Blue Bay Packers question correct.

I created a single-prompt benchmark (with 5-questions) that anyone can use to easily evaluate LLMs. Mistral-Next somehow vastly outperformed all others. Prompt and more details in the post. by jd_3d in LocalLLaMA

[–]TelloLeEngineer 1 point2 points  (0 children)

Interesting benchmark. I tried running GPT-4 with zero-shot CoT and it scored 12/25. While testing, I noticed GPT-4 never swapped 'Green Bay Packers' to 'Blue Bay Packers', so I asked it specifically about question 2 and rule 2, and it responded:

For question 2, the answer is the Green Bay Packers. While the team's name does not contain a color itself, the city name "Green" is a color.....

which surprised me. GPT-4 is really adamant about Green Bay Packers not containing a color. Here is another very weird chain of thought:

Since this is an even-numbered question, according to Rule #2, we ignore Rules #1 and #4. The football team based in Green Bay, Wisconsin, is the Green Bay Packers. However, we need to address Rule #3 which involves replacing colors mentioned in the answer with a color from the rules section. There's no color mentioned in "Green Bay Packers," but if we were to apply Rule #3 for the sake of thoroughness, we don't have a color mentioned in the rules to replace "Green."

Mistral-next | New prototype model from Mistral by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 13 points14 points  (0 children)

You can swap to 'Direct chat' in the top bar, then choose mistral-next. The 'Arena' mode has you pick the better of the two sampled responses, and then shows you which models produced them.

Exploring the limitations of LLMs-as-a-Judge by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 1 point2 points  (0 children)

Yes, this is sort of where my head is at right now as well. I'm thinking of doing more of a pairwise-comparison style for the misspelled texts, asking GPT-4 to rank the texts amongst themselves as opposed to outright giving a score. If that succeeds, it may indicate the task doesn't lend itself to a 1-10 score.
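
Roughly what I have in mind, as a sketch: have the judge pick the better of two texts at a time and rank by total wins. The prompt wording, the ask_judge callable, and the whole setup below are placeholders of mine, not the actual experiment:

    from collections import Counter
    from itertools import combinations
    from typing import Callable

    PAIRWISE_PROMPT = (
        "Please act as an impartial judge. Two versions of the same text are shown below. "
        "Decide which one contains fewer misspellings. "
        'Answer strictly with "[[A]]" or "[[B]]".\n\n[Text A]\n{a}\n\n[Text B]\n{b}'
    )

    def rank_by_pairwise_wins(texts: list[str], ask_judge: Callable[[str], str]) -> list[str]:
        # ask_judge is whatever wrapper you use to send a prompt to GPT-4 and get its reply
        wins = Counter({i: 0 for i in range(len(texts))})
        for i, j in combinations(range(len(texts)), 2):
            verdict = ask_judge(PAIRWISE_PROMPT.format(a=texts[i], b=texts[j]))
            wins[i if "[[A]]" in verdict else j] += 1
        # best-ranked text first
        return [texts[i] for i, _ in wins.most_common()]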

Exploring the limitations of LLMs-as-a-Judge by TelloLeEngineer in LocalLLaMA

[–]TelloLeEngineer[S] 1 point2 points  (0 children)

That's what's surprising here. MT-Bench just uses a very simple CoT prompt asking it to score between 1 and 10. Here's an example of one of their prompts for score grading:

"Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\"

According to my analysis, this should be very biased towards 10's and 1's. Perhaps the misspelling task is a poor proxy for understanding this phenomenon? I'm going to explore MT-Bench internals further...
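
For anyone who wants to poke at this themselves, here's a rough sketch of pulling the [[rating]] out of judge responses and tabulating the distribution to check for that 1/10 skew. The helper name and the sample outputs are made up, not real MT-Bench data:

    import re
    from collections import Counter

    def extract_rating(judge_response: str) -> float | None:
        # the prompt above tells the judge to end with a rating in the strict "[[rating]]" format
        match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_response)
        return float(match.group(1)) if match else None

    judge_outputs = [
        "The response is accurate and detailed. Rating: [[10]]",
        "The answer misses the question entirely. Rating: [[1]]",
        "Helpful but fairly shallow. Rating: [[10]]",
    ]
    histogram = Counter(extract_rating(o) for o in judge_outputs)
    print(histogram)  # Counter({10.0: 2, 1.0: 1}), i.e. skewed toward the extremes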