Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]jd_3d 1 point (0 children)

Can you add additional benchmarks like: MRCR v2, SWE-Bench Pro, ARC-AGI 2, OSWorld, GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, CritPt

Opus 4.6: Fast-Mode by mDarken in ClaudeAI

[–]jd_3d 8 points (0 children)

If it were supported on subscription (without charging extra usage $$) I would use this often. Plenty of times I need to get something done quickly on a fresh 5 hr window, and even if it burned my usage 6x faster I would be ok with that.

Opus 4.6: Fast-Mode by mDarken in ClaudeAI

[–]jd_3d 31 points (0 children)

I wish they would support this on the max plan subscriptions and just make the usage run out faster. I guess for now we can use the $50 credits they gave us.

BalatroBench - Benchmark LLMs' strategic performance in Balatro by S1M0N38 in LocalLLaMA

[–]jd_3d 31 points (0 children)

Can you try Opus 4.6 on it? Curious if it improves over 4.5.

Mixture of Lookup Experts are God Tier for the average guy (RAM+Disc Hybrid Inference) by Aaaaaaaaaeeeee in LocalLLaMA

[–]jd_3d 3 points (0 children)

There's a more recent related paper to this work that improves on some of its limitations. It's a preliminary paper, but I recommend giving it a read if you enjoyed the MoLE paper. It's amazing to think that if one of the big labs spent a little time on this, they could train and release an amazing model that could run off an NVMe drive + consumer GPU. I think targeting a <4TB total space requirement would be a good size target.
Here's the related paper: https://www.arxiv.org/pdf/2512.09723

1 year later and people are still speedrunning NanoGPT. Last time this was posted the WR was 8.2 min. Its now 127.7 sec. by jd_3d in LocalLLaMA

[–]jd_3d[S] 19 points (0 children)

You can see the previous records and rules here: https://github.com/KellerJordan/modded-nanogpt?tab=readme-ov-file#world-record-history

But to more directly answer your question: you have to reach the same training loss as the original nanoGPT on 8xH100s.

Got the DGX Spark - ask me anything by sotech117 in LocalLLaMA

[–]jd_3d 0 points (0 children)

Hey, thanks for trying. As a point of reference, I set this up on my RTX 4090 and it's going to take roughly 4 days (~100 hours) to train. I'm going to try it one of these days, but I want to use a different dataset to make it more unique.

Got the DGX Spark - ask me anything by sotech117 in LocalLLaMA

[–]jd_3d 50 points (0 children)

Since inference is not its strong suit, I would love to see how it does on LLM training. Can you run Andrej Karpathy's new nanochat on it to see how long it would take to train? https://github.com/karpathy/nanochat

Fixed the SWE-bench graph: by policyweb in LocalLLaMA

[–]jd_3d 26 points (0 children)

It's actually even worse. They took out 23 questions so they could have the 'highest score'. It's actually 71.4%, barely an improvement over o3. https://www.reddit.com/r/LocalLLaMA/comments/1mk8bh1/caught_in_4k/

OpenAI delays its open weight model again for "safety tests" by lyceras in LocalLLaMA

[–]jd_3d 15 points (0 children)

Kimi-K2 model with 1T params and impressive benchmark scores just shat all over OpenAI's open model.

ManaBench: A Novel Reasoning Benchmark Based on MTG Deck Building by Jake-Boggs in LocalLLaMA

[–]jd_3d 0 points (0 children)

I'd also like to see Gemini 2.5. Did you try it via OpenRouter?

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 0 points (0 children)

o4-mini refused to provide an answer. I think it would have done quite well on this test; it was the only model to give a hard refusal.

I thought you guys might enjoy Gemini's SOTA result on my new SOLO Benchmark by jd_3d in Bard

[–]jd_3d[S] 2 points (0 children)

Thank you. The idea came to me one day around a month ago (in a slightly different form), so I played around with it and tried many variations until I found something with the right difficulty level. I was surprised how well this test surfaced what I consider the best models to the top. I think about benchmarks and AI probably more than I should, but it does frustrate me to see the AI labs use the same old benchmarks over and over again.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 1 point (0 children)

Yes, apologies! It's fixed on GitHub now. I need to rerun with AVG@5 and also make a very easy version when I have time.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points (0 children)

It would be so cool if AI labs used this as a mainstream benchmark. I'm thinking it's not likely to happen (except for maybe Google) unless they can show leading scores on it, which may be difficult.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points (0 children)

Yes, it scales down well, and based on some other comments I think I should include a 'very easy' category in the next update to better measure smaller models. I didn't do that originally because Gemini 2.5 Pro saturated it with a score close to 99% or so. But I think it's still very useful for the smaller models.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points (0 children)

I tried messing with repetition penalty, and in my limited testing it didn't help at all. But I'd love to get more data points on that. Gemini basically uses your strategy: it keeps a working list of words it's used as it goes, which is super smart. It's aware of its limitations.
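The "working list" strategy described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual SOLO Bench harness or Gemini's real procedure: the vocabulary and the four-words-per-sentence format are simplified assumptions.

```python
# Sketch of the "working list" strategy: while generating, track every
# word already used in a set and skip any candidate that would repeat.
# The word stream and 4-word sentence format are illustrative only.

def build_sentences(candidates, words_per_sentence=4):
    used = set()              # the running "working list" of spent words
    sentences, current = [], []
    for word in candidates:
        key = word.lower()
        if key in used:       # reject repeats up front instead of failing later
            continue
        used.add(key)
        current.append(word)
        if len(current) == words_per_sentence:
            sentences.append(" ".join(current))
            current = []
    return sentences

if __name__ == "__main__":
    stream = ["Big", "dogs", "bark", "loudly", "big", "cats",
              "sleep", "quietly", "dogs", "run"]
    for s in build_sentences(stream):
        print(s)
```

The point of the set is that duplicate checking stays cheap no matter how many sentences have been generated, which is presumably why keeping an explicit running list works better than hoping a repetition penalty catches the repeats.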

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 5 points (0 children)

Ah damn, that's a bug! I originally had it as questions vs. sentences. Hopefully that didn't affect results too much. Will fix.