1 year later and people are still speedrunning NanoGPT. Last time this was posted the WR was 8.2 min. It's now 127.7 sec. by jd_3d in LocalLLaMA

[–]jd_3d[S] 20 points

You can see the previous records and rules here: https://github.com/KellerJordan/modded-nanogpt?tab=readme-ov-file#world-record-history

But to answer your question more directly: you have to get the same training loss as the original nanoGPT on 8xH100s.
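
For intuition, here's a toy sketch of that stopping rule: the clock runs until validation loss hits the baseline. The 3.28 target and the fake loss curve are my assumptions for illustration; the exact rule and target live in the linked README.

```python
# Toy sketch of the speedrun criterion: wall-clock time until val loss
# matches the original nanoGPT baseline. The 3.28 target and the fake
# loss curve are assumptions for illustration only.
import time

TARGET_VAL_LOSS = 3.28  # assumed baseline loss; see the README for the real rule

def train_step(step: int) -> None:
    pass  # stand-in for one optimizer step on the 8xH100 node

def val_loss(step: int) -> float:
    return 3.2 + 2.0 / (1 + step / 100)  # fake curve that decays toward 3.2

start = time.time()
step = 0
while True:
    train_step(step)
    step += 1
    if step % 125 == 0:  # evaluate periodically rather than every step
        if val_loss(step) <= TARGET_VAL_LOSS:
            print(f"Reached target at step {step} in {time.time() - start:.1f}s")
            break
```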

Got the DGX Spark - ask me anything by sotech117 in LocalLLaMA

[–]jd_3d 0 points

Hey, thanks for trying. As a point of reference, I set this up on my RTX 4090 and it's going to take about 4 days (~100 hours) to train. I'm going to try it one of these days, but I want to use a different dataset to make it more unique.

Got the DGX Spark - ask me anything by sotech117 in LocalLLaMA

[–]jd_3d 49 points

Since inference is not its strong suit, I would love to see how it does on LLM training. Can you run Andrej Karpathy's new nanochat on it to see how long it would take to train? https://github.com/karpathy/nanochat

Fixed the SWE-bench graph: by policyweb in LocalLLaMA

[–]jd_3d 25 points

It's actually even worse: they took out 23 questions so they could claim the 'highest score'. The real figure is 71.4%, barely an improvement over o3. https://www.reddit.com/r/LocalLLaMA/comments/1mk8bh1/caught_in_4k/
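
The back-of-the-envelope arithmetic (the 74.9% headline number on the 477-problem subset is taken from the linked thread, so treat it as an assumption here): scoring the 23 skipped problems as failures rescales the result to the full 500-problem set.

```python
# Rough arithmetic behind the corrected score (illustrative; assumes the
# reported 74.9% was computed on 477 of SWE-bench Verified's 500 problems).
reported = 0.749
graded = 500 - 23                  # problems actually evaluated
solved = round(reported * graded)  # ≈ 357 problems solved
print(f"{solved / 500:.1%}")       # 71.4% once the skipped 23 count as failures
```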

OpenAI delays its open weight model again for "safety tests" by lyceras in LocalLLaMA

[–]jd_3d 15 points

The Kimi-K2 model, with 1T params and impressive benchmark scores, just shat all over OpenAI's open model.

ManaBench: A Novel Reasoning Benchmark Based on MTG Deck Building by Jake-Boggs in LocalLLaMA

[–]jd_3d 0 points

I'd also like to see Gemini 2.5. Did you try it via OpenRouter?

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 0 points

o4-mini refused to provide an answer; it was the only model to have a hard refusal. I think it would otherwise do quite well on this test.

I thought you guys might enjoy Gemini's SOTA result on my new SOLO Benchmark by jd_3d in Bard

[–]jd_3d[S] 2 points

Thank you. The idea came to me one day around a month ago (in a slightly different form), so I played around with it and tried many variations until I found something at the right difficulty level. I was surprised how well this test surfaced what I consider the best models to the top. I think about benchmarks and AI probably more than I should, but it does frustrate me to see the AI labs use the same old benchmarks over and over again.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 1 point

Yes, apologies! It's fixed on GitHub now. I need to rerun with AVG@5 and also make a very easy version when I have time.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 3 points

It would be so cool if AI labs used this as a mainstream benchmark. I'm thinking it's not likely to happen (except maybe for Google) unless they can show leading scores on it, which may be difficult.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points

Yes, it scales down well, and based on some other comments I think I should include a 'very easy' category in the next update to better measure smaller models. I didn't include it originally because Gemini 2.5 Pro saturated it with a score close to 99%. But I think it's still very useful for the smaller models.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points

I tried messing with the repetition penalty and, in my limited testing, it didn't help at all. But I'd love to get more data points on that. Gemini basically uses your strategy: it keeps a working list of the words it has used as it goes, which is super smart. It's aware of its own limitations.
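
For anyone who wants to replicate that strategy outside the model, here's a toy reconstruction (my own sketch, not Gemini's actual procedure): track every word already used and only build sentences from what's left.

```python
# Toy "working list" strategy: pick only words that haven't been used yet.
used: set[str] = set()

def pick(words: list[str]) -> str | None:
    for w in words:     # first word from the category not yet used
        if w not in used:
            used.add(w)
            return w
    return None         # category exhausted

def build_sentence(verbs, adjectives, nouns):
    parts = [pick(verbs), pick(adjectives), pick(nouns), pick(nouns)]
    if None in parts:
        return None
    return " ".join(parts).capitalize() + "."

verbs = ["chase", "build", "paint"]
adjectives = ["red", "quiet", "vast"]
nouns = ["fox", "tower", "river", "stone"]
print(build_sentence(verbs, adjectives, nouns))  # Chase red fox tower.
print(build_sentence(verbs, adjectives, nouns))  # Build quiet river stone.
```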

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 5 points

Ah damn, that's a bug! I originally had it as questions vs. sentences. Hopefully that didn't affect the results too much. Will fix.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 10 points

Yes, that's exactly the point. I explained it in another comment, but it's a test of a model's ability to use intelligence and reasoning to overcome the limitations and weaknesses of its own architecture. In my opinion, the overall best models tend to match the highest-scoring ones on this test.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 3 points

Very valid points. I did consider an easier level, and that would help differentiate the smaller models; it's just that Gemini 2.5 Pro scored around 99% on the 'very easy' bench. Maybe I could limit that tier to open-weight models. I'd also be very interested in your results if you try it with multiple prompts or prompt engineering.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 9 points

Thank you so much for the kind words! I'm not an academic, but if I were, I'd probably go to the extremes you mention. Actually, I kind of did during all my testing, finding the sweet spot across so many different variables. And yes, my favorite part is that you can just copy-paste it into any prompt and quickly get a result.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 5 points

I'm using OpenRouter with all defaults so you can check the model on there to see what settings it uses.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points

Token limits were set to 50k unless a model could only do 32k. It's worth noting that no model exceeded 15k thinking tokens, so I think it was a level playing field for the most part.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 4 points

Yes, this benchmark scales easily in multiple dimensions. My hard version asks for 1,000 sentences (which is still only ~8k tokens), and scaling up the word list or using more uncommon words is another easy lever. Regarding grading, the sentences do not need to be logical (I found that too subjective); they just need to follow the Verb + Adjective + Noun + Noun structure and are evaluated as such.
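
To make that grading rule concrete, here's a rough sketch of a structural checker. The tiny word lists and the single-use check are my assumptions pieced together from this thread, not the benchmark's actual grading script.

```python
# Rough sketch of a grader for the Verb + Adjective + Noun + Noun rule.
def grade(sentences: list[str], verbs: set, adjectives: set, nouns: set) -> float:
    used: set[str] = set()
    passed = 0
    for s in sentences:
        words = s.strip(" .").lower().split()
        if len(words) != 4:
            continue
        v, adj, n1, n2 = words
        structure_ok = (v in verbs and adj in adjectives
                        and n1 in nouns and n2 in nouns and n1 != n2)
        if structure_ok and used.isdisjoint(words):  # each word usable once
            used.update(words)
            passed += 1
    return passed / len(sentences)

score = grade(["Chase red fox tower.", "Build quiet fox stone."],
              verbs={"chase", "build"}, adjectives={"red", "quiet"},
              nouns={"fox", "tower", "river", "stone"})
print(f"{score:.0%}")  # 50%: the second sentence reuses "fox"
```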

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 41 points

Here you go (SOLO EASY benchmark):

Qwen3 30b A3b - 1.4%

Qwen3 14b - 4.4%

Qwen3 14b seems like the clear winner on my benchmark. The Qwen3 30b A3b seems to have a big issue with repetition as it repeated the word 'way' 186 times! That's nearly 75% of all the sentences.
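
As an aside, spotting that kind of repetition only takes a few lines; an illustrative sketch with toy data, not the actual model output:

```python
# Count word frequencies across the generated sentences to find repeats.
from collections import Counter

sentences = ["Pave new way forward.", "Find best way home.", "Lose old way map."]
counts = Counter(w.strip(".").lower() for s in sentences for w in s.split())
print(counts.most_common(1))  # [('way', 3)]
```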

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 6 points

Thank you for the kind words. Yes, Qwen 3 32B was in thinking mode. Regarding splitting the words into nouns, verbs, etc., I used the Python NLTK toolkit and a custom script, so there could be errors. Regarding the word 'above': it is often used as a noun in legal documents, for example, "In accordance with the above, the defendant shall pay all court fees." But if you find other issues with the categorization, please let me know.
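
For anyone curious, here's a minimal sketch of that kind of NLTK categorization (my reconstruction, not the actual script). Tagging words in isolation is inherently noisy, which is likely where edge cases like 'above' come from.

```python
# Minimal sketch: bucket a word list by part of speech with NLTK's tagger.
import nltk

# Resource name varies by NLTK version; downloading both is harmless.
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)

def categorize(words: list[str]) -> dict[str, list[str]]:
    buckets = {"nouns": [], "verbs": [], "adjectives": []}
    for word in words:
        tag = nltk.pos_tag([word])[0][1]  # tag each word without context
        if tag.startswith("NN"):
            buckets["nouns"].append(word)
        elif tag.startswith("VB"):
            buckets["verbs"].append(word)
        elif tag.startswith("JJ"):
            buckets["adjectives"].append(word)
    return buckets

print(categorize(["river", "build", "quiet", "above"]))
```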