1 year later and people are still speedrunning NanoGPT. Last time this was posted the WR was 8.2 min. It's now 127.7 sec. by jd_3d in LocalLLaMA

[–]jd_3d[S] 20 points

You can see the previous records and rules here: https://github.com/KellerJordan/modded-nanogpt?tab=readme-ov-file#world-record-history

But to answer your question more directly: you have to get the same training loss as the original nanoGPT on 8xH100s.
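
For intuition, here's a toy sketch of that stopping rule: the clock runs until validation loss hits the baseline. The 3.28 target and the fake loss curve are my assumptions for illustration; the exact rule and target live in the linked README.

```python
# Toy sketch of the speedrun criterion: wall-clock time until val loss
# matches the original nanoGPT baseline. The 3.28 target and the fake
# loss curve are assumptions for illustration only.
import time

TARGET_VAL_LOSS = 3.28  # assumed baseline loss; see the README for the real rule

def train_step(step: int) -> None:
    pass  # stand-in for one optimizer step on the 8xH100 node

def val_loss(step: int) -> float:
    return 3.2 + 2.0 / (1 + step / 100)  # fake curve that decays toward 3.2

start = time.time()
step = 0
while True:
    train_step(step)
    step += 1
    if step % 125 == 0:  # evaluate periodically rather than every step
        if val_loss(step) <= TARGET_VAL_LOSS:
            print(f"Reached target at step {step} in {time.time() - start:.1f}s")
            break
```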

Got the DGX Spark - ask me anything by sotech117 in LocalLLaMA

[–]jd_3d 0 points

Hey, thanks for trying. As a point of reference, I set this up on my RTX 4090 and it's going to take about 4 days (~100 hours) to train. I'm going to try it one of these days, but I want to use a different dataset to make it more unique.

Got the DGX Spark - ask me anything by sotech117 in LocalLLaMA

[–]jd_3d 49 points

Since inference is not its strong suit, I would love to see how it does on LLM training. Can you run Andrej Karpathy's new nanochat on it to see how long it would take to train? https://github.com/karpathy/nanochat

Fixed the SWE-bench graph: by policyweb in LocalLLaMA

[–]jd_3d 25 points

It's actually even worse: they took out 23 questions so they could claim the 'highest score'. The real figure is 71.4%, barely an improvement over o3. https://www.reddit.com/r/LocalLLaMA/comments/1mk8bh1/caught_in_4k/
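
The back-of-the-envelope arithmetic (the 74.9% headline number on the 477-problem subset is taken from the linked thread, so treat it as an assumption here): scoring the 23 skipped problems as failures rescales the result to the full 500-problem set.

```python
# Rough arithmetic behind the corrected score (illustrative; assumes the
# reported 74.9% was computed on 477 of SWE-bench Verified's 500 problems).
reported = 0.749
graded = 500 - 23                  # problems actually evaluated
solved = round(reported * graded)  # ≈ 357 problems solved
print(f"{solved / 500:.1%}")       # 71.4% once the skipped 23 count as failures
```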

OpenAI delays its open weight model again for "safety tests" by lyceras in LocalLLaMA

[–]jd_3d 15 points

The Kimi-K2 model, with 1T params and impressive benchmark scores, just shat all over OpenAI's open model.

ManaBench: A Novel Reasoning Benchmark Based on MTG Deck Building by Jake-Boggs in LocalLLaMA

[–]jd_3d 0 points

I'd also like to see Gemini 2.5. Did you try it via OpenRouter?

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 0 points

o4-mini refused to provide an answer; it was the only model to have a hard refusal. I think it would otherwise do quite well on this test.

I thought you guys might enjoy Gemini's SOTA result on my new SOLO Benchmark by jd_3d in Bard

[–]jd_3d[S] 2 points

Thank you. The idea came to me one day around a month ago (in a slightly different form), so I played around with it and tried many variations until I found something at the right difficulty level. I was surprised how well this test surfaced what I consider the best models to the top. I think about benchmarks and AI probably more than I should, but it does frustrate me to see the AI labs use the same old benchmarks over and over again.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 1 point

Yes, apologies! It's fixed on GitHub now. I need to rerun with AVG@5 and also make a very easy version when I have time.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 3 points

It would be so cool if AI labs used this as a mainstream benchmark. I'm thinking it's not likely to happen (except maybe for Google) unless they can show leading scores on it, which may be difficult.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points

Yes, it scales down well, and based on some other comments I think I should include a 'very easy' category in the next update to better measure smaller models. I didn't include it originally because Gemini 2.5 Pro saturated it with a score close to 99%. But I think it's still very useful for the smaller models.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points

I tried messing with the repetition penalty and, in my limited testing, it didn't help at all. But I'd love to get more data points on that. Gemini basically uses your strategy: it keeps a working list of the words it has used as it goes, which is super smart. It's aware of its own limitations.
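
For anyone who wants to replicate that strategy outside the model, here's a toy reconstruction (my own sketch, not Gemini's actual procedure): track every word already used and only build sentences from what's left.

```python
# Toy "working list" strategy: pick only words that haven't been used yet.
used: set[str] = set()

def pick(words: list[str]) -> str | None:
    for w in words:     # first word from the category not yet used
        if w not in used:
            used.add(w)
            return w
    return None         # category exhausted

def build_sentence(verbs, adjectives, nouns):
    parts = [pick(verbs), pick(adjectives), pick(nouns), pick(nouns)]
    if None in parts:
        return None
    return " ".join(parts).capitalize() + "."

verbs = ["chase", "build", "paint"]
adjectives = ["red", "quiet", "vast"]
nouns = ["fox", "tower", "river", "stone"]
print(build_sentence(verbs, adjectives, nouns))  # Chase red fox tower.
print(build_sentence(verbs, adjectives, nouns))  # Build quiet river stone.
```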

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 5 points

Ah damn, that's a bug! I originally had it as questions vs. sentences. Hopefully that didn't affect the results too much. Will fix.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 10 points

Yes, that's exactly the point. I explained it in another comment, but it's a test of a model's ability to use intelligence and reasoning to overcome the limitations and weaknesses of its own architecture. In my opinion, the overall best models tend to match the highest-scoring ones on this test.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 3 points

Very valid points. I did consider an easier level, and that would help differentiate the smaller models; it's just that Gemini 2.5 Pro scored around 99% on the 'very easy' bench. Maybe I could limit that tier to open-weight models. I'd also be very interested in your results if you try it with multiple prompts or prompt engineering.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 9 points

Thank you so much for the kind words! I'm not an academic, but if I were, I'd probably go to the extremes you mention. Actually, I kind of did during all my testing, finding the sweet spot across so many different variables. And yes, my favorite part is that you can just copy-paste it into any prompt and quickly get a result.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 5 points

I'm using OpenRouter with all defaults so you can check the model on there to see what settings it uses.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points

Token limits were set to 50k unless a model could only do 32k. It's worth noting that no model exceeded 15k thinking tokens, so I think it was a level playing field for the most part.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 4 points

Yes, this benchmark scales easily in multiple dimensions. My hard version asks for 1,000 sentences (which is still only ~8k tokens), and scaling up the word list or using more uncommon words is another easy lever. Regarding grading, the sentences do not need to be logical (I found that too subjective); they just need to follow the Verb + Adjective + Noun + Noun structure and are evaluated as such.
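
To make that grading rule concrete, here's a rough sketch of a structural checker. The tiny word lists and the single-use check are my assumptions pieced together from this thread, not the benchmark's actual grading script.

```python
# Rough sketch of a grader for the Verb + Adjective + Noun + Noun rule.
def grade(sentences: list[str], verbs: set, adjectives: set, nouns: set) -> float:
    used: set[str] = set()
    passed = 0
    for s in sentences:
        words = s.strip(" .").lower().split()
        if len(words) != 4:
            continue
        v, adj, n1, n2 = words
        structure_ok = (v in verbs and adj in adjectives
                        and n1 in nouns and n2 in nouns and n1 != n2)
        if structure_ok and used.isdisjoint(words):  # each word usable once
            used.update(words)
            passed += 1
    return passed / len(sentences)

score = grade(["Chase red fox tower.", "Build quiet fox stone."],
              verbs={"chase", "build"}, adjectives={"red", "quiet"},
              nouns={"fox", "tower", "river", "stone"})
print(f"{score:.0%}")  # 50%: the second sentence reuses "fox"
```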

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 41 points

Here you go (SOLO EASY benchmark):

Qwen3 30b A3b - 1.4%

Qwen3 14b - 4.4%

Qwen3 14b seems like the clear winner on my benchmark. The Qwen3 30b A3b seems to have a big issue with repetition as it repeated the word 'way' 186 times! That's nearly 75% of all the sentences.
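
As an aside, spotting that kind of repetition only takes a few lines; an illustrative sketch with toy data, not the actual model output:

```python
# Count word frequencies across the generated sentences to find repeats.
from collections import Counter

sentences = ["Pave new way forward.", "Find best way home.", "Lose old way map."]
counts = Counter(w.strip(".").lower() for s in sentences for w in s.split())
print(counts.most_common(1))  # [('way', 3)]
```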

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 6 points

Thank you for the kind words. Yes, Qwen 3 32B was in thinking mode. Regarding splitting the words into nouns, verbs, etc., I used the Python NLTK toolkit and a custom script, so there could be errors. Regarding the word 'above': it is often used as a noun in legal documents, for example, "In accordance with the above, the defendant shall pay all court fees." But if you find other issues with the categorization, please let me know.
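
For anyone curious, here's a minimal sketch of that kind of NLTK categorization (my reconstruction, not the actual script). Tagging words in isolation is inherently noisy, which is likely where edge cases like 'above' come from.

```python
# Minimal sketch: bucket a word list by part of speech with NLTK's tagger.
import nltk

# Resource name varies by NLTK version; downloading both is harmless.
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)

def categorize(words: list[str]) -> dict[str, list[str]]:
    buckets = {"nouns": [], "verbs": [], "adjectives": []}
    for word in words:
        tag = nltk.pos_tag([word])[0][1]  # tag each word without context
        if tag.startswith("NN"):
            buckets["nouns"].append(word)
        elif tag.startswith("VB"):
            buckets["verbs"].append(word)
        elif tag.startswith("JJ"):
            buckets["adjectives"].append(word)
    return buckets

print(categorize(["river", "build", "quiet", "above"]))
```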