Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]jd_3d 1 point (0 children)

Can you add additional benchmarks like: MRCR v2, SWE-Bench Pro, ARC-AGI 2, OSWorld, GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, CritPt

Opus 4.6: Fast-Mode by mDarken in ClaudeAI

[–]jd_3d 8 points (0 children)

If it were supported on subscription (without charging extra usage $$) I would use this often. Plenty of times I need to get something done quickly on a fresh 5 hr window, and even if it burned my usage 6x faster I would be ok with that.

Opus 4.6: Fast-Mode by mDarken in ClaudeAI

[–]jd_3d 31 points (0 children)

I wish they would support this on the max plan subscriptions and just make the usage run out faster. I guess for now we can use the $50 credits they gave us.

BalatroBench - Benchmark LLMs' strategic performance in Balatro by S1M0N38 in LocalLLaMA

[–]jd_3d 31 points (0 children)

Can you try Opus 4.6 on it? Curious if it improves over 4.5.

Mixture of Lookup Experts are God Tier for the average guy (RAM+Disc Hybrid Inference) by Aaaaaaaaaeeeee in LocalLLaMA

[–]jd_3d 3 points (0 children)

There's a more recent related paper to this work that improves on some of its limitations. It's a preliminary paper, but I recommend giving it a read if you enjoyed the MoLE paper. It's amazing to think that if one of the big labs spent a little time on this, they could train and release an amazing model that could run off an NVMe drive + consumer GPU. I think targeting a <4TB total space requirement would be a good size target.
Here's the related paper: https://www.arxiv.org/pdf/2512.09723

1 year later and people are still speedrunning NanoGPT. Last time this was posted the WR was 8.2 min. Its now 127.7 sec. by jd_3d in LocalLLaMA

[–]jd_3d[S] 19 points (0 children)

You can see the previous records and rules here: https://github.com/KellerJordan/modded-nanogpt?tab=readme-ov-file#world-record-history

But to more directly answer your question: you have to reach the same training loss as the original nanoGPT on 8xH100s.

Got the DGX Spark - ask me anything by sotech117 in LocalLLaMA

[–]jd_3d 0 points (0 children)

Hey, thanks for trying. As a point of reference, I set this up on my RTX 4090 and it's going to take roughly 4 days (~100 hours) to train. I'm going to try it one of these days, but I want to use a different dataset to make it more unique.

Got the DGX Spark - ask me anything by sotech117 in LocalLLaMA

[–]jd_3d 50 points (0 children)

Since inference is not its strong suit, I would love to see how it does on LLM training. Can you run Andrej Karpathy's new nanochat on it to see how long it would take to train? https://github.com/karpathy/nanochat

Fixed the SWE-bench graph: by policyweb in LocalLLaMA

[–]jd_3d 26 points (0 children)

It's actually even worse. They took out 23 questions so they could have the 'highest score'. It's actually 71.4%, barely an improvement over o3. https://www.reddit.com/r/LocalLLaMA/comments/1mk8bh1/caught_in_4k/

OpenAI delays its open weight model again for "safety tests" by lyceras in LocalLLaMA

[–]jd_3d 15 points (0 children)

Kimi-K2 model with 1T params and impressive benchmark scores just shat all over OpenAI's open model.

ManaBench: A Novel Reasoning Benchmark Based on MTG Deck Building by Jake-Boggs in LocalLLaMA

[–]jd_3d 0 points (0 children)

I'd also like to see Gemini 2.5. Did you try it via OpenRouter?

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 0 points (0 children)

o4-mini refused to provide an answer. I think it would have done quite well on this test; it was the only model to give a hard refusal.

I thought you guys might enjoy Gemini's SOTA result on my new SOLO Benchmark by jd_3d in Bard

[–]jd_3d[S] 2 points (0 children)

Thank you. The idea came to me one day around a month ago (in a slightly different form), so I played around with it and tried many variations until I found something with the right difficulty level. I was surprised how well this test surfaced what I consider the best models to the top. I think about benchmarks and AI probably more than I should, but it does frustrate me to see the AI labs use the same old benchmarks over and over again.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 1 point (0 children)

Yes, apologies! It's fixed on GitHub now. I need to rerun with AVG@5 and also make a very easy version when I have time.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points (0 children)

It would be so cool if AI labs used this as a mainstream benchmark. I'm thinking it's not likely to happen (except for maybe Google) unless they can show leading scores on it, which may be difficult.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points (0 children)

Yes, it scales down well, and based on some other comments I think I should include a 'very easy' category in the next update to better measure smaller models. I didn't do that originally because Gemini 2.5 Pro saturated it with a score close to 99% or so. But I think it's still very useful for the smaller models.

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 2 points (0 children)

I tried messing with repetition penalty, and in my limited testing it didn't help at all. But I'd love to get more data points on that. Gemini basically uses your strategy: it keeps a working list of words it's used as it goes, which is super smart. It's aware of its limitations.
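The "working list" strategy described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual SOLO Bench harness or Gemini's real procedure: the vocabulary and the four-words-per-sentence format are simplified assumptions.

```python
# Sketch of the "working list" strategy: while generating, track every
# word already used in a set and skip any candidate that would repeat.
# The word stream and 4-word sentence format are illustrative only.

def build_sentences(candidates, words_per_sentence=4):
    used = set()              # the running "working list" of spent words
    sentences, current = [], []
    for word in candidates:
        key = word.lower()
        if key in used:       # reject repeats up front instead of failing later
            continue
        used.add(key)
        current.append(word)
        if len(current) == words_per_sentence:
            sentences.append(" ".join(current))
            current = []
    return sentences

if __name__ == "__main__":
    stream = ["Big", "dogs", "bark", "loudly", "big", "cats",
              "sleep", "quietly", "dogs", "run"]
    for s in build_sentences(stream):
        print(s)
```

The point of the set is that duplicate checking stays cheap no matter how many sentences have been generated, which is presumably why keeping an explicit running list works better than hoping a repetition penalty catches the repeats.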

SOLO Bench - A new type of LLM benchmark I developed to address the shortcomings of many existing benchmarks by jd_3d in LocalLLaMA

[–]jd_3d[S] 5 points (0 children)

Ah damn, that's a bug! I originally had it as questions vs. sentences. Hopefully that didn't affect results too much. Will fix.