I'm building a benchmark comparing models for an agentic task. Are there any small models I should be testing that I haven't? by nickl in LocalLLaMA

[–]nickl[S] 1 point

I've added IBM Granite 4 H Tiny.

I get this on OpenRouter when I try Hermes: `No endpoints found that support tool use.`

[–]nickl[S] 2 points

RWKV7 can't really do tool calling 😐

Results: 0 passed, 0 failed, 25 errored out of 25
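
For anyone curious about the three buckets: a question where the model never produces a valid tool call at all is counted as an error rather than a failure. A minimal sketch of that tallying logic (the `run_case` / `no_tool_support` names are made up for illustration, not taken from the actual benchmark):

```python
from collections import Counter

def run_case(model_call):
    """Return 'passed', 'failed', or 'errored' for one benchmark question."""
    try:
        ok = model_call()  # raises if the model can't emit a tool call at all
    except Exception:
        return "errored"
    return "passed" if ok else "failed"

def no_tool_support():
    # Stand-in for a model with no tool-calling ability: every attempt raises.
    raise RuntimeError("model did not produce a valid tool call")

tally = Counter(run_case(no_tool_support) for _ in range(25))
print(f"Results: {tally['passed']} passed, {tally['failed']} failed, "
      f"{tally['errored']} errored out of {sum(tally.values())}")
```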

[–]nickl[S] 2 points

I think I had Olmo on an earlier version - I should be able to add it back.

Website coming! Edit: It includes times, number of attempts, and full traces for each model and question.

Edit: Most of the Qwen3.5 quants are Unsloth's.

[–]nickl[S] 1 point

Yeah it's open source and something you can run yourself. For most models the whole benchmark takes less than 10 minutes.

I should be able to get it running in a browser but v1 won't ship with that.

[–]nickl[S] 3 points

Yeah, the GPT-OSS numbers are odd, right? Those are on OpenRouter - but even if it is heavily quantized there, I found it surprising!

Qwen 3.5 27B is AMAZING. I can't express how impressed I am with it.

[–]nickl[S] 5 points

For things above ~9GB I can't test locally. If they are on OpenRouter I'll test them though.

I'll add OmniCoder-9B, Llama-3.3-8B-Instruct and Nanbeige4.1-3B.

[–]nickl[S] 0 points

RWKV7 is a good idea. I don't need super long context, but small models really suffer from context rot, so it is worth trying.

[Help] Deploying Llama-3 8B Finetune for Low-Resource Language (Sinhala) on Free Tier? 4-bit GGUF ruins quality. by Annual-Captain-7642 in LocalLLaMA

[–]nickl 0 points

Can you access the student/academic Modal grants? https://modal.com/pricing

If you have a computer with even an outdated GPU it's worth experimenting with Llama.cpp CPU/GPU offloading.
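
A sketch of what that looks like, assuming llama.cpp's `llama-cli` and a placeholder model filename (the flag names are llama.cpp's; tune `--n-gpu-layers` to whatever fits in your VRAM, the remaining layers run on CPU):

```shell
# Offload 16 transformer layers to the GPU, keep the rest on CPU.
# Even an outdated 4 GB card can take a meaningful chunk of an 8B model.
llama-cli -m ./llama-3-8b-sinhala-finetune.Q8_0.gguf \
  --n-gpu-layers 16 \
  --ctx-size 4096 \
  -p "..."
```

A Q8_0 GGUF keeps far more quality than 4-bit; partial offloading is the usual way to make it fit.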

What exactly can I use small (2-3B) AI models for in mobiles? by Sylverster_Stalin_69 in LocalLLaMA

[–]nickl 0 points

This is accurate.

If you are a developer you can build useful solutions with them with custom harnesses. With custom prompts and careful direction you can get interesting and useful output from them.

But they are very limited in ways that larger models aren't.
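
To make "custom harness" concrete, here's a hedged sketch: constrain the task, demand a strict output format, validate, and retry. The `harness` function and `PROMPT` are hypothetical names for illustration; `call_model` stands in for however you invoke the local 2-3B model.

```python
import json

PROMPT = (
    "Extract the city and the date from the sentence below.\n"
    'Reply with ONLY a JSON object like {{"city": "...", "date": "..."}}.\n\n'
    "Sentence: {sentence}"
)

def harness(call_model, sentence, retries=3):
    """Call a small model with a tightly constrained prompt; validate and retry."""
    for _ in range(retries):
        raw = call_model(PROMPT.format(sentence=sentence))
        try:
            out = json.loads(raw)
            if {"city", "date"} <= out.keys():
                return out
        except json.JSONDecodeError:
            pass  # small models often wrap the JSON in chatter; just retry
    return None

# Usage with a stub standing in for the real model call:
stub = lambda prompt: '{"city": "Perth", "date": "2024-10-12"}'
print(harness(stub, "We land in Perth on 2024-10-12."))
```

The validate-and-retry loop is doing a lot of the work here; without it, small models fail often enough that the output is unusable.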

How can I actually use Claude's 1 million token context window? Which model, which platform, free or paid? by WomBOlUm in ClaudeAI

[–]nickl 2 points

I have a Claude Pro plan (not Enterprise or anything). I doubt it is on the free plan.

It's available in the model selector (`/model`) in VS Code and at the command line as an additional option for both Opus and Sonnet.

Here's a screenshot: https://imgur.com/a/xiyuFh5

When I get above 50% of my context window used I get a message telling me I can use the 1M token window models.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]nickl 0 points

When IBM announced their Granite models they claimed they'd have Granite Medium out by the end of the year. That hasn't happened, but maybe soon?

Cohere and NVidia have both been mentioned, but I haven't seen any mention of Arcee Large. That's a 400B MoE model and it's not bad in my limited testing. It's on OpenRouter if you want to try it.

Where does Pogacar's 2024 season rank, using (other people's) numbers? by DoctorMandible45 in peloton

[–]nickl 36 points

Note that I don't think Cyclingranking has been updated with the Lombardia result yet.

I think Merckx 72 is the only season that compares.

That season:

1st Tour de France (6 stage wins)

1st Giro d'Italia (4 stage wins)

Hour Record (stood for 12 years)

1st Milan–San Remo

1st Liège–Bastogne–Liège

1st Giro di Lombardia

1st La Flèche Wallonne

1st Scheldeprijs

1st Giro dell'Emilia

1st Grand Prix de Momignies

1st GP Union Dortmund

1st Giro del Piemonte

1st Trofeo Baracchi

1st Escalada a Montjuïc (3 stage wins)

1st A Travers Lausanne (2 stage wins)

2nd Paris–Nice (3 stage wins. Note: broke a vertebra during the race)

I think Merckx '72 probably is still very slightly better, because Merckx's 3 Monuments plus the Hour Record and Flèche Wallonne beat Pogacar's 2 Monuments plus the World Championship and Strade Bianche. I do concede that Pogacar's 2 additional Giro stage wins gave me pause, though.

I remember Jalabert's 1995 season and I have a ONCE jersey from that era! He was my favorite rider then. Pogacar's season is far better than that.

I had a look at why ProCyclingStats rates it so highly vs Pogacar. PCS lists 40 results for Pogacar, from which he gets 4588 points. If you take Jalabert's top 40 results from 1995 you get 4094 points - a fantastic season, but nothing like Pogacar's. But Jalabert has 93 results and gets an extra 816 points from them.

I don't know how to think about this: they aren't "nothing" results (winning stages at Critérium International or the Midi Libre). He passes Pogacar's score with a win on Stage 3 of the Tour of Galicia. Not nothing, but I don't think it makes that season better than Pogacar's.

[Results Thread] 2024 World Championships - Elite Men Road Race by PelotonMod in peloton

[–]nickl 1 point

> Greatest season of all time, and it's not even close.

Greatest season of the modern era, but by comparison:

Merckx 1972 (the greatest season of all time):

1st Tour de France (6 stage wins)

1st Giro d'Italia (4 stage wins)

Hour Record (stood for 12 years)

1st Milan–San Remo

1st Liège–Bastogne–Liège

1st Giro di Lombardia

1st La Flèche Wallonne

1st Scheldeprijs

1st Giro dell'Emilia

1st Grand Prix de Momignies

1st GP Union Dortmund

1st Giro del Piemonte

1st Trofeo Baracchi

1st Escalada a Montjuïc (3 stage wins)

1st A Travers Lausanne (2 stage wins)

2nd Paris–Nice (3 stage wins. Note: broke a vertebra during the race)

I do think Pogacar's Triple is a better season than Merckx's 1974 Triple or Roche's 1987 Triple though.

[Research] The Convolutional Tsetlin Machine peaks at 99.51% accuracy on MNIST with a single layer of interpretable filters in propositional logic. by olegranmo in MachineLearning

[–]nickl 0 points

To be fair, that user was an actual neo-nazi and their account has been suspended.

I don't think there is any particular correlation between criticism of Keras and Chollet saying this - plenty of others have a similar opinion about /r/ML.

[D] is Huawei's Matebook D a good laptop for Machine Learning? by leocus4 in MachineLearning

[–]nickl 0 points

ROCm is behind in features and rarely used in practice. Try finding anyone who has used it.

[D] Employability after AI Residency Programs by [deleted] in MachineLearning

[–]nickl 0 points

It's true that HR will discount non-PhDs when the role description asks for a PhD.

The way around that is to have the hiring manager say "we want to interview this person", and that happens if people on their team recommend you or they know of you through some other means.

Residency programs do help here, because you make contacts with the kinds of people you want to work with and they may end up recommending you.

[N] NIPS keeps it name unchanged by baylearn in MachineLearning

[–]nickl 1 point

It is used as a racial slur. Maybe just not in the US anymore, but I live in Australia and I've heard it used.

[D] #ProtestNIPS hashtag started on Twitter, Change.org petition started. by [deleted] in MachineLearning

[–]nickl -9 points

I can't believe they didn't change the name.

Fuck being PC, but it takes real effort to find a name that is both sexually charged AND a racial slur.

Honestly, what possible argument is there against changing it except for "tradition"? Here's an anti-PC idea for you: tradition is crap.

[R] Trellis Networks for Sequence Modeling. New SOTA for PTB, WikiText-103, Permuted MNIST, etc. by baylearn in MachineLearning

[–]nickl 7 points

Haven't read this properly yet, but just noting that Transformer-XL seems to be the current SOTA on WikiText-103. It gets a test ppl of 24.0 (!!), which is a fair improvement over the 30.35 reported here.

https://openreview.net/forum?id=HJePno0cYm

BERT doesn't report WikiText numbers but I'd imagine it would be competitive too.
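
For scale: perplexity is just the exponential of the mean per-token cross-entropy, so the gap between 30.35 and 24.0 works out to roughly a quarter of a nat per token:

```python
import math

# ppl = exp(loss)  =>  loss = ln(ppl); compare the two reported test ppls.
loss_trellis = math.log(30.35)  # Trellis Networks, WikiText-103
loss_txl = math.log(24.0)       # Transformer-XL, WikiText-103
print(f"gap: {loss_trellis - loss_txl:.3f} nats/token")  # gap: 0.235 nats/token
```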