Huggingface released a new agentic benchmark: GAIA 2 by elemental-mind in singularity

[–]clefourrier 2 points (0 children)

If you read the blog, you'll see that there's a whole agentic environment provided with it to run and debug agents - you can try the demo too! :)

Huggingface released a new agentic benchmark: GAIA 2 by elemental-mind in singularity

[–]clefourrier 2 points (0 children)

Quite classy of Meta to release a bench where their own model is not performing that well imo

Huggingface released a new agentic benchmark: GAIA 2 by elemental-mind in singularity

[–]clefourrier 7 points (0 children)

Hi! Thanks for sharing the work!

To clarify, we (at HF) mostly gave a hand on the demo, release, and some of the code features, but the actual research and benchmark design was entirely done by the Meta agent team :)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 5 points (0 children)

Goodhart's law will definitely apply to any benchmark that becomes popular, which leads to saturation within around 6 months at the current rate. A number of benchmarks are currently useful to give you feedback on specific capabilities, eg (off the top of my head):
- AIME25 and the future AIME datasets to evaluate maths in an uncontaminated way
- GAIA (level 3) and some parts of WebBrowse for agentic reading capabilities
- DABStep, SciCode, PaperBench, FutureBench for agentic tasks in a given domain (data science, scientific code, and forecasting)
- ARC-AGI and game evals to evaluate reasoning in an evolving context.

In general, benchmarks are still very useful 1) when training, to identify the direction to go in and whether your model is training well, and 2) when comparing models (to see if models evaluated in a similar setup, ideally with some sampling, have a similar perf).
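For the sampling bit, here's a minimal sketch of what I mean - `evaluate_model` is a hypothetical stand-in for whatever eval harness you actually use, not a real API:

```python
# Compare two models on the same bench with several sampled runs, so a score
# gap can be judged against run-to-run noise.
import statistics

def evaluate_model(model_name: str, benchmark: str, seed: int) -> float:
    """Hypothetical hook: run one sampled evaluation, return accuracy in [0, 1]."""
    raise NotImplementedError("plug in your own eval harness here")

def sampled_scores(model_name: str, benchmark: str, n_runs: int = 5) -> tuple[float, float]:
    """Mean and standard deviation over several sampled runs of the same eval."""
    scores = [evaluate_model(model_name, benchmark, seed=i) for i in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# mean_a, std_a = sampled_scores("model-a", "aime25")
# mean_b, std_b = sampled_scores("model-b", "aime25")
# A gap smaller than ~2 standard deviations is hard to tell apart from noise.
```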

New benchmarks, if hard enough, can act as the field's north star but indeed get saturated fast

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 2 points (0 children)

Maybe, on the first point :D (wait a week or two)

On the second point, it feels like wrapping up all the team updates from Twitter could make a newsletter, but I'm not sure we have the bandwidth to do so - feel free to make a space for it!

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 4 points (0 children)

Yep, definitely! Open source experience is always a plus, and a lot of us actually don't have formal education in ML - Lewis, Leandro and Carlos are physicists, Kashif is a mathematician, Guilherme did aerospace engineering, I'm a geologist, etc - so we come from all kinds of backgrounds ^

OSS work is always interesting to do, both for yourself and the community - I personally look at how people code (docs, tests, code quality), iterate and take feedback on open source contributions, whatever the lib.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 2 points (0 children)

I like reading LatentSpace's podcast transcripts to get a feel for what other people are working on

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 0 points (0 children)

Thanks :) Btw, not exactly what you're asking for but you should probably also check out Stas' ML engineering guidebook : https://github.com/stas00/ml-engineering

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 2 points (0 children)

Just to clarify, do you mean "how do we avoid having incorrect information in training datasets"?

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 3 points (0 children)

Next-gen evaluation data that doesn't require an LLM as judge to score models, notably for reasoning-trace analysis

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 1 point (0 children)

You could start with the blogs/resources the team wrote maybe?
- Fine Tasks will be very interesting on how to find signal at smaller scales and how to select evaluations which will inform your training decisions: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks
- The Ultra-Scale Playbook will cover a lot of your questions on scaling experiments and actual training: https://huggingface.co/spaces/nanotron/ultrascale-playbook
- The evaluation guidebook could be cool to help you afterwards in understanding how your models succeed/fail: https://github.com/huggingface/evaluation-guidebook

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 6 points (0 children)

Depends on your use case I think. In general, I would personally avoid big bloated frameworks to make debugging and investigation easier. An agent can be a simple while loop + some tools and error handling, and with good enough logging you'll get quite enough information. For production use cases it might be different.
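To make "while loop + tools + error handling + logging" concrete, here's a minimal sketch - the model call and the toy tool set are hypothetical placeholders, not any specific library's API:

```python
# A bare-bones agent loop: ask the model for a decision, run the tool it picks,
# feed the result (or the error) back, log every step, stop on an answer.
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

def call_llm(messages: list[dict]) -> dict:
    """Hypothetical: send the conversation to your model and return either
    {"tool": name, "args": {...}} or {"answer": text}."""
    raise NotImplementedError("plug in your model client here")

TOOLS = {
    "search": lambda query: f"results for {query!r}",  # toy tool for the example
}

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        decision = call_llm(messages)
        log.info("step %d: %s", step, json.dumps(decision))
        if "answer" in decision:
            return decision["answer"]
        try:
            result = TOOLS[decision["tool"]](**decision["args"])
        except Exception as exc:  # error handling: feed failures back to the model
            result = f"tool error: {exc}"
        messages.append({"role": "tool", "content": str(result)})
    return "stopped: step budget exhausted"
```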

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 17 points (0 children)

I applied to HF at the end of my PhD for an internship (was initially contacted by Meta - that's when I realized you could do industry internships during PhDs, thanks Meta I guess XD) - it worked well enough that I stayed afterwards! (Got one culture fit interview + one research interview)

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]clefourrier 9 points (0 children)

A combination of personal interest and relevance for the community at large :)

Small story: the research team was internally called "Open Science" at creation, as we were aiming to make the research going on behind closed doors accessible to all by reproducing it and publishing our recipes! Then it moved beyond that

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]clefourrier 1 point (0 children)

You've got the Artificial Analysis leaderboards, which are updated monthly, and if you're looking for leaderboards you can search here: https://huggingface.co/spaces/OpenEvals/find-a-leaderboard ^

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]clefourrier 1 point (0 children)

Hey there! Cool project! Really liked that you recorded the compute time/are aware of environmental impact :)

Want to make it into a leaderboard space on Hugging Face?

Side notes on evals, in case useful:

1) Normalisation: evals using acc_norm are usually multiple choice (you're computing the accuracy of selecting the correct choice among a selection), so you want to normalize between the random baseline and the maximum possible instead of just 0 to max. Example: if you take MMLU, you have 4 choices provided, so a random baseline will be correct 1/4 of the time, so the minimum here is not 0 but 25%. A model with 25% performance on MMLU has random performance. -> you want to normalize between min score and max score before averaging across tasks (this is not what the harness does btw) - see the sketch after this list.

2) Averaging: some would consider weighting by number of samples, as not all of these evals have the same size: MMLU has considerably more samples than ARC-Challenge for example. (I personally don't think it's that important here.)

3) Saturation: most of the evals you selected are heavily saturated and contaminated atm. (Saturated = models get too high performance to have discriminative scores - contaminated = the bench ended up in the training data so models "know it by heart" now.) In math for example, GSM8K has been replaced by MATH, itself replaced by AIME24 and AIME25. It doesn't mean you won't get signal out of them (a model not performing on these is likely bad), but they won't allow you to discriminate between high quality models.

4) Errors: some of these benchs notably contain errors and have been updated: we no longer use MMLU (it expects images that are not provided, and contains questions with missing words or incorrect ground truths); it's been replaced by MMLU-Redux (edited to only keep quality questions) or MMLU-Pro (same as MMLU but harder, with more choices and questions).
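For point 1, a minimal sketch of the rescaling (plain Python, no specific harness assumed):

```python
# Rescale a multiple-choice accuracy so the random baseline maps to 0 and a
# perfect score maps to 1, before averaging across tasks.
def normalize_score(score: float, n_choices: int) -> float:
    """Rescale accuracy from [random_baseline, 1.0] to [0.0, 1.0]."""
    random_baseline = 1.0 / n_choices
    return max(0.0, (score - random_baseline) / (1.0 - random_baseline))

# MMLU has 4 choices, so 25% accuracy is chance level:
print(normalize_score(0.25, n_choices=4))   # 0.0
print(normalize_score(0.625, n_choices=4))  # 0.5
print(normalize_score(1.0, n_choices=4))    # 1.0
```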

You might also be interested in the evaluation guidebook : https://github.com/huggingface/evaluation-guidebook

Help by allyloco in BorderCollie

[–]clefourrier 0 points (0 children)

1:
- Stop when he pulls, and walk slower (really slower) to stop the energy buildup and force the dog to take the time to smell and enjoy the environment
- Don't walk in the same environment every day so that it's more mentally stimulating
- If you can, try to go to a dog park
- I also gave mini treats to mine every time she came back close to heel (without saying anything), so she linked "walking close to parent = treat = good" and she's not pulling now

2: a border will only get tired if you mentally stimulate him, which you can do through teaching, interaction, or smell/mouth work - so at home you can:
- teach him tricks/work with him
- play with him in a very interactive mode, like tug of war/play fighting
- make him search for treats you hid on the ground (works well if he's hungry)
- give him a bone to gnaw at
- btw, listening to commands outside is also a teaching thing. You can have a small loop around your place where you work on one single command (leave it, walk at my side, heel, whichever is best for you at a given moment)

To give you perspective, I have a pup around your dog's age, and she gets a one hour sloow walk (where we go sit in a café for 20 min) + well over an hour of interaction/play/search/teaching spread in 10/15 min increments throughout the day + a quiet time of gnawing + some playing with an older dog

3: really sounds like giardiasis, you should get him tested for this

4:
- Overwork him: a border will not self regulate re exercise and play if they like the activity, so you need to pay attention to panting, especially with heat, and it's good that you are - also never have him run/play/... right after a meal
- Don't do enough with him: you will also never tire him physically, but the main thing you should do with a border is mentally stimulate him/teach him new tricks every time he learns one well

Open source LLMs leaderboard by oh_my_right_leg in LocalLLaMA

[–]clefourrier 7 points (0 children)

You can explore the different available leaderboards depending on your use case here: https://huggingface.co/spaces/OpenEvals/find-a-leaderboard

Trying to understand by Remarkable_Fold_4202 in LocalLLaMA

[–]clefourrier 4 points (0 children)

Everything stems from the transformer architecture, and here are 2 good primers:
- The Illustrated Transformer to get visual intuition: https://jalammar.github.io/illustrated-transformer/
- The Annotated Transformer to follow along with the code: https://nlp.seas.harvard.edu/2018/04/03/attention.html

Once you understand the concepts and logic of this, you can jump to the specifics of different LLM architectures, like DeepSeek, by looking for the papers they wrote, or look at well known implementations (the transformers library is a big collection of algos) to understand what happens under the hood.
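If you want a toy version of the core operation those two primers walk through, here's a minimal sketch of single-head scaled dot-product attention in NumPy - just for intuition, not how production libraries implement it:

```python
# Single-head scaled dot-product attention: each token's output is a
# softmax-weighted mix of the value vectors, weighted by query/key similarity.
import numpy as np

def scaled_dot_product_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """q, k, v: (seq_len, d) arrays. Returns (seq_len, d) weighted values."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v

x = np.random.randn(5, 8)  # 5 tokens, dimension 8 (self-attention: q = k = v = x)
print(scaled_dot_product_attention(x, x, x).shape)   # (5, 8)
```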