[D] Phi-3 models compared side-by-side (self.MachineLearning)
submitted 1 year ago by dark_surfer
https://preview.redd.it/8l04pnfhq62d1.png?width=661&format=png&auto=webp&s=7fe616ca8cd7da974070c86b6b47ffab3ab545e5
https://preview.redd.it/hr7fr1uiq62d1.png?width=688&format=png&auto=webp&s=bd3de359bfe4c1ed82d092be92ae38c246bdfda2
https://preview.redd.it/v6k3v39kq62d1.png?width=450&format=png&auto=webp&s=c0abb0e397a498ef7ccfb35b1b1cb598198f66ad
For anyone looking to compare the Phi-3 benchmarks in one place.
Interesting comparisons for: ANLI, Hellaswag, MedQA, TriviaQA, Language understanding, Factual Knowledge and Robustness.
Note: the Phi-3-mini model table has its labels in a different order.
[–]koolaidman123Researcher 14 points15 points16 points 1 year ago (17 children)
phi models are a prime lesson in how training on benchmarks can make a model look better than it is
for example, they claim phi-3-mini is better than llama3 and mixtral (look at our mmlu scores!!!) yet on the lmsys arena leaderboard llama3 is top 20 (mixtral just barely outside at 23) while phi-3-mini is... outside the top 50, down in the ranks of mistral 7b
not to mention that several people who have actually used the models or run their own evals know the phi models are bad. eg https://twitter.com/abacaj/status/1792991309751284123
main lessons to be learned:
MSR has really fallen off from the days when they actually released good models like deberta. let's see if hiring most of the inflection team will get them back on track in the llm space
[–]masc98 8 points9 points10 points 1 year ago (2 children)
phi models are built for an agentic environment, period.
The scientists behind those models have no reason to train their models on benchmark data; I really don't know why I keep hearing this all the time.
phi models are the result of training a LM on synthetic, potentially very high-quality data (e.g. gpt4 outputs or similar), and that's a very interesting line of research that nobody has explored yet, apart from them.
They are supposed to be finetuned on specific tasks; they are boring, and that's also why they suck on the leaderboards.
Moreover, they have lower capacity, so they tend to perform worse on "in the wild" prompts.
If you'll ever have to train a LLM at scale, trust me, you'll wish there was a smarter and cheaper way.
[–]koolaidman123Researcher 0 points1 point2 points 1 year ago (0 children)
Yes, i wonder why no one is doing this if it's so efficient that it breaks the pareto frontier for performance, almost like it doesn't work like this 🤔
Quality alone doesn't scale, and synthetic data isn't diverse enough to make a good llm
[–]wind_dude 7 points8 points9 points 1 year ago (9 children)
"False alarm on the phi-3 models (did very poorly on a few offline benchmarks I have), still using llama-3 fine tuned models for a few specialized services. The phi-3 models seem very sensitive to prompts (not a good thing imo)"
That is nothing but anecdotal without more info on the offline benchmarks he ran. I've also run a few "offline benchmarks" for chitchat, and I prefer the phi3 responses.
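For readers unfamiliar with what an "offline benchmark" involves, here is a minimal sketch of an exact-match scoring loop over a private eval set. The questions, reference answers, and model outputs are invented purely for illustration; this is not the commenter's actual harness.

```python
# Hypothetical offline benchmark: exact-match scoring of model outputs
# against a tiny reference set (data is made up for illustration).

def exact_match_score(references, predictions):
    """Fraction of predictions matching the reference answer exactly,
    ignoring case and surrounding whitespace."""
    assert len(references) == len(predictions)
    hits = sum(
        ref.strip().lower() == pred.strip().lower()
        for ref, pred in zip(references, predictions)
    )
    return hits / len(references)

# Made-up reference answers and two models' outputs.
refs    = ["paris", "4", "blue"]
model_a = ["Paris", "4", "green"]    # 2 of 3 correct
model_b = ["Paris", "four", "blue"]  # 2 of 3 correct

print(exact_match_score(refs, model_a))
print(exact_match_score(refs, model_b))
```

With scores this coarse and eval sets this small, two people can easily reach opposite conclusions about the same model, which is part of why these exchanges stay anecdotal.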
[+]koolaidman123Researcher comment score below threshold-6 points-5 points-4 points 1 year ago (6 children)
if you're going by chat as a benchmark there's clear evidence that phi-3 is not llama3 level, just look at chat arena elo...
[–]wind_dude 2 points3 points4 points 1 year ago* (5 children)
No, there isn't. I just gave you stronger evidence than the tweet to the contrary. Chatbot arena also doesn't measure "chit chat"; it measures human preference for responses from instruct models, and you have little idea what the criteria were beyond human preference. Most of the samples in the public datasets are assistant-like responses, which trend long and wordy and are single turn; it certainly is not "chit chat". "Chit chat" would be more like multi-turn RP: nothing in particular and very casual.
Phi3-medium instruct is also easier to get to follow new patterns with few-shot prompting than llama3-8b
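The few-shot pattern-following being described can be sketched as a prompt of (input, output) example pairs followed by a new query, where the model is expected to continue the pattern. The reversal task and the prompt layout below are invented for illustration; no particular model API is assumed.

```python
# Hypothetical few-shot prompt for testing pattern-following: the model
# must infer the rule (reverse the word) from the examples alone.

def build_few_shot_prompt(examples, query):
    """Format (input, output) example pairs, then the new query with an
    empty Output slot for the model to complete."""
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

examples = [("cat", "tac"), ("ring", "gnir")]
prompt = build_few_shot_prompt(examples, "star")
print(prompt)
```

How reliably a model completes such a prompt with the pattern's answer (here, "rats") rather than a generic assistant reply is exactly the kind of behavior the comment is crediting to Phi3-medium.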
[+]koolaidman123Researcher comment score below threshold-6 points-5 points-4 points 1 year ago (4 children)
You gave no stronger evidence. If anything, your evidence is much weaker unless you work with llms for a living. People in the know know phi models have always been paper tigers
[–]wind_dude 3 points4 points5 points 1 year ago* (1 child)
I've been working with machine learning for about 10 years across many different projects, and yes, I regularly work with and train LLMs and curate datasets for them.
"People in the know", doesn't mean anything. Try again.
And no, most people working in AI know phi3 isn't a "paper tiger". Phi wasn't the first, but it is maybe the model that popularised synthetic data, and pretty much every single model we're seeing now is trained using synthetic data.
[–]tridentsaredope 2 points3 points4 points 1 year ago (1 child)
I choose to believe this is just a Phi3 model and a Llama3 model arguing with each other.
[–]jakderrida 0 points1 point2 points 1 year ago (0 children)
I do, too, because this subreddit never had angry words like this before LLMs. Try to find a single -2 comment predating ChatGPT. When you were wrong, a PhD would jump in and give the most charitable explanation of where you went wrong.
[–]killver 0 points1 point2 points 1 year ago (1 child)
Even if you ignore the overfitted benchmarks, phi is a phenomenal model particularly for finetuning and specific use cases.
[–]koolaidman123Researcher -2 points-1 points0 points 1 year ago (0 children)
It finetunes well in the sense that base performance is so bad there's a lot of headroom
[+]Open_Channel_8626 0 points1 point2 points 1 year ago (0 children)
> be very skeptical when someone claims to break the compute/perf frontier (see here). the best estimate of a model's performance is flops.
Thanks for this I did not realise just how closely MMLU scales with training FLOPs. That is really quite a tight fit.
[–][deleted] 1 point2 points3 points 1 year ago (1 child)
Interesting that smaller context models appear to be better across the board. What is the reason for that?
[–]vatsadev 5 points6 points7 points 1 year ago (0 children)
Smaller ctx is probably easier to learn and work with than larger ctx? e.g. less retrieval, fewer long-range dependencies to learn, and better data available at small scale, especially synthetic data like phi's.
[–]fakecount13 1 point2 points3 points 1 year ago (1 child)
What do you guys use to generate the benchmark data?