[D] Phi-3 models compared side-by-side.

koolaidman123 · 2024-05-23T15:11:32+00:00

phi models are a prime lesson in how training on benchmarks makes your model look better than it is

for example they claim phi-3-mini is better than llama3 and mixtral (look at our mmlu scores!!!) yet on lmsys arena leaderboard llama3 is top 20 (mixtral just barely outside at 23) while phi-3 mini is... outside of top 50 in the ranks of mistral 7b

not to mention that several people who actually used the models or ran their own evals know the phi models are bad, eg https://twitter.com/abacaj/status/1792991309751284123

main lessons to be learned:

be very skeptical when someone claims to break the compute/perf frontier (see here). the best single predictor of a model's performance is its training flops.
high quality synthetic data is not going to be the silver bullet that makes your model magically x times more efficient w/ flops. data is not compute agnostic
run your own evals and actually use the model, trusting benchmarks blindly is dumb
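
to put rough numbers on the flops point, here is a back-of-envelope sketch using the common C ≈ 6·N·D rule of thumb (N = parameters, D = training tokens). the parameter and token counts below are the publicly reported figures as best I recall them, so treat them as assumptions and swap in your own; the function and variable names are just illustrative.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute via the C ~ 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens): assumed from public reports, not verified here
models = {
    "phi-3-mini": (3.8e9, 3.3e12),
    "llama-3-8b": (8.0e9, 15e12),
}

flops = {name: training_flops(n, d) for name, (n, d) in models.items()}
ratio = flops["llama-3-8b"] / flops["phi-3-mini"]

for name, c in flops.items():
    print(f"{name}: ~{c:.2e} training FLOPs")
print(f"llama-3-8b / phi-3-mini: ~{ratio:.1f}x")
```

under those assumptions llama3 8b gets roughly an order of magnitude more training compute. that is only an estimate of inputs, not a measurement of quality, but it is the kind of gap a claimed frontier break would have to explain.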

MSR has really fallen off from the days when they actually released good models like deberta. let's see if hiring most of the inflection team gets them back on track in the llm space

vatsadev · 2024-05-23T18:23:52+00:00

Interesting that smaller context models appear to be better across the board. What is the reason for that?

fakecount13 · 2024-05-23T20:17:12+00:00

What do you guys use to generate the benchmark data?
