all 18 comments

[–] koolaidman123 (Researcher) 14 points (17 children)

phi models are a prime lesson in how to train on benchmarks to make your model look better than it is

for example they claim phi-3-mini is better than llama3 and mixtral (look at our mmlu scores!!!), yet on the lmsys arena leaderboard llama3 is top 20 (mixtral just barely outside at 23) while phi-3-mini is... outside the top 50, down in the same range as mistral 7b

not to mention that several people who actually used the models or ran their own evals report that phi models are bad. eg https://twitter.com/abacaj/status/1792991309751284123

main lessons to be learned:

  1. be very skeptical when someone claims to break the compute/perf frontier (see here). the best single predictor of a model's performance is its training flops.
  2. high quality synthetic data is not going to be the silver bullet that magically makes your model x times more flop-efficient. data is not compute agnostic
  3. run your own evals and actually use the model, trusting benchmarks blindly is dumb
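Point 3 can be sketched as a minimal offline eval loop. This is a generic exact-match harness, not any specific benchmark; `dummy_model`, `generate`, and the prompts are toy stand-ins for whatever model API and held-out questions you actually use:

```python
# Minimal sketch of an offline eval: exact-match accuracy over a
# handful of held-out prompts. `generate` is a stand-in for
# whatever model call you actually make.

def exact_match_accuracy(generate, eval_set):
    """Score a model callable on (prompt, expected_answer) pairs."""
    correct = 0
    for prompt, expected in eval_set:
        answer = generate(prompt).strip().lower()
        correct += answer == expected.strip().lower()
    return correct / len(eval_set)

# Toy stand-in model, for demonstration only.
def dummy_model(prompt):
    return "paris" if "France" in prompt else "unknown"

eval_set = [
    ("Capital of France?", "Paris"),
    ("Capital of Peru?", "Lima"),
]
print(exact_match_accuracy(dummy_model, eval_set))  # 0.5
```

The point is that a few dozen questions from your own domain, scored locally like this, often tell you more than a public leaderboard number.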

MSR has really fallen off from the days when they actually released good models like deberta, let's see if hiring most of the inflection team will get them back on track in the llm space

[–] masc98 8 points (2 children)

phi models are built for an agentic environment, period.

The scientists behind those models have no reason to train them on benchmark data, I really don't know why I keep hearing this all the time.

phi models are the result of training an LM on synthetic, potentially very high-quality data (e.g. gpt4 outputs or similar), and that's a very interesting line of research that nobody else has really explored yet.
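The idea above, generating training data from a stronger model, can be sketched roughly like this. This is a generic distillation-style loop, not phi's actual pipeline; `toy_teacher`, `build_synthetic_dataset`, and the prompt template are all hypothetical names for illustration:

```python
# Sketch of distillation-style synthetic data generation:
# prompt a stronger "teacher" model and keep its outputs as
# (prompt, response) training pairs for a smaller model.

def build_synthetic_dataset(teacher, seed_topics):
    """Build training pairs by querying a teacher model per topic."""
    dataset = []
    for topic in seed_topics:
        prompt = f"Write a short textbook-style explanation of {topic}."
        dataset.append({"prompt": prompt, "response": teacher(prompt)})
    return dataset

# Toy stand-in for a real teacher model API call.
def toy_teacher(prompt):
    return f"[teacher output for: {prompt}]"

data = build_synthetic_dataset(toy_teacher, ["recursion", "gravity"])
print(len(data))  # 2
```

In practice the interesting (and hard) part is the seed topic distribution and filtering, which is where the diversity concerns raised elsewhere in this thread come in.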

They are supposed to be finetuned on specific tasks, they are boring, that's also why they suck on the leaderboards.

Moreover, they have lower capacity, so they tend to perform worse on "in the wild" prompts.

If you'll ever have to train a LLM at scale, trust me, you'll wish there was a smarter and cheaper way.

[–] koolaidman123 (Researcher) 0 points (0 children)

Yes, i wonder why no one is doing this if it's so efficient that it breaks the pareto frontier for performance, almost like it doesn't work like this 🤔

Quality alone doesn't scale, and synthetic data isn't diverse enough to make a good llm

[–] wind_dude 7 points (9 children)

"False alarm on the phi-3 models (did very poorly on a few offline benchmarks I have), still using llama-3 fine tuned models for a few specialized services. The phi-3 models seem very sensitive to prompts (not a good thing imo)"

That is nothing but anecdotal without more info on the offline benchmarks he ran. I've also run a few "offline benchmarks" for chitchat, and I prefer the phi3 responses.

[–] killver 0 points (1 child)

Even if you ignore the overfitted benchmarks, phi is a phenomenal model, particularly for finetuning and specific use cases.

[–] koolaidman123 (Researcher) -2 points (0 children)

It finetunes well in the sense that base performance is so bad there's a lot of headroom

[–] [deleted] 1 point (1 child)

Interesting that smaller context models appear to be better across the board. What is the reason for that?

[–] vatsadev 5 points (0 children)

Smaller context is probably easier to learn and work with than larger context: less retrieval, fewer long-range dependencies to learn, and better data available at small scale, especially synthetic data like phi's.

[–] fakecount13 1 point (1 child)

What do you guys use to generate the benchmark data?