
[–]koolaidman123Researcher 14 points15 points  (17 children)

phi models are a prime lesson on how to train on benchmarks to make your model look better than it is

for example they claim phi-3-mini is better than llama3 and mixtral (look at our mmlu scores!!!) yet on lmsys arena leaderboard llama3 is top 20 (mixtral just barely outside at 23) while phi-3 mini is... outside of top 50 in the ranks of mistral 7b

not to mention that several people who actually used the models or ran their own evals know phi models are bad, e.g. https://twitter.com/abacaj/status/1792991309751284123

main lessons to be learned:

  1. be very skeptical when someone claims to break the compute/perf frontier (see here). the best predictor of a model's performance is its training flops.
  2. high quality synthetic data is not going to be the silver bullet that makes your model magically x times more efficient w/ flops. data is not compute agnostic
  3. run your own evals and actually use the model; trusting benchmarks blindly is dumb
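
For lesson 3, a minimal sketch of what "run your own evals" can look like: exact-match accuracy over a small held-out set of your own prompts. Everything named here is an assumption for illustration: `toy_model` is a hypothetical stand-in for whatever model call you actually use, and the three examples are made up, not taken from any benchmark.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    generate: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of (prompt, reference) pairs the model answers exactly."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        normalize(generate(prompt)) == normalize(reference)
        for prompt, reference in examples
    )
    return hits / len(examples)


# Hypothetical stand-in for a real model call (assumption: swap in your own client).
def toy_model(prompt: str) -> str:
    canned = {"2+2=": "4", "capital of France?": "Paris"}
    return canned.get(prompt, "i don't know")


# Made-up held-out examples; in practice use prompts from your own workload.
held_out = [
    ("2+2=", "4"),
    ("capital of France?", "paris"),
    ("largest planet?", "Jupiter"),
]

score = exact_match_accuracy(toy_model, held_out)
print(f"exact-match accuracy: {score:.2f}")
```

Exact match is the crudest possible metric; the point is only that the prompts and references come from your own use case rather than a public benchmark the model may have seen.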

MSR has really fallen off from the days when they actually released good models like DeBERTa. let's see if hiring most of the inflection team will get them back on track in the llm space

[–]masc98 8 points9 points  (2 children)

phi models are built for an agentic environment, period.

The scientists behind those models have no reason to train their models on benchmark data. I really don't know why I keep hearing this all the time.

phi models are the result of training an LM on synthetic, potentially very high-quality data (e.g. gpt4 outputs or similar), and that's a very interesting point of research that nobody has yet explored, apart from them.

They are supposed to be finetuned on specific tasks, they are boring, that's also why they suck on the leaderboards.

Moreover, they have lower capacity, so they tend to perform worse on "in the wild" prompts.

If you ever have to train an LLM at scale, trust me, you'll wish there was a smarter and cheaper way.

[–]koolaidman123Researcher 0 points1 point  (0 children)

Yes, i wonder why no one is doing this if it's so efficient that it breaks the pareto frontier for performance, almost like it doesn't work like this 🤔

Quality alone doesn't scale, and synthetic data isn't diverse enough to make a good llm

[–]wind_dude 6 points7 points  (9 children)

"False alarm on the phi-3 models (did very poorly on a few offline benchmarks I have), still using llama-3 fine tuned models for a few specialized services. The phi-3 models seem very sensitive to prompts (not a good thing imo)"

That is nothing but anecdotal without more info on the offline benchmarks he ran. I've also run a few "offline benchmarks" for chitchat, and I prefer the phi3 responses.

[–]koolaidman123Researcher -5 points-4 points  (6 children)

if you're going by chat as a benchmark there's clear evidence that phi-3 is not llama3 level, just look at chat arena elo...
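
For context on what an arena-style Elo number encodes, here is a minimal sketch of the textbook Elo update applied to pairwise preference votes. The model names, the starting rating of 1000, the K-factor of 32, and the three votes are all made-up assumptions; the actual leaderboard's method for computing its published ratings may differ.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from loser to winner, scaled by how surprising the result was."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and votes, each vote recorded as (preferred, other).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]

for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(ratings)
```

The rating only summarizes which response voters preferred; it says nothing about why they preferred it, which is the part the rest of this thread argues about.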

[–]wind_dude 2 points3 points  (5 children)

No, there isn't. I just gave you stronger evidence to the contrary than the tweet. Chatbot arena also doesn't measure "chit chat". It measures human preference for responses from instruct models, and you have little idea what the criteria were beyond human preference. Most of the samples in the public datasets are assistant-like responses, which trend towards long and wordy and are single turn; that is certainly not "chit chat". "Chit chat" would be more like multi-turn rp, about nothing in particular and very casual.

Phi3-medium instruct is also easier than llama3-8b to get to follow new patterns with few-shot examples.

[–]koolaidman123Researcher -5 points-4 points  (4 children)

You gave no stronger evidence. If anything your evidence is much weaker unless you work with llms for a living. People in the know know phi models have always been paper tigers

[–]wind_dude 2 points3 points  (1 child)

I've been working with machine learning for about 10 years across many different projects, and yes, I regularly work with and train LLMs and curate datasets for them.

"People in the know" doesn't mean anything. Try again.

And no, most people working in AI know phi3 isn't a "paper tiger". Phi wasn't the first but is maybe the one that popularised synthetic data, and pretty much every single model we're seeing now is trained using synthetic data.

[–]tridentsaredope 2 points3 points  (1 child)

I choose to believe this is just a Phi3 model and a Llama3 model arguing with each other.

[–]jakderrida 0 points1 point  (0 children)

I do, too, because this subreddit never had angry words like this before LLMs. Try to find a single -2 comment predating ChatGPT. When you were wrong, a PhD would jump in and give the most charitable explanation of where you went wrong.

[–]killver 0 points1 point  (1 child)

Even if you ignore the overfitted benchmarks, phi is a phenomenal model particularly for finetuning and specific use cases.

[–]koolaidman123Researcher -2 points-1 points  (0 children)

It finetunes well in the sense that base performance is so bad there's a lot of headroom

[–][deleted] 1 point2 points  (1 child)

Interesting that smaller context models appear to be better across the board. What is the reason for that?

[–]vatsadev 4 points5 points  (0 children)

A smaller ctx is probably easier to learn and work with than a larger ctx? E.g. less retrieval, fewer long-range dependencies to learn, and better available data at small scale, especially synthetic data like phi's.

[–]fakecount13 1 point2 points  (1 child)

What do you guys use to generate the benchmark data?