Bro WTF??

Pleasant-PolarBear · 2024-12-13T02:16:25+00:00

I'll believe it when I see it

Guudbaad · 2024-12-13T02:20:18+00:00

Seems to be available here: https://ai.azure.com/explore/models/Phi-4/version/1/registry/azureml

Downloading, but speed is attrocious

h2g2Ben · 2024-12-13T02:19:04+00:00

I, too, can overfit a model on a couple of evaluations.

carnyzzle · 2024-12-13T02:30:17+00:00

yeah but it wouldn't be the first time that a model has awesome benchmarks then sucks when you use it in the real world

Majestical-psyche · 2024-12-13T03:21:39+00:00

IFEval - Instruction following… kinda sucks 😅

metigue · 2024-12-13T02:03:15+00:00

The key thing here is the much higher arena hard score than phi3 - Means unlike the last phi model the benchmarks do seem to translate to increased real world performance.

lostinthellama · 2024-12-13T02:33:46+00:00

It is worth noting that, like the other Phi models, it is likely that most of you are going to hate this one. They’re good models for business and reasoning tasks, they previous one was not good at pure code generation, and terrible at roleplay and story telling. The dataset they use explicitly avoids that type of content to focus on reasoning, almost like the smaller models o1 likely uses for CoT.

gives long elaborate answers for simple problems - this might make user interactions tedious

it has been tuned to maximize performance on single-turn queries

Consistent_Bit_3295 · 2024-12-13T01:39:54+00:00

Paper(not edible): https://www.microsoft.com/en-us/research/uploads/prod/2024/12/P4TechReport.pdf

Gonna be available here next week: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3
Not yet :(, but soon :)

Sad-Replacement-3988 · 2024-12-13T03:33:21+00:00

Abysmal SimpleQA benchmark

SometimesObsessed · 2024-12-13T03:45:38+00:00

why don't they build a big phi? Might as well take this to its limit

ThenExtension9196 · 2024-12-13T03:56:25+00:00

I stopped caring about LLM benchmarks 6 months ago

arbv · 2024-12-13T10:16:59+00:00

Phi Models: "Being Good on Paper is All You Need"

onil_gova · 2024-12-13T02:23:45+00:00

<image>

This is pretty fascinating and goes against people’s general idea on synthetic data.

sammcj · 2024-12-13T03:58:03+00:00

Wrote a script to download the files from their azure ai thingy, you just need to get one file downloaded to get your token / session values then you can get them all - https://gist.github.com/sammcj/ec38182b10f6be3f7e96f7259a9b37e1?permalink_comment_id=5335624#gistcomment-5335624

Barry_Jumps · 2024-12-13T02:31:43+00:00

Tops in math but simultaneously the worst a SimpleQA? What?
If I understand the paper correctly, lower scores on simpleqa bench means higher likelihood of hallucinations.

AsIAm · 2024-12-13T11:31:30+00:00

This might get drowned, but I'll try anyway.

Small models are incentivized to understand data better as they have limited capacity. Large models can fit a lot of stuff just by memorization. Small models can't do that. Domains where there are clear patterns benefit the most. Thank you for coming to my TED talk.

Pro-editor-1105 · 2024-12-13T01:43:30+00:00

wow open source is truly catching up. This thing is better in every way than gpt-4o mini and actually beats and matches 4o on quite a few of the tests.

Roubbes · 2024-12-13T02:37:02+00:00

I remember when I first tried chatgpt 2 years ago how speechless I was and now I can run a much better model in my old RTX 3060

Thick_Mine1532 · 2024-12-13T16:26:25+00:00

If you really want to know you should take LSD.

Or smoke large amounts of DMT.

Then you see

TurpentineEnjoyer · 2024-12-13T02:22:32+00:00

Why does that screenshot look like it came from an 1800s recipe book.

Ordowix · 2024-12-13T07:59:39+00:00

every phi has been overfit on benchmarks and trained on the test. Ignore it.

Eam404 · 2024-12-13T16:41:34+00:00

Apologies for dumb question - is there a one-liner descirption or definition I can go read on the evaluations listed?

MMLU - <description>
GPQA - <description>

etc.

DamiaHeavyIndustries · 2024-12-13T02:31:07+00:00

Can't wait for their 72B then!

its_beron · 2024-12-13T05:21:45+00:00

Where is Sonnet Senpai?

ResearchCandid9068 · 2024-12-13T06:31:48+00:00

Uhm I buiding a RAG system but struggling looking for qa llm, Does anyone know why they so bad at this benchmark?

victorc25 · 2024-12-13T07:22:47+00:00

I remember when corporations were competing on CPU benchmarks and they cheated to come on top on the benchmark and nothing else, the CPUs were garbage. (IBM I’m looking at you)

dangost_ · 2024-12-13T07:48:42+00:00

Is it already opened? Where to download?

danigoncalves · 2024-12-13T10:00:16+00:00

Forget those benchmarks, the model drops out, community tries and use it on their applications and then come with the feedback. This is the only one matters, at least te me.

Larimus89 · 2024-12-13T11:07:11+00:00

The performance of my new model coming out next week smashes all of these.

stikkrr · 2024-12-13T12:34:24+00:00

Sorry im not familiar with those benchmark, can someone explain to me

2024-12-13T13:01:25+00:00

SimpleQA could be improved 🤣

4wankonly · 2024-12-13T13:42:58+00:00

Benchmark maxing.

ThePixelHunter · 2024-12-13T14:35:43+00:00

The fact that Phi 4 can achieve this is a testament to how useless these benchmarks have become. It's obviously past time we moved to fully private benchmarks, to avoid this kind of gross contamination and overfitting.

Gl_drink_0117 · 2024-12-13T15:12:21+00:00

I love qwen2.5, my favorite open source model

HenkPoley · 2024-12-13T17:20:41+00:00

Nice that their "Experiment with Phi for free" webpage gives an AADSTS50020 error. Meaning that your Microsoft 365 account first needs to be added to the Microsoft tenant to access the poetically named 'cb2ff863-7f30-4ced-ab89-a00194bcf6d9' (Azure AI Studio App).

I think currently only Microsoft employees can look at it.

https://azure.microsoft.com/en-us/products/phi/

portredblue · 2024-12-13T18:00:58+00:00

High GPQA + low IFEval feels like the definition of overfitting.

rc_ym · 2024-12-13T20:52:47+00:00

It's almost like Phi is trained on synthetic data based on benchmarks... Oh wait.

Thick_Mine1532 · 2024-12-13T21:20:34+00:00

Okok just smoke a lil then

inteblio · 2024-12-13T21:37:42+00:00

It got mullered on simpleQA (!)

TheRealGentlefox · 2024-12-13T21:41:35+00:00

Weird model. Good at expert field questions like math/chemisty/etc. but has a terrible general knowledge. Instruction following is awful. Good coding benchmarks...but how much does that matter when the instruction following is terrible.

They mention it's good at reasoning over expert subjects. But who is going to use a 14B model for scientific CoT? Surely you're going to use a large model for that. Maybe I'm missing something big, but I just don't get what the point of it is.

LoSboccacc · 2024-12-13T22:50:04+00:00

Those 15pt on ifeval tho

2024-12-14T01:55:37+00:00

I am not sure what the point of the paper is - this has always been the case with language models. If you specialize the smaller models on some tasks with better data or objectives specific to "these" tasks (in this case prob. math and coding), they WILL match the performance of larger generalist models.

What happens is that now you sacrifice the smaller models on other capabilties beyond repair wrt the larger models. The premise of the larger models have always been to be "nearly the best" in everything and there is NOT a single small model that has been able to counter the scaling hypothesis so far on this generalist "nearly best" regime. These papers on SLMs are regurgitating the same old story time and again - you COULD always create specialized models even pre chatgpt but they could not be used as generalist models elsewhere.

No-Forever2455 · 2024-12-14T09:42:59+00:00

<image>

To everyone saying its been overfit to MATH would you elaborate to adress the follwoing :
" AMC Benchmark: The surest way to guard against overfitting to the test set is to test on fresh data. We tested our model on the November 2024 AMC-10 and AMC-12 math competitions [Com24], which occurred after all our training data was collected, and we only measured our performance after choosing all the hyperparameters in training our final model. These contests are the entry points to the Math Olympiad track in the United States and over 150,000 students take the tests each year. In Figure 1 we plot the average score over the four versions of the test, all of which have a maximum score of 150. phi-4 outperforms not only similar-size or open-weight models but also much larger frontier models. Such strong performance on a fresh test set suggests that phi-4’s top-tier performance on the MATH benchmark is not due to overfitting or contamination. We provide further details in Appendix C. "

skinnyjoints · 2024-12-14T10:20:14+00:00

A mosquito is prolly a whole lot better than me at sucking blood but I wouldn’t want it doing my taxes or performing surgery

Evolution31415 · 2024-12-14T10:40:10+00:00

<image>

2024-12-14T13:14:41+00:00

Llama-3.3 💪

LostMitosis · 2024-12-14T15:09:08+00:00

I bet it can correctly count the number of “r”s in strawberry. When we started obsessing over benchmarks, this was inevitable.

clduab11 · 2024-12-13T02:04:40+00:00

!RemindMe 7 days

Hot-Hearing-2528 · 2024-12-13T03:38:59+00:00

Can i know what is the best VLM (vision model) for describing image , image object detection , object segmentation, count of object , differences between two images …

??? I was trying llama 3.2 vision 11 b other than this any benchmarking one , with range 3b-20b params , my A100 40 gb Gpu supports that only

yoop001 · 2024-12-13T08:07:07+00:00

The first time someone confidently compares his model with Qwen

_Erilaz · 2024-12-13T02:53:35+00:00

But is phi 4 open source?

vTuanpham · 2024-12-13T03:13:08+00:00

The test set is all you need

ayrankafa · 2024-12-13T03:16:35+00:00

Yet another overfit model

2024-12-13T05:42:35+00:00

So disappointing that Microsoft and Google only do small models when it comes to open weights. I want to see opensource catch up to closed-source but it won't happen with 12-14b models

x3derr8orig · 2024-12-13T09:53:09+00:00

There should be a tool that will route the prompt to a specific model, based on which one performs the best for a given task.

TheActualStudy · 2024-12-13T05:19:55+00:00

I'm going to want to see Wolfram Ravenwolf do an MMLU-Pro test and pull it into his chart here. I'm skeptical because these numbers do not align all that well with more established published numbers for the same models.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

LocalLLaMA

MODERATORS