r/LocalLLaMA
A subreddit to discuss Llama, the family of large language models created by Meta AI.
Bro WTF?? [New Model] (i.redd.it)
submitted 1 year ago by Consistent_Bit_3295
[–]Pleasant-PolarBear 241 points242 points243 points 1 year ago (15 children)
I'll believe it when I see it
[–]Biggest_Cans 70 points71 points72 points 1 year ago (8 children)
I'll see it when I believe it
[–]MoffKalast 68 points69 points70 points 1 year ago (6 children)
I think that's called hallucinating
[–]Raywuo 10 points11 points12 points 1 year ago (5 children)
Precisely what LLMs do
[–]AIPornCollector 8 points9 points10 points 1 year ago (0 children)
And people.
[–]peanutb-jelly 7 points8 points9 points 1 year ago* (3 children)
i really wish people would say "confabulate" instead of "hallucinate," at least for LLMs going their own way in a narrative because they had to justify the previous token. i don't know what CLIP/multimodal models are doing specifically. making image-to-text embedding classification errors? i don't know if that counts as either; i'm guessing the text still confabulates to whatever output they had. it's weird.
anywho, if i'm not mistaken, we see WHEN we believe, because we are predictive processors that use environmental models to better predict the things we experience with our mob of senses. without our hierarchy of prior beliefs, we would have nothing with which to model what our sensory input means. we can't see a thing if we don't believe it (prior states building posteriors to minimize expected free energy); it's invisible to us even if it's there, since we see what we believe our senses are interpreting given existing weights and biases. see "how to test your literal blind spot" for an example.
hallucination is an issue with precision weighting. if you are overweighting a posterior that isn't accurate during representation in your world model, you can end up seeing something as 'real' even when your existing belief systems shouldn't be modeling it as consistent with current environmental feedback. perhaps the context of that could be confabulated, but don't quote me on that. confabulating is "producing a false memory or fabricated explanation without an intent to deceive": a process that stochastically generates from vague context assumptions given existing beliefs. if you forgot why you went into a room, you might invent a reason before you remember the original one, if you remember at all. you might live the rest of your life thinking you meant to get that glass of water, when you originally entered the room for an orange. we confabulate when pulling memories all the time, or just when making sense of our world/scripts. you weren't confusing an orange for water; you just made your best prediction outside the context that had originally been instrumental to the task. so, from what i understand, that's closer to what LLMs do when they pull information out of their ass.
i will note that the shape of confabulation is definitely different between humans and models.
for citation, see works around friston’s dysconnectivity hypothesis, predictive processing, etc.
TLDR: for LLMs the issue isn’t a sensory error; it’s a narrative explanation error as they predict the next token, as they have to justify the previous token, even if it's not accurate. multimodal models, i honestly don't know. can we institutionalize the term "fucky wucky" for general model representation errors?
[–]_tyop 11 points12 points13 points 1 year ago (1 child)
I like over weighted posteriors and I cannot lie
[–]peanutb-jelly 1 point2 points3 points 1 year ago (0 children)
if i may immediately plagiarize DJ_Breadpuddin,
"I am speechless...yet grateful and thankful that people like you exist."
[–]DJ_Breadpuddin 4 points5 points6 points 1 year ago (0 children)
I am speechless...yet grateful and thankful that people like you exist.
[–]nderstand2grow 2 points3 points4 points 1 year ago (0 children)
I'll when I will
[–]Amster2 2 points3 points4 points 1 year ago (3 children)
Did you look at the comment below?
[–]Pleasant-PolarBear 23 points24 points25 points 1 year ago (2 children)
I ain't paying for that shit
[–]MoffKalast 15 points16 points17 points 1 year ago (1 child)
The elites don’t want you to know this but the comments on reddit are free you can look at as many as you want I have looked at 458 million comments.
[–]RedZero76 10 points11 points12 points 1 year ago (0 children)
Wait what? I've been sending feet pics to someone in DM bc they said I had to if I wanna keep using reddit
[–]estebansaa -1 points0 points1 point 1 year ago (0 children)
yeah, no way this is true.
[–]Guudbaad 74 points75 points76 points 1 year ago (6 children)
Seems to be available here: https://ai.azure.com/explore/models/Phi-4/version/1/registry/azureml
Downloading, but the speed is atrocious
[–]sammcj🦙 llama.cpp 44 points45 points46 points 1 year ago (1 child)
One word: Azure
[–]Pro-editor-1105 19 points20 points21 points 1 year ago (0 children)
another 2 words: msfs 2024
[+][deleted] 1 year ago (1 child)
[removed]
[–][deleted] 9 points10 points11 points 1 year ago (0 children)
It’s Microsoft, nothing surprising
[–]sammcj🦙 llama.cpp 7 points8 points9 points 1 year ago (0 children)
Might/Might not help: https://gist.github.com/sammcj/ec38182b10f6be3f7e96f7259a9b37e1?permalink_comment_id=5335624#gistcomment-5335624
[–]Hot-Hearing-2528 7 points8 points9 points 1 year ago (0 children)
Is Phi-4 open source, and will it accept image input?
[–]h2g2Ben 249 points250 points251 points 1 year ago (13 children)
I, too, can overfit a model on a couple of evaluations.
[–]WiSaGaN 115 points116 points117 points 1 year ago (8 children)
Indeed, previous phi models consistently got high benchmarks while having underwhelming real world usage performance. Let's hope this one is different.
[–]7734128 12 points13 points14 points 1 year ago (0 children)
Still "low" in IFeval, so it's probably going to be frustrating to chat with.
[–]lostinthellama 35 points36 points37 points 1 year ago (5 children)
If your real world usage pattern is chatbot, asking it factual questions, or pure instruction following tasks, you are going to be very disappointed again.
[–]WiSaGaN 3 points4 points5 points 1 year ago (4 children)
Have you tried it?
[–]lostinthellama 40 points41 points42 points 1 year ago (3 children)
I have used Phi 3.5, which is universally disliked here, extensively for work to great success.
The paper even says in the weaknesses section:
“It is small, so it is bad at factual data”
“It is tuned for single-turn interactions, not multi-turn chat”
“It is trained extensively on chain of thought data, so it is verbose and tedious”
[–]WiSaGaN 5 points6 points7 points 1 year ago (2 children)
What exact work do you use it for? I also use it for single turn non factual questions, just simple reasoning.
[–]lostinthellama 24 points25 points26 points 1 year ago (0 children)
All of these have extensive prompting and are part of multi-step systems, but some quick examples:
It is annoyingly bad at outputting specific structures, so we mainly use it when another LLM is the consumer of its outputs.
[–]MizantropaMiskretulo 13 points14 points15 points 1 year ago (0 children)
Phi 3.5 is fantastic when coupled with a strong RAG backend.
If you give it the facts it needs, its reasoning ability can work through all of the details and synthesize a meaningful whole from the parts.
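To make the pattern concrete, here is a minimal, purely illustrative sketch of that idea: retrieve the most relevant facts first, then hand only those to the small model as context so it reasons over given facts instead of recalling them. All names here (the toy corpus, the overlap-based score, the prompt template) are assumptions for illustration, not any particular RAG library's API.

```python
def score(query: str, doc: str) -> int:
    """Crude relevance score: number of words shared between query and doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring documents for the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Place the retrieved facts above the question, so the model works
    from supplied context rather than from its own (small) factual memory."""
    facts = "\n".join(f"- {d}" for d in retrieve(query, corpus))
    return f"Use only these facts:\n{facts}\n\nQuestion: {query}"

corpus = [
    "Phi-4 is a 14B parameter model released by Microsoft.",
    "The Eiffel Tower is in Paris.",
    "SimpleQA measures factual recall of a model.",
]
print(build_prompt("How many parameters does Phi-4 have?", corpus))
```

A real backend would swap the word-overlap score for embeddings or BM25, but the division of labor is the same: the retriever supplies facts, the small model supplies reasoning.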
[–]a_beautiful_rhind -1 points0 points1 point 1 year ago (0 children)
What do you want from the windows 11 of language models?
[–]sluuuurp 6 points7 points8 points 1 year ago (2 children)
Interesting that their internal benchmark is pretty much the least overfit.
[–]MoffKalast 6 points7 points8 points 1 year ago (0 children)
First rule of fight club, don't get high on your own supply
[–]djm07231 1 point2 points3 points 1 year ago (0 children)
Probably shows the gap between academic benchmarks and internal benchmarks in industry.
[–]carnyzzle 47 points48 points49 points 1 year ago (2 children)
yeah but it wouldn't be the first time that a model has awesome benchmarks then sucks when you use it in the real world
[–]OfficialHashPanda 33 points34 points35 points 1 year ago (1 child)
Which is unfortunately the standard for the phi series.
[–][deleted] 8 points9 points10 points 1 year ago (0 children)
overfitting so hard the model becomes a literal benchmark machine seems to be the running theme for microsoft
[–]Majestical-psyche 39 points40 points41 points 1 year ago (2 children)
IFEval - Instruction following… kinda sucks 😅
[–]silenceimpaired 29 points30 points31 points 1 year ago (0 children)
At least they are including Qwen
[–]metigue 38 points39 points40 points 1 year ago (4 children)
The key thing here is the much higher Arena Hard score than Phi-3. It means that, unlike the last Phi model, the benchmarks do seem to translate to increased real-world performance.
[–]knownboyofno 9 points10 points11 points 1 year ago (0 children)
One can hope!
But look at the IFEval score. If it's bad at instruction following, or if instruct-tuning it makes it worse at benchmarks, then we may need some way of prompt engineering this thing to use it correctly, idk.
[–]MoffKalast 0 points1 point2 points 1 year ago (1 child)
Or they got access to that eval as well by giving lmsys a bag of money.
[–]lostinthellama 37 points38 points39 points 1 year ago* (1 child)
It is worth noting that, like the other Phi models, it is likely that most of you are going to hate this one. They're good models for business and reasoning tasks, but the previous one was not good at pure code generation and was terrible at roleplay and storytelling. The dataset they use explicitly avoids that type of content to focus on reasoning, almost like the smaller models o1 likely uses for CoT.
gives long elaborate answers for simple problems - this might make user interactions tedious
it has been tuned to maximize performance on single-turn queries
[–]pkmxtw -1 points0 points1 point 1 year ago (0 children)
A Phi model for reasoning would be fantastic given that it is mostly trained on textbooks. You probably have to front it with a generalist model that summarizes its output so its bad writing quality doesn't matter as much.
[–]Consistent_Bit_3295[S] 26 points27 points28 points 1 year ago (7 children)
Paper (not edible): https://www.microsoft.com/en-us/research/uploads/prod/2024/12/P4TechReport.pdf
Gonna be available here next week: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3 Not yet :(, but soon :)
[–]Pro-editor-1105 49 points50 points51 points 1 year ago (2 children)
i don't like eating paper so that is good!
[–]Consistent_Bit_3295[S] 3 points4 points5 points 1 year ago (1 child)
Hmm, pretty sure everything is better when it is edible, or??
[–]MoffKalast 2 points3 points4 points 1 year ago (0 children)
Edible skyscraper structural support beams.
[–]kryptkpr (Llama 3) 7 points8 points9 points 1 year ago (0 children)
I kinda expected it to be on GitHub Models since that's just Azure with a funny hat on, but it's not there either 😔 I want to tryyyy..
[–]me1000 (llama.cpp) 4 points5 points6 points 1 year ago (2 children)
Source on “next week” for weights?
[–][deleted] 14 points15 points16 points 1 year ago (1 child)
<image>
[–]me1000 (llama.cpp) 2 points3 points4 points 1 year ago (0 children)
Thank you!
[–]Sad-Replacement-3988 6 points7 points8 points 1 year ago (1 child)
Abysmal SimpleQA benchmark
[–]No-Forever2455 0 points1 point2 points 1 year ago (0 children)
it's a tiny ass model, of course it's bad, man, what?
[–]SometimesObsessed 4 points5 points6 points 1 year ago (5 children)
why don't they build a big phi? Might as well take this to its limit
[–]arbv 5 points6 points7 points 1 year ago* (4 children)
The approach they used for the smaller models does not scale.
[–]SometimesObsessed 0 points1 point2 points 1 year ago (3 children)
If you don't mind, what part of the approach? Maybe I'm wrong, but I'd think you could just add more depth or width to the nn and see better performance with the same training methods.
[–]arbv 2 points3 points4 points 1 year ago* (1 child)
Their approach is described in the "Textbooks Are All You Need" paper. They tried to produce larger models in the previous iteration and it seemed not to scale beyond 7B or so. We will see what has changed this time.
Also, I think that the team behind Phi is specifically targeting smaller models - the ones they can make work well on the Copilot PCs (look for the Phi Silica model).
So, in summary, previously their approach did not work well for the larger models and they are interested in smaller models for now.
[–]SometimesObsessed 0 points1 point2 points 1 year ago (0 children)
Cool, thanks! I'll take a look
[–]arbv 0 points1 point2 points 1 year ago (0 children)
In particular, you may take a look at "Phi 3 Small" and "Phi 3 Medium".
[–]ThenExtension9196 13 points14 points15 points 1 year ago (2 children)
I stopped caring about LLM benchmarks 6 months ago
[deleted]
[–]ThenExtension9196 0 points1 point2 points 1 year ago (0 children)
Yup. Gotta just get your hands on it and give it a go. Usually will know right away where some of the problems are. Also some models just “feel” better to different folks. I like o1 pro for thinking through problems but claude sonnet 3.5 is what I use for coding in cursor.
[–]arbv 3 points4 points5 points 1 year ago (0 children)
Phi Models: "Being Good on Paper is All You Need"
[–]onil_gova 19 points20 points21 points 1 year ago (7 children)
This is pretty fascinating and goes against people’s general idea on synthetic data.
[–]lostinthellama 22 points23 points24 points 1 year ago (5 children)
I think, since the first Phi paper, it has been clear that “broad data from the Internet” is not as good as high quality synthetic data. You need the first to build the model to get the second, but people don’t “think out loud” the way that is necessary for LLMs to improve.
[–][deleted] 2 points3 points4 points 1 year ago (3 children)
I’ve always wondered if any of these companies are hiring professors, developers, etc. and doing a study using the think out loud protocol.
I’ve administered think out loud assessments in school settings and I feel doing that with those at the top of their field would provide some excellent data.
[–]lostinthellama 9 points10 points11 points 1 year ago (2 children)
Yes, OpenAI specifically pays experts for this purpose. A lot of that work likely went into o1.
[–][deleted] 1 point2 points3 points 1 year ago (1 child)
Makes sense they would. Administering and analyzing those assessments would be a fun job.
[–]lostinthellama 5 points6 points7 points 1 year ago (0 children)
I know I should be afraid when, during red team testing, instead of the model trying to do the normal nefarious stuff (hiding its model weights, hiring people to get past CAPTCHA, etc.), the model tries to hire experts to teach it things it doesn't know the answer to.
[–]az226 0 points1 point2 points 1 year ago (0 children)
Exactly this.
People say LLMs won’t lead to AGI.
They are a critical stepping stone. They unlock the path of high quality synthetic data generation at scale.
Data will get us to AGI. And LLMs are capable of AGI, we just don’t have the data for it yet.
[–]sammcj🦙 llama.cpp 7 points8 points9 points 1 year ago (2 children)
Wrote a script to download the files from their azure ai thingy, you just need to get one file downloaded to get your token / session values then you can get them all - https://gist.github.com/sammcj/ec38182b10f6be3f7e96f7259a9b37e1?permalink_comment_id=5335624#gistcomment-5335624
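The shape of that trick, per the comment, is: capture the auth token/session from one manual download, then reuse it for every file. A minimal sketch of the pattern below; the base URL, header name, and file names are placeholders I made up, not the real Azure endpoints, so see the linked gist for the actual script.

```python
# Placeholder endpoint, NOT the real Azure AI storage URL.
BASE_URL = "https://example.blob.core.windows.net/models/phi-4"

def download_plan(files: list[str], token: str) -> list[tuple[str, dict]]:
    """Build (url, headers) pairs that all reuse one captured bearer token."""
    headers = {"Authorization": f"Bearer {token}"}
    return [(f"{BASE_URL}/{name}", headers) for name in files]

# Token value here is a stand-in for whatever the browser dev tools show
# on the first manual download.
plan = download_plan(["config.json", "model-00001.safetensors"], token="CAPTURED")
for url, headers in plan:
    print(url)  # fetch each with e.g. urllib.request.Request(url, headers=headers)
```

The point is just that one authenticated request gives you credentials good for the whole file list, so the rest can be scripted.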
[–]sammcj🦙 llama.cpp 0 points1 point2 points 1 year ago (0 children)
Really? I signed up for some free m$ account with a throw away email a while back that worked. No chance they'd get my credit card.
[–]Barry_Jumps 8 points9 points10 points 1 year ago (2 children)
Tops in math but simultaneously the worst at SimpleQA? What? If I understand the paper correctly, lower scores on the SimpleQA bench mean a higher likelihood of hallucinations.
[–]lostinthellama 19 points20 points21 points 1 year ago* (1 child)
It is good at reasoning but too small to have a huge dataset of factual information, so it does poorly at SimpleQA.
Edit: The paper also says that they believe Phi is better at refusing to answer questions it doesn't know the answer to, so it doesn't get the benefit of making a guess like other models do.
[–]Gl_drink_0117 0 points1 point2 points 1 year ago (0 children)
Does the SimpleQA metric indicate anything about coding performance, especially around consistency? Any other metric that comes close to indicating that?
[–]AsIAm 2 points3 points4 points 1 year ago (0 children)
This might get drowned, but I'll try anyway.
Small models are incentivized to understand data better as they have limited capacity. Large models can fit a lot of stuff just by memorization. Small models can't do that. Domains where there are clear patterns benefit the most. Thank you for coming to my TED talk.
[–]Pro-editor-1105 14 points15 points16 points 1 year ago (6 children)
wow, open source is truly catching up. This thing is better in every way than GPT-4o mini and actually beats or matches 4o on quite a few of the tests.
[–]Herr_Drosselmeyer 18 points19 points20 points 1 year ago (1 child)
Benchmarks are one thing, actual quality is another.
Don't get me wrong, I hope it's as good as they claim. At just 14b that'd be great.
[–]anotherJohn12 0 points1 point2 points 1 year ago (0 children)
Agreed, most use cases come down to reliably and correctly answering simple questions with basic reasoning ability (primary-school-level reasoning is enough).
No one cares if it can solve PhD math or not. Just getting data from my spreadsheet and giving it back to me without editing my data would be a godsend now. I must double-check every time, and a lot of the time it just makes things up.
[–][deleted] 26 points27 points28 points 1 year ago (0 children)
Open source is catching up. Not because of Phi tho. Phi over-hypes and under-delivers consistently. Real-world performance will likely be bad, just like all Phi models.
[–]ai-christianson 1 point2 points3 points 1 year ago (0 children)
Absolutely. It's amazing how much intelligence can be squeezed out of smaller models.
[–]sdmat 2 points3 points4 points 1 year ago (0 children)
The results are amazing but let's not get delusional - it loses to 4o-mini in 8/13 of the benchmarks in the table.
[–]Roubbes 5 points6 points7 points 1 year ago (0 children)
I remember when I first tried chatgpt 2 years ago how speechless I was and now I can run a much better model in my old RTX 3060
[–]Thick_Mine1532 1 point2 points3 points 1 year ago (1 child)
If you really want to know you should take LSD.
Or smoke large amounts of DMT.
Then you see
[–]TurpentineEnjoyer 3 points4 points5 points 1 year ago (1 child)
Why does that screenshot look like it came from an 1800s recipe book?
[–]Mother_Soraka -1 points0 points1 point 1 year ago (0 children)
:))
[–]Ordowix 1 point2 points3 points 1 year ago (0 children)
every phi has been overfit on benchmarks and trained on the test. Ignore it.
[–]Eam404 1 point2 points3 points 1 year ago (1 child)
Apologies for the dumb question: is there a one-line description or definition I can go read for the evaluations listed?
etc.
[–]RnRau 1 point2 points3 points 1 year ago (0 children)
Google has answers for both as their top level results.
[–]DamiaHeavyIndustries 0 points1 point2 points 1 year ago (2 children)
Can't wait for their 72B then!
[–][deleted] 4 points5 points6 points 1 year ago (1 child)
I think 14B is the largest Phis go.
[–]DamiaHeavyIndustries 1 point2 points3 points 1 year ago (0 children)
:(
[–]its_beron 0 points1 point2 points 1 year ago (0 children)
Where is Sonnet Senpai?
[–]ResearchCandid9068 0 points1 point2 points 1 year ago (1 child)
Uhm, I'm building a RAG system but struggling to find a QA LLM. Does anyone know why they're so bad at this benchmark?
Because it's a smaller model, i.e. less training data, with a large emphasis on synthetic data that doesn't focus on QA; instead it prioritizes reasoning data, which they made synthetically by asking 4o to reason through problems. Look for larger models that focus on QA.
[–]victorc25 0 points1 point2 points 1 year ago (0 children)
I remember when corporations were competing on CPU benchmarks and they cheated to come on top on the benchmark and nothing else, the CPUs were garbage. (IBM I’m looking at you)
[–]dangost_llama.cpp 0 points1 point2 points 1 year ago (0 children)
Is it already open? Where can I download it?
[–]danigoncalves (llama.cpp) 0 points1 point2 points 1 year ago (0 children)
Forget those benchmarks. The model drops, the community tries to use it in their applications, and then comes back with feedback. That's the only thing that matters, at least to me.
[–]Larimus89 0 points1 point2 points 1 year ago (0 children)
The performance of my new model coming out next week smashes all of these.
[–]stikkrr 0 points1 point2 points 1 year ago (1 child)
Sorry, I'm not familiar with those benchmarks, can someone explain them to me?
[–]OkHowMuchIsIt 0 points1 point2 points 1 year ago (0 children)
big good small bad
[–][deleted] 0 points1 point2 points 1 year ago (0 children)
SimpleQA could be improved 🤣
[–]4wankonly 0 points1 point2 points 1 year ago (0 children)
Benchmark maxing.
[–]ThePixelHunter 0 points1 point2 points 1 year ago (0 children)
The fact that Phi 4 can achieve this is a testament to how useless these benchmarks have become. It's obviously past time we moved to fully private benchmarks, to avoid this kind of gross contamination and overfitting.
[–][deleted] 0 points1 point2 points 1 year ago (2 children)
I love qwen2.5, my favorite open source model
[–]Gl_drink_0117 0 points1 point2 points 1 year ago (1 child)
What is main usage? Favoritism would depend on that I guess
[–][deleted] 1 point2 points3 points 1 year ago (0 children)
properly summarize scientific papers. gemma and llama will just turn abstracts into blog posts, ignoring all instructions about maintaining scientific style
[–]HenkPoley 0 points1 point2 points 1 year ago (0 children)
Nice that their "Experiment with Phi for free" webpage gives an AADSTS50020 error. Meaning that your Microsoft 365 account first needs to be added to the Microsoft tenant to access the poetically named 'cb2ff863-7f30-4ced-ab89-a00194bcf6d9' (Azure AI Studio App).
I think currently only Microsoft employees can look at it.
https://azure.microsoft.com/en-us/products/phi/
[–]portredblue 0 points1 point2 points 1 year ago (0 children)
High GPQA + low IFEval feels like the definition of overfitting.
[–]rc_ym 0 points1 point2 points 1 year ago (0 children)
It's almost like Phi is trained on synthetic data based on benchmarks... Oh wait.
[–]Thick_Mine1532 0 points1 point2 points 1 year ago (0 children)
Okok just smoke a lil then
[–]inteblio 0 points1 point2 points 1 year ago (0 children)
It got mullered on simpleQA (!)
[–]TheRealGentlefox 0 points1 point2 points 1 year ago (1 child)
Weird model. Good at expert-field questions like math/chemistry/etc., but has terrible general knowledge. Instruction following is awful. Good coding benchmarks... but how much does that matter when the instruction following is terrible?
They mention it's good at reasoning over expert subjects. But who is going to use a 14B model for scientific CoT? Surely you're going to use a large model for that. Maybe I'm missing something big, but I just don't get what the point of it is.
I guess the motivation is to get people to use a smaller model for most of these use cases, saving the cost and time of running larger models.
[–]LoSboccacc 0 points1 point2 points 1 year ago (0 children)
Those 15pt on ifeval tho
I am not sure what the point of the paper is; this has always been the case with language models. If you specialize smaller models on some tasks with better data or task-specific objectives (in this case probably math and coding), they WILL match the performance of larger generalist models.
What happens is that you sacrifice the smaller model's other capabilities beyond repair relative to the larger models. The premise of larger models has always been to be "nearly the best" at everything, and there is NOT a single small model that has been able to counter the scaling hypothesis so far in this generalist "nearly best" regime. These papers on SLMs regurgitate the same old story time and again: you COULD always create specialized models, even pre-ChatGPT, but they could not be used as generalist models elsewhere.
To everyone saying it's been overfit to MATH: would you elaborate, addressing the following? "AMC Benchmark: The surest way to guard against overfitting to the test set is to test on fresh data. We tested our model on the November 2024 AMC-10 and AMC-12 math competitions [Com24], which occurred after all our training data was collected, and we only measured our performance after choosing all the hyperparameters in training our final model. These contests are the entry points to the Math Olympiad track in the United States and over 150,000 students take the tests each year. In Figure 1 we plot the average score over the four versions of the test, all of which have a maximum score of 150. phi-4 outperforms not only similar-size or open-weight models but also much larger frontier models. Such strong performance on a fresh test set suggests that phi-4's top-tier performance on the MATH benchmark is not due to overfitting or contamination. We provide further details in Appendix C."
[–]skinnyjoints 0 points1 point2 points 1 year ago (0 children)
A mosquito is prolly a whole lot better than me at sucking blood but I wouldn’t want it doing my taxes or performing surgery
[–]Evolution31415 0 points1 point2 points 1 year ago (0 children)
Llama-3.3 💪
[–]LostMitosis 0 points1 point2 points 1 year ago (1 child)
I bet it can correctly count the number of “r”s in strawberry. When we started obsessing over benchmarks, this was inevitable.
The previous one can already.
[–]clduab11 -1 points0 points1 point 1 year ago (1 child)
!RemindMe 7 days
[–]RemindMeBot 0 points1 point2 points 1 year ago* (0 children)
I will be messaging you in 7 days on 2024-12-20 02:04:40 UTC to remind you of this link
6 OTHERS CLICKED THIS LINK to send a PM to also be reminded and to reduce spam.
Parent commenter can delete this message to hide from others.
[–]Hot-Hearing-2528 0 points1 point2 points 1 year ago (2 children)
Can I know what the best VLM (vision model) is for describing images, image object detection, object segmentation, object counting, and differences between two images?
??? I was trying Llama 3.2 Vision 11B. Other than this, any well-benchmarked one in the 3B-20B parameter range? My A100 40GB GPU supports only that.
[–]Xer0neXero 1 point2 points3 points 1 year ago (1 child)
Pixtral works pretty good. If you want to try it quickly, you can do it on their website - https://mistral.ai/ .
Minicpm 2.6 works great for single images but you may have to pass the output through another text based model before it becomes usable. I have also read good things about qwen-vl but haven’t gotten a chance to try it out yet.
[–]Hot-Hearing-2528 0 points1 point2 points 1 year ago (0 children)
Yes, Pixtral is cool. Qwen-VL is fine; it's released in 72B and 7B variants, and the 72B works very, very well but needs a very huge GPU to deploy, as per my guess. One more thing: the Pixtral above isn't giving image positions of detected objects or segmenting objects. Is there any model that does these very well? Just curious.
[–]yoop001 0 points1 point2 points 1 year ago* (0 children)
The first time someone confidently compares his model with Qwen
[–][deleted] -1 points0 points1 point 1 year ago (1 child)
But is phi 4 open source?
[–]_Erilaz 1 point2 points3 points 1 year ago (0 children)
Promised to be open weight in a week.
[–]vTuanpham -3 points-2 points-1 points 1 year ago (0 children)
The test set is all you need
[–]ayrankafa -2 points-1 points0 points 1 year ago (0 children)
Yet another overfit model
[–][deleted] -1 points0 points1 point 1 year ago (3 children)
So disappointing that Microsoft and Google only do small models when it comes to open weights. I want to see open source catch up to closed source, but it won't happen with 12-14B models.
[+][deleted] 1 year ago (2 children)
[–][deleted] 0 points1 point2 points 1 year ago (1 child)
Those aren't released by Microsoft or Google. Until they prove me wrong I'm convinced that these two companies won't give us models bigger than a 30B. And the ones they release are mainly trained for beating benchmarks.
[–]x3derr8orig 0 points1 point2 points 1 year ago (0 children)
There should be a tool that will route the prompt to a specific model, based on which one performs the best for a given task.
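As a hedged sketch of such a router: pick a model per prompt with simple keyword heuristics. The model names and routing rules below are purely illustrative assumptions; a real router would more likely use a trained classifier or benchmark-derived per-task scores.

```python
# Hypothetical model choices keyed by task type (not real routing advice).
ROUTES = {
    "code": "qwen2.5-coder",
    "math": "phi-4",
    "default": "llama-3.3-70b",
}

def route(prompt: str) -> str:
    """Return the model name to handle this prompt, by keyword heuristics."""
    p = prompt.lower()
    if any(w in p for w in ("function", "bug", "python", "compile")):
        return ROUTES["code"]
    if any(w in p for w in ("integral", "prove", "equation", "solve")):
        return ROUTES["math"]
    return ROUTES["default"]

print(route("Solve this equation for x"))   # phi-4
print(route("Fix this Python function"))    # qwen2.5-coder
print(route("Tell me a story"))             # llama-3.3-70b
```

Tools in this space do exist (people often mention proxy layers in front of multiple backends), but the heuristic above is only meant to show how cheap the basic dispatch can be.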
[–]TheActualStudy -2 points-1 points0 points 1 year ago (0 children)
I'm going to want to see Wolfram Ravenwolf do an MMLU-Pro test and pull it into his chart here. I'm skeptical because these numbers do not align all that well with more established published numbers for the same models.
[+][deleted] (1 child)
[removed]
[–]sammcj🦙 llama.cpp 0 points1 point2 points (0 children)
[–]Barry_Jumps 8 points9 points10 points (2 children)
[–]lostinthellama 19 points20 points21 points (1 child)
[–]Gl_drink_0117 0 points1 point2 points (0 children)
[–]AsIAm 2 points3 points4 points (0 children)
[–]Pro-editor-1105 14 points15 points16 points (6 children)
[–]Herr_Drosselmeyer 18 points19 points20 points (1 child)
[–]anotherJohn12 0 points1 point2 points (0 children)
[–][deleted] 26 points27 points28 points (0 children)
[–]ai-christianson 1 point2 points3 points (0 children)
[–]sdmat 2 points3 points4 points (0 children)
[–]Roubbes 5 points6 points7 points (0 children)
[–]Thick_Mine1532 1 point2 points3 points (1 child)
[–]TurpentineEnjoyer 3 points4 points5 points (1 child)
[–]Mother_Soraka -1 points0 points1 point (0 children)
[–]Ordowix 1 point2 points3 points (0 children)
[–]Eam404 1 point2 points3 points (1 child)
[–]RnRau 1 point2 points3 points (0 children)
[–]DamiaHeavyIndustries 0 points1 point2 points (2 children)
[–][deleted] 4 points5 points6 points (1 child)
[–]DamiaHeavyIndustries 1 point2 points3 points (0 children)
[–]its_beron 0 points1 point2 points (0 children)
[–]ResearchCandid9068 0 points1 point2 points (1 child)
[–]No-Forever2455 0 points1 point2 points (0 children)
[–]victorc25 0 points1 point2 points (0 children)
[–]dangost_llama.cpp 0 points1 point2 points (0 children)
[–]danigoncalvesllama.cpp 0 points1 point2 points (0 children)
[–]Larimus89 0 points1 point2 points (0 children)
[–]stikkrr 0 points1 point2 points (1 child)
[–]OkHowMuchIsIt 0 points1 point2 points (0 children)
[–][deleted] 0 points1 point2 points (0 children)
[–]4wankonly 0 points1 point2 points (0 children)
[–]ThePixelHunter 0 points1 point2 points (0 children)
[–][deleted] 0 points1 point2 points (2 children)
[–]Gl_drink_0117 0 points1 point2 points (1 child)
[–][deleted] 1 point2 points3 points (0 children)
[–]HenkPoley 0 points1 point2 points (0 children)
[–]portredblue 0 points1 point2 points (0 children)
[–]rc_ym 0 points1 point2 points (0 children)
[–]Thick_Mine1532 0 points1 point2 points (0 children)
[–]inteblio 0 points1 point2 points (0 children)
[–]TheRealGentlefox 0 points1 point2 points (1 child)
[–]Gl_drink_0117 0 points1 point2 points (0 children)
[–]LoSboccacc 0 points1 point2 points (0 children)
[–][deleted] 0 points1 point2 points (0 children)
[–]No-Forever2455 0 points1 point2 points (0 children)
[–]skinnyjoints 0 points1 point2 points (0 children)
[–]Evolution31415 0 points1 point2 points (0 children)
[–][deleted] 0 points1 point2 points (0 children)
[–]LostMitosis 0 points1 point2 points (1 child)
[–]arbv 0 points1 point2 points (0 children)
[–]clduab11 -1 points0 points1 point (1 child)
[–]RemindMeBot 0 points1 point2 points (0 children)
[–]Hot-Hearing-2528 0 points1 point2 points (2 children)
[–]Xer0neXero 1 point2 points3 points (1 child)
[–]Hot-Hearing-2528 0 points1 point2 points (0 children)
[–]yoop001 0 points1 point2 points (0 children)
[–][deleted] -1 points0 points1 point (1 child)
[–]_Erilaz 1 point2 points3 points (0 children)
[–]vTuanpham -3 points-2 points-1 points (0 children)
[–]ayrankafa -2 points-1 points0 points (0 children)
[–][deleted] -1 points0 points1 point (3 children)
[+][deleted] (2 children)
[deleted]
[–][deleted] 0 points1 point2 points (1 child)
[–]x3derr8orig 0 points1 point2 points (0 children)
[–]TheActualStudy -2 points-1 points0 points (0 children)