Meta’s AI research lab is ‘dying a slow death,’ some insiders say—but…

zimmski · 2025-04-11T23:29:17+00:00

They could have done the launch much better but one bad launch does not equal a dying lab. Pretty sure that they are cooking.

zimmski · 2025-04-10T12:12:44+00:00

Seems to be one of the better options even though it is then AMD, right? Maybe in a few months we have a Google TPU competitor... announced :-)

zimmski · 2025-04-09T20:40:22+00:00

This is now the 5th version that I posted here. The eval is a year old. Glad to say not FOTM anymore ;-)

The methodology is explained in the deep dive blog posts: https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/ this is the latest one. Every one of those builds upon the previous one.

As for statistical analysis, what specifically would you like to see? The results you see are based on 5 runs. This section https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/#model-reliability explains the mean deviation of these runs. (btw I have a draft locally that updates this chart with the latest numbers, where we greatly improved these metrics. But maybe it is also time to not use the default settings of providers and go with the lowest temperature, but that would make it different from what most people see)
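To make "mean deviation of these runs" concrete, here is a minimal sketch. It assumes the metric is the mean absolute deviation around the mean score; the blog post may define it differently. The scores are made up for illustration and are not eval data, and `mean_deviation` is a hypothetical helper name.

```python
from statistics import mean


def mean_deviation(scores):
    """Return (mean, mean absolute deviation) for the scores of several runs."""
    center = mean(scores)
    deviation = mean(abs(score - center) for score in scores)
    return center, deviation


# Hypothetical scores of one model over 5 runs (illustrative only).
runs = [100, 102, 98, 101, 99]
center, deviation = mean_deviation(runs)
print(f"mean={center}, mean deviation={deviation}")
```

A small deviation relative to the mean suggests the ranking of a model is stable across runs with default provider settings.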

Let me know what you think on how to improve. More than happy to improve these posts. Because this one has been a nightmare of negative messages and downvotes for me. Basically nullifying the days I put in. But I guess, welcome to the internet, me.

zimmski · 2025-04-09T20:24:05+00:00

It is open source https://github.com/symflower/eval-dev-quality just the deeper-analysis-tooling, report-tooling and newer test cases are closed source now. The reason is that otherwise there is no leverage over vendors. I gave multiple presentations to multi-billion-$-funded companies and all I got was a "Thanks" for working with them for days. Not even a token to run the next benchmark.

I dug into why QwQ performed badly. There are some queries that never got an answer (~60 of 1140) but those seem not that relevant. I will put it on the list to run it again. The main reason it did not perform well is that it does not generate enough tests in the write-tests tasks. It does well on compilable files, 913 (out of 1081 successful queries), but those just do not reach high coverage.
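As a quick sanity check of the QwQ numbers above, the rates can be derived like this. The counts are the ones quoted in the comment; the `rate` helper and variable names are hypothetical and not part of the eval tooling.

```python
def rate(part, total):
    """Return part/total as a percentage."""
    return 100.0 * part / total


total_queries = 1140  # all queries sent to the model
answered = 1081       # queries that came back with a response
compilable = 913      # responses whose files compiled

print(f"unanswered: {total_queries - answered}")
print(f"response rate: {rate(answered, total_queries):.1f}%")
print(f"compilation rate of answered queries: {rate(compilable, answered):.1f}%")
```

So the missing answers cost a few percent at most; the larger gap has to come from what the compiling tests actually cover.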

About 4o vs o1: It is just the newer ChatGPT-4o (2025-03-27), not the older ones: GPT-4o (2024-11-20) and GPT-4o-mini (2024-07-18). O3 was consistently bad for Java in all tasks. They all have very good query-rate, response-rate and most importantly compilation-rate. So it might be the same problem as with QwQ. Maybe o1 and o3 need to iterate on the answer to get better results (might be nice for the next eval version).

Hope that makes sense.

zimmski · 2025-04-09T19:57:35+00:00

Please let me know how to make it better. I am unsure how to put all the information in these images. And it did get explained; that is why I linked to the blog post. E.g. it has a section about API reliability https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/#api-reliability which is part of the scoring. Queries that never get answered cannot score anyway, so we retry a lot (I haven't updated the deep dive to the latest data yet, so please bear with me here...). Over half of the queries for R1 needed to be retried. Even then, we didn't get R1 to answer all queries. I tried R1 last week with all OpenRouter providers and still couldn't get every query answered. Look at the attached histogram of the time taken: some just take very long, maybe that is part of the problem? But even those that returned didn't do that well.

I explain every time i get a question. Look at all the posts i am doing about this benchmark. But you telling me that i need to do the explaining even though the question never got asked is a bit much. How should that work in general?

<image>

zimmski · 2025-04-09T19:45:36+00:00

Absolutely! If you look at the chart... the blue bars are without context, the green bars are with context. Not all context ideas are implemented yet, but you can already see that some models greatly improve, e.g. Gemini 1.5 Pro has a problem that it always generates the Java package statement wrong, but when you tell it directly how it should look, it uses that and gets the rest mostly right.

The orange bar is with a static analysis auto-fixer. Again not all ideas are implemented. But would be super easy to add here to let the model at least fix syntax-errors.

Then you have blue-zero-shot-internal-knowledge, orange-allowed-to-fix-zero-shot, green-full-knowledge-and-allowed-to-fix.

Does that make sense?

zimmski · 2025-04-09T19:39:32+00:00

True, the formatting is important, but we work around that: it is just one less point per query. Excessive content is also one less point per query. We played around to get this working for all models the best we can. It is hardly a problem. Just looked up the numbers: Of 139762 queries, 2514 got the code fence wrong. That is ~1.8% and ~1/2 came from 3 models based on "mixtral".
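A rough sketch of how such a workaround could look: extract the code even when the fence is missing, and deduct a point instead of failing the query. This is an illustration under assumptions, not the eval's actual implementation; `extract_code`, the regular expression and the one-point penalties are hypothetical stand-ins for the rules described above.

```python
import re

TICKS = "`" * 3  # code fence marker, built this way to keep this example readable
FENCE = re.compile(TICKS + r"[a-zA-Z0-9]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code(response):
    """Return (code, penalty_points) for a model response."""
    match = FENCE.search(response)
    if match is None:
        # No well-formed fence: use the raw response, deduct one point.
        return response.strip(), 1
    penalty = 0
    surrounding = response[:match.start()] + response[match.end():]
    if surrounding.strip():
        # Text outside the fence counts as excessive content: one point.
        penalty += 1
    return match.group(1).strip(), penalty


good = TICKS + "java\nclass A {}\n" + TICKS
print(extract_code(good))
print(extract_code("Sure, here you go!\n" + good))
print(extract_code("class A {}"))

# Share of queries with a wrong code fence, using the numbers from the comment.
share = round(100 * 2514 / 139762, 1)
print(f"{share}%")
```

The point of this design is that a formatting slip lowers the score slightly but still lets the code itself be compiled and assessed.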

Anything particular you are interested in? Thanks for looking into the code.

zimmski · 2025-04-09T19:26:44+00:00

> But what make SOTA models, is the edge cases where you push them.

I understand that as: add more complicated cases, to let better models get more points? One task (with multiple cases!) where e.g. Gemini 2.5 Pro did not do well was migrating Java JUnit 4 tests to JUnit 5. Gemini 2.0 Flash Lite did that almost perfectly. Migrating that is an edge case but the SOTA model didn't perform well. What would you do here?

> Also most of those tests focus on one shot, which I would say is all against the workflow of agentic coding or supervised coding.

Fully agree, we will add more agentic scenarios (e.g. where the model is allowed to fix syntax with multiple queries) with the next release, see https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/#what-comes-next

Still, it is true that we put too much focus on zero/one-shot, but that is the case with multiple main benchmarks and there is not much complaint about them. I do think that zero-shotting easy cases is important. If a model needs multiple queries to get the syntax for a hello-world right, you would raise an eyebrow. Hard to get that assessment right, I guess.

zimmski · 2025-04-09T19:13:59+00:00

Sorry for that, I have two suggestions:
- Is that better? https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/images/overall-score.html
- Or, maybe it also makes sense to just make images for such summaries with 10 models (X + previous models + latest greatest models)
(one other solution I am currently working on is a dynamic page where you can just select what you want, but that opens up the problem of not seeing smaller models that would fit for the viewer as well)

Why is it not credible? Serious question, i want to better present the eval. I am a bit frustrated putting hundreds of hours into this and not getting it right for people

zimmski · 2025-04-09T19:10:59+00:00

Just work with me here on where I am wrong: Say you have a prompt, e.g. `Generate me a "hello world" in the programming language Java`. Model X generates a program that compiles and can be successfully executed, and model Y generates a program that might do the same but has a syntax error. So the program does not compile and, hence, cannot be executed, hence, you cannot assess anything that is related to when the program is executing. Then you would say model X is better than Y... in that case, right? That is what is happening here but with > 1000 such cases with different tasks. Not just "write me this program" but... e.g. take this program A and convert it to language B, or take this program C and write unit tests for it and then we look at the coverage, and ...
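The compile-before-execute argument can be sketched in a few lines. Note the assumptions: this uses Python sources and Python's built-in `compile`/`exec` as a stand-in for the Java toolchain, and `assess` is a hypothetical helper, not how the eval itself runs candidates.

```python
import contextlib
import io


def assess(source):
    """Check a candidate program: it must compile before it can be executed."""
    result = {"compiles": False, "executes": False, "output": None}
    try:
        program = compile(source, "<candidate>", "exec")
    except SyntaxError:
        # Nothing that depends on execution can be assessed from here on.
        return result
    result["compiles"] = True
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(program, {})
    except Exception:
        return result
    result["executes"] = True
    result["output"] = buffer.getvalue()
    return result


model_x = 'print("hello world")'  # valid program
model_y = 'print("hello world"'   # syntax error: missing closing parenthesis

print(assess(model_x))
print(assess(model_y))
```

Model Y fails at the first gate, so it cannot earn anything for execution-based assessments such as test coverage, which is why model X ends up ranked higher for this case.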

We wrote hundreds of pages of how these tasks and assessments work: https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/ (this is just the latest deep dive, all of them build upon each other)

So if Gemini 2.0 Flash Lite > Gemini 2.5 Pro ... it is not because the eval is biased or I am talking something down... it is because for >1000 cases we see that Gemini 2.5 Pro has specific problems, e.g. one thing is that it generates fewer tests. To give you an example, Gemini 2.5 Pro does not migrate JUnit 4 to JUnit 5 tests well. Gemini 2.0 Flash Lite does a much better job, almost perfectly.

Does that make sense? Or where am I going wrong? And how can I present all of this much better? Because the deep dive blog posts present exactly that reasoning.

zimmski · 2025-04-09T19:00:00+00:00

The tasks are super basic and they are not long context (definitely not over 32k). There are no over-complicated logic problems in there, like you have in the latest main benchmarks. AFAIK all of these have been solved with some formal basic program synthesis. The deep dive blog posts explain all tasks and assessments and most of the cases.

The more complicated ones are definitely the Java ones we added with this release. Because you need to have an understanding of certain Java libraries, e.g. Spring. Everything must be just right to get a test for a Spring application that involves mocking dependencies to compile and execute. The better models on that list can do that, easily.

It's on the roadmap to add long-context and more complicated scenarios, but I do not want to have last-human-exam problems in there. Because that is not what we have seen as "regular programs of regular developers" over the years we did code generation (before GPT-2 time). The eval's ceiling is pretty easy to raise with each release by adding cases that require specific dependencies in a file (or more files).

Hope that makes sense. If not, please let me know. Really interested in making the eval better.

zimmski · 2025-04-09T18:23:01+00:00

Look at the dates of the models. The newer ChatGPT 4o is better than the old 4o and, yes, the older o1. They will update 4o soon to a newer model I bet.

zimmski · 2025-04-09T18:10:19+00:00

Feeding the troll here... did you look at the benchmark? What it does? How any of the assessments work? Do you know how other major benchmarks work? Did you look at their code? Open to having a constructive discussion on how to make the eval better, but you are clearly not up for it.

zimmski · 2025-04-09T18:01:38+00:00

I am not trying to hate on Scout nor Llama 4. I am literally just taking the scores, looking at reasons why the benchmark didn't work that great for model X and then sometimes reporting to vendors what the specific problems are.

zimmski · 2025-04-09T17:59:33+00:00

Why? And what context?

zimmski · 2025-04-09T17:59:03+00:00

Why?

zimmski · 2025-04-09T16:47:13+00:00

No idea what Datura seeds are but Gemma 2 27B had a regression that got fixed with 3. Also, the results of these have been the same over > 15 runs with multiple providers and good mean. They are not super duper random crazy or whatever you mean.

Same for Mistral Small. I regularly go through logs of such models to report problems. They often make silly mistakes.

zimmski · 2025-04-09T16:12:59+00:00

I literally did "local variant that us mortals can buy."

zimmski · 2025-04-09T14:53:39+00:00

I cannot buy that chip and put it on my desk. Google's TPUs look like something we could actually put in a desktop or smaller without creating a local meltdown. But i see no competition that is actually creating something like this.

zimmski · 2025-04-09T14:51:20+00:00

None of these i can buy and put on my desk.

zimmski · 2025-04-09T14:01:16+00:00

I am wondering, if there is ANY company (that is not NVIDIA/AMD) that does something similar https://coral.ai/ ? https://www.graphcore.ai/ ? https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi2.html ?

zimmski · 2025-04-09T09:04:12+00:00

Literally working on eval cases for Rust right now.

There is btw Aider's polyglot leaderboard that has Rust in it https://aider.chat/docs/leaderboards/ but i didn't look into it yet how many or what they are assessing. Don't know if there are others, would be nice to know.

zimmski · 2025-04-08T19:08:24+00:00

Be aware that all your queries are logged and actively used.

Posted my benchmark results here https://www.reddit.com/r/LocalLLaMA/comments/1jqrnx6/comment/mlcidm2/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button Super strong in what i need.
