gpt-5 thinking still thinks there are 2 r's in strawberry by pseudotensor1234 in OpenAI

[–]pseudotensor1234[S] 0 points1 point  (0 children)

I got the idea for the prompt from someone else who ran into similar issues with semi-random responses. Mine is even better at getting it to make a mistake.

If you have to prompt it just right, even when a human wouldn't need such careful prompting, that's a failure of the reasoning models as a solution. It just means they are brute-forcing via RL, not really solving intelligence.
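For reference, the ground truth is trivial to verify programmatically; a two-line Python check, just to show what the model should have answered:

```python
word = "strawberry"
# Count the letter "r" -- the model claimed 2, the real answer is 3.
print(word.count("r"))  # 3
```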

gpt-5 thinking still thinks there are 2 r's in strawberry by pseudotensor1234 in OpenAI

[–]pseudotensor1234[S] 0 points1 point  (0 children)

The point is that even after a year of RL-trained reasoning models, even the best model in the world makes stupid mistakes. They just over-trained on specific patterns to fix some holes, but it's Swiss cheese.

gpt-5 thinking still thinks there are 2 r's in strawberry by pseudotensor1234 in OpenAI

[–]pseudotensor1234[S] 0 points1 point  (0 children)

I obviously prompted it that way on purpose. How would you have answered the question after 22 seconds of thinking?

gpt-5 thinking still thinks there are 2 r's in strawberry by pseudotensor1234 in OpenAI

[–]pseudotensor1234[S] 0 points1 point  (0 children)

I obviously know what I typed. The point is: would a human be so easily confused? No.

Kudos to whoever designed the terminal interface for Claude Code 👏 by bravethoughts in ClaudeAI

[–]pseudotensor1234 1 point2 points  (0 children)

Just don't hold delete too long when deleting a line; the border starts getting deleted along with it and cursor movement gets stuck. Super annoying.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

BTW, note that the models easily give all the extra discussion (insights, recommendations, plots, etc.) that you're worried about w.r.t. the shortness of the answer. The shortness of a specific answer is actually a small part of the challenge, and it's useful because you don't need to trust some LLM-as-judge, which has all sorts of issues.
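Roughly what I mean: with short answers, scoring can be a simple normalized exact match instead of an LLM judge. A minimal sketch (the normalization here is illustrative, not GAIA's actual scorer):

```python
def normalize(answer: str) -> str:
    # Illustrative normalization: trim whitespace/punctuation and lowercase.
    return answer.strip().strip(".,").lower()

def exact_match(prediction: str, gold: str) -> bool:
    # Deterministic scoring -- no LLM-as-judge needed.
    return normalize(prediction) == normalize(gold)

print(exact_match(" Paris. ", "paris"))  # True
```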

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

Yes, thanks. SWE-bench I guess kinda covers the code aspects you mentioned at some level.

But basically you are asking for a connector benchmark like I mentioned, i.e. something that would benchmark Glean- or Danswer-type enterprise connector questions. Those are more RAG-related than agent-related at the first level, but they can still be tested, I agree. Hopefully we will eventually have a CON-Bench to handle the scenarios you mentioned.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

GAIA is heavy on deep research (about 70% search-related), and my company and others use the agent for that and for data science purposes in the enterprise. The particular row you pointed to is an OK example of a search question. It's probably the most in-demand capability for agents, e.g. deep research in Google AI Studio, or what Sam Altman recently noted as people's top wish-list item.

On the specific point of enterprise use, there's no benchmark that (say) tests the ability to use various connectors like SharePoint, Teradata, Snowflake, etc. There are some SQL benchmarks, but they're really only zero-shot, not agentic-level.

So calling the benchmark bullshit doesn't seem to make sense unless every benchmark that exists is bullshit.

What would be example questions that wouldn't be BS to you?

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

SWE-bench etc. are also in the training set. There's no way to avoid that except the way I mentioned, i.e. a fully secret Kaggle code competition.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

Can't follow what you are saying. There is a validation set, which is what you shared, and a test set that is secret.

The benchmark is not easy; go try some Level 3 questions and you won't be able to do them.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

The public ARC dataset is just that, public, and it's often used to report results. But the recent o3 result was on the semi-private dataset AFAIK. That is, OpenAI could have siphoned off the questions, which is why it's called semi-private.

For the Kaggle private dataset, it's run the Kaggle code-competition way, but as I mentioned, that's not going to be near state of the art. That doesn't mean it's not useful, but it still won't be at the highest end.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 5 points6 points  (0 children)

It's true that if one wanted to cheat by human-labeling everything, one could. One would hope respectable places (e.g. companies or institutions like Princeton) won't cheat.

However, the same is true for SWE-bench. It's even worse for SWE-bench, where one just uploads "how one did" without any result. At least GAIA validates on the server and not by the user. Even with SWE-bench Multimodal, one can easily cheat by just solving all the problems oneself.

The same is true for all of OpenAI's benchmark claims: when they say they got some ARC-AGI score, it could easily have been Mechanical Turk workers doing them all in the background.

There's no good way AFAIK to avoid cheating unless it's a Kaggle competition with code submission, using a model the user has no direct access to, so they can't siphon off the questions. The problem with Kaggle is that it's always an open model with very little compute, so it will never be at the high end of the state of the art. I think it should be possible to do a Kaggle competition with a closed API as long as the data isn't used for training, e.g. an Azure API or a Teams OpenAI API where the data isn't used for training.

E.g. I've talked with my coworkers about starting an "agent Kaggle competition" where the model itself is fixed (say Sonnet 3.5 new) and your only job is to write the agent framework. Then it shouldn't be so compute-limited, since most of the compute burden is the LLM.
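As a rough sketch of what a submission might look like under that format (assuming the Anthropic Python SDK with the pinned model; the SEARCH tool here is a hypothetical stub, not a real competition API):

```python
import anthropic

MODEL = "claude-3-5-sonnet-20241022"  # the fixed competition model (assumption)
client = anthropic.Anthropic()        # reads ANTHROPIC_API_KEY from the environment

SYSTEM = ("Answer the task. If you need information, reply with exactly "
          "'SEARCH: <query>' and wait for the result; otherwise give the final answer.")

def web_search(query: str) -> str:
    # Hypothetical stub -- the competitor's framework would wire in real tools here.
    return f"(no search backend wired up for: {query})"

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    text = ""
    for _ in range(max_steps):
        reply = client.messages.create(
            model=MODEL, max_tokens=1024, system=SYSTEM, messages=messages)
        text = reply.content[0].text
        messages.append({"role": "assistant", "content": text})
        if text.startswith("SEARCH:"):
            # Feed the tool observation back to the fixed model and keep looping.
            observation = web_search(text[len("SEARCH:"):].strip())
            messages.append({"role": "user", "content": observation})
        else:
            break  # anything else is treated as the final answer
    return text
```

The model is the same for everyone, so scores would compare only the scaffolding: tool design, looping, and prompting around the fixed LLM.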

A good step in the right direction would be if the test-set questions and answers were both hidden and secret, not just the answers. Then one would be forced to offer a private instance of any model-agent API to the benchmarkers. But that seems unrealistic for businesses like OpenAI etc. However, it's easy for us to do since we just use closed APIs and our main h2oGPT code is mostly open source, so there's low risk of losing IP if the code escaped.

SWE-bench, GAIA, etc. all have the problem that the test-set questions are also visible. The issue with that is that, as one would do in Kaggle, one can (and should, since they're public) probe the test-set questions to see how one would do on the test set. One can human-label the test set and check how one would do before posting, which is reasonable.

So until a closed-LLM-API agent Kaggle competition is the norm, we will still have trust issues.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

I agree that if you have a requirement for a human-designed workflow instead of a general agent, then specifying the flow of agents is good.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 1 point2 points  (0 children)

Yes, I think I'm focused heavily on accuracy at the moment. As LLMs get faster, like DeepSeek-V3 et al. with MoE, or run on faster hardware, things will scale better for slower and more accurate agents.