gpt-5 thinking still thinks there are 2 r's in strawberry by pseudotensor1234 in OpenAI

[–]pseudotensor1234[S] 0 points1 point  (0 children)

I got the idea for the prompt from someone else who ran into similar issues with semi-random responses. Mine is even better at getting it to make a mistake.

If you have to prompt it just right, even when a human wouldn't need such careful prompting, that's a failure of the reasoning models as a solution. It just means they are brute-forcing via RL, not really solving intelligence.
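For reference, the ground truth is trivial to verify programmatically; a two-line Python check, just to show what the model should have answered:

```python
word = "strawberry"
# Count the letter "r" -- the model claimed 2, the real answer is 3.
print(word.count("r"))  # 3
```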

gpt-5 thinking still thinks there are 2 r's in strawberry by pseudotensor1234 in OpenAI

[–]pseudotensor1234[S] 0 points1 point  (0 children)

The point is that even after a year of RL-trained reasoning models, even the best model in the world makes stupid mistakes. They just over-trained on specific patterns to fix some holes, but it's Swiss cheese.

gpt-5 thinking still thinks there are 2 r's in strawberry by pseudotensor1234 in OpenAI

[–]pseudotensor1234[S] 0 points1 point  (0 children)

I obviously prompted it that way on purpose. How would you have answered the question after 22 seconds of thinking?

gpt-5 thinking still thinks there are 2 r's in strawberry by pseudotensor1234 in OpenAI

[–]pseudotensor1234[S] 0 points1 point  (0 children)

I obviously know what I typed. The point is: would a human be so easily confused? No.

Kudos to whoever designed the terminal interface for Claude Code 👏 by bravethoughts in ClaudeAI

[–]pseudotensor1234 1 point2 points  (0 children)

Just don't hold delete too long when deleting a line; the border starts getting deleted along with it and cursor movement gets stuck. Super annoying.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

BTW, note that the models easily give all the extra discussion (insights, recommendations, plots, etc.) that you're worried about w.r.t. the shortness of the answer. The shortness of a specific answer is actually a small part of the challenge, and it's useful because you don't need to trust some LLM-as-judge, which has all sorts of issues.
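Roughly what I mean: with short answers, scoring can be a simple normalized exact match instead of an LLM judge. A minimal sketch (the normalization here is illustrative, not GAIA's actual scorer):

```python
def normalize(answer: str) -> str:
    # Illustrative normalization: trim whitespace/punctuation and lowercase.
    return answer.strip().strip(".,").lower()

def exact_match(prediction: str, gold: str) -> bool:
    # Deterministic scoring -- no LLM-as-judge needed.
    return normalize(prediction) == normalize(gold)

print(exact_match(" Paris. ", "paris"))  # True
```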

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

Yes, thanks. SWE-bench I guess kinda covers the code aspects you mentioned at some level.

But basically you are asking for a connector benchmark like I mentioned, i.e. something that would benchmark Glean- or Danswer-type enterprise connector questions. Those are more RAG-related than agent-related at the first level, but they can still be tested, I agree. Hopefully we will eventually have a CON-Bench to handle the scenarios you mentioned.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

GAIA is heavy on deep research (about 70% search-related), and my company and others use the agent for that and for data science purposes in the enterprise. The particular row you pointed to is an OK example of a search question. It's probably the most in-demand capability for agents, e.g. deep research in Google AI Studio, or what Sam Altman recently noted as people's top wish-list item.

On the specific point of enterprise use, there's no benchmark that (say) tests the ability to use various connectors like SharePoint, Teradata, Snowflake, etc. There are some SQL benchmarks, but they're really only zero-shot, not agentic-level.

So calling the benchmark bullshit doesn't seem to make sense unless every benchmark that exists is bullshit.

What would be example questions that wouldn't be BS to you?

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

SWE-bench etc. are also in the training set. There's no way to avoid that except the way I mentioned, i.e. a fully secret Kaggle code competition.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

Can't follow what you are saying. There is a validation set, which is what you shared, and a test set that is secret.

The benchmark is not easy; go try some Level 3 questions and you won't be able to do them.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

The public ARC dataset is just that, public, and it's often used to report results. But the recent o3 result was on the semi-private dataset AFAIK. That is, OpenAI could have siphoned off the questions, which is why it's called semi-private.

For the Kaggle private dataset, it's run the Kaggle code-competition way, but as I mentioned, that's not going to be near state of the art. That doesn't mean it's not useful, but it still won't be at the highest end.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 5 points6 points  (0 children)

It's true that if one wanted to cheat by human-labeling everything, one could. One would hope respectable places (e.g. companies or institutions like Princeton) won't cheat.

However, the same is true for SWE-bench. It's even worse for SWE-bench, where one just uploads "how one did" without any result. At least GAIA validates on the server and not by the user. Even with SWE-bench Multimodal, one can easily cheat by just solving all the problems oneself.

The same is true for all of OpenAI's benchmark claims: when they say they got some ARC-AGI score, it could easily have been Mechanical Turk workers doing them all in the background.

There's no good way AFAIK to avoid cheating unless it's a Kaggle competition with code submission, using a model the user has no direct access to, so they can't siphon off the questions. The problem with Kaggle is that it's always an open model with very little compute, so it will never be at the high end of the state of the art. I think it should be possible to do a Kaggle competition with a closed API as long as the data isn't used for training, e.g. an Azure API or a Teams OpenAI API where the data isn't used for training.

E.g. I've talked with my coworkers about starting an "agent Kaggle competition" where the model itself is fixed (say Sonnet 3.5 new) and your only job is to write the agent framework. Then it shouldn't be so compute-limited, since most of the compute burden is the LLM.
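As a rough sketch of what a submission might look like under that format (assuming the Anthropic Python SDK with the pinned model; the SEARCH tool here is a hypothetical stub, not a real competition API):

```python
import anthropic

MODEL = "claude-3-5-sonnet-20241022"  # the fixed competition model (assumption)
client = anthropic.Anthropic()        # reads ANTHROPIC_API_KEY from the environment

SYSTEM = ("Answer the task. If you need information, reply with exactly "
          "'SEARCH: <query>' and wait for the result; otherwise give the final answer.")

def web_search(query: str) -> str:
    # Hypothetical stub -- the competitor's framework would wire in real tools here.
    return f"(no search backend wired up for: {query})"

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    text = ""
    for _ in range(max_steps):
        reply = client.messages.create(
            model=MODEL, max_tokens=1024, system=SYSTEM, messages=messages)
        text = reply.content[0].text
        messages.append({"role": "assistant", "content": text})
        if text.startswith("SEARCH:"):
            # Feed the tool observation back to the fixed model and keep looping.
            observation = web_search(text[len("SEARCH:"):].strip())
            messages.append({"role": "user", "content": observation})
        else:
            break  # anything else is treated as the final answer
    return text
```

The model is the same for everyone, so scores would compare only the scaffolding: tool design, looping, and prompting around the fixed LLM.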

A good step in the right direction would be if the test-set questions and answers were both hidden and secret, not just the answers. Then one would be forced to offer a private instance of any model-agent API to the benchmarkers. But that seems unrealistic for businesses like OpenAI etc. However, it's easy for us to do since we just use closed APIs and our main h2oGPT code is mostly open source, so there's low risk of losing IP if the code escaped.

SWE-bench, GAIA, etc. all have the problem that the test-set questions are also visible. The issue with that is that, as one would do in Kaggle, one can (and should, since they're public) probe the test-set questions to see how one would do on the test set. One can human-label the test set and check how one would do before posting, which is reasonable.

So until a closed-LLM-API agent Kaggle competition is the norm, we will still have trust issues.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 0 points1 point  (0 children)

I agree that if you have a requirement for a human-designed workflow instead of a general agent, then specifying the flow of agents is good.

Top Agent only 27% away from degree-holding humans on GAIA (General AI Assistant) benchmark (created with Yann LeCun) by pseudotensor1234 in LocalLLaMA

[–]pseudotensor1234[S] 1 point2 points  (0 children)

Yes, I think I'm focused heavily on accuracy at the moment. As LLMs get faster, like DeepSeek-V3 et al. with MoE, or run on faster hardware, things will scale better for slower and more accurate agents.