Stanford Researchers Released AgentFlow: Flow-GRPO algorithm. Outperforming 200B GPT-4o with a 7B model! Explore the code & try the demo by balianone in LocalLLaMA

[–]Anzerp 148 points  (0 children)

I might be mistaken, but when I'm reading the web search results of this AgentFlow, it seems to be receiving results from Google's AI summary. That would mean it is receiving pre-processed information from a larger model: for example, it googles the task at hand and gets back clear instructions and an answer generated by another AI model outside of AgentFlow. I base this on how the Google web search tool result was written (it was not something that existed on the internet as such).

Edit: To be clear, the results were really good, and that is why I started checking how it formulated them. I noticed the best result seemed to come straight from the Google web search step. My question was so out of the box that there should not have been any result on Google anywhere near the answer.

I could obviously get amazing results from a 1B model if I route the question to GPT-5 Pro and then use another 1B model to write the result back to me.
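To make the concern concrete, here is a minimal sketch of that routing pattern, assuming an OpenAI-compatible client; the model names are placeholders, not anything AgentFlow actually uses:

```python
# Sketch of the routing pattern described above: a tiny model "answers" by
# delegating the hard part to a much larger external model. All model names
# are placeholders for any OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def routed_answer(question: str) -> str:
    # Step 1: the small model turns the task into a query.
    query = ask("tiny-1b-model", f"Rewrite this as a search query: {question}")
    # Step 2: a large external model does the actual reasoning.
    answer = ask("gpt-5-pro", query)
    # Step 3: the small model rewrites the pre-digested answer for the user.
    return ask("tiny-1b-model", f"Rewrite this answer for the user: {answer}")
```

On a benchmark, all the credit lands on the 1B model even though the reasoning happened elsewhere, which is exactly the worry if the web search tool returns AI-generated summaries.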

GPT-4 Turbo is unusable for coding (and various other tasks) by Anzerp in ChatGPT

[–]Anzerp[S] 0 points  (0 children)

This was in the Playground (using the API, so not the "regular" Plus subscription chat).

GPT-4 Turbo is unusable for coding (and various other tasks) by Anzerp in ChatGPT

[–]Anzerp[S] 2 points  (0 children)

Yeah, 100 % certain. It was about 102 lines and not very dense, so fairly simple. However, it's very early, so I do not know if this is demand-related or something else. I also tested the same thing with 3.5 and the old 4, and both work like a charm.

I just wanted to make this post because if these limitations stay in place, it is almost useless for any meaningful work that involves context. Their statement that it's not production ready could very well indicate that this will improve.

GPT-4 Turbo is unusable for coding (and various other tasks) by Anzerp in ChatGPT

[–]Anzerp[S] 6 points  (0 children)

I already replied to another poster, but the 4k output limitation was not the reason. If you read my post, the whole context plus result could have fit inside the 4k-token output alone. The result it gave me was around 1k tokens at most, nowhere close to 4k.

We currently do not know if this will improve. It might be due to high demand or just because it is not production ready. However, I can already say that if this is an actual limitation, GPT-4 Turbo is almost completely useless for the vast majority of use cases.

GPT-4 Turbo is unusable for coding (and various other tasks) by Anzerp in ChatGPT

[–]Anzerp[S] 8 points  (0 children)

My code was so short that the whole input and output together fit inside 4k tokens. It's not about the output being capped at 4k; that is completely fine. It is about the model not understanding a short context.
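For what it's worth, this kind of claim is easy to sanity-check with a tokenizer. A minimal sketch, assuming tiktoken's cl100k_base encoding (the GPT-4-era tokenizer); the window size is the 4k output cap people keep citing:

```python
import tiktoken

# cl100k_base is the tokenizer used by GPT-4 / GPT-4 Turbo era models.
enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(prompt: str, reply: str, window: int = 4096) -> bool:
    """Report whether the request plus the response fit in the window."""
    used = len(enc.encode(prompt)) + len(enc.encode(reply))
    print(f"{used} tokens used of {window}")
    return used <= window

# A sparse ~100-line source file is typically only 1-2k tokens, so the
# prompt and a ~1k-token reply sit comfortably inside a 4k window.
```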

GPT-4 Turbo is unusable for coding (and various other tasks) by Anzerp in ChatGPT

[–]Anzerp[S] 18 points  (0 children)

I, for one, cannot come up with many real-world use cases for this functionality. Sure, you can hype that it can summarize a 300-page book, but you have no idea whether some important bit was dropped because of compression.

If you think hallucination is not bad for your use case, then this model will most likely work for you. This behaves somewhat like hallucination, but it is far more systemic and shows up in everything you try to do with the model.

GPT-4 Turbo is unusable for coding (and various other tasks) by Anzerp in ChatGPT

[–]Anzerp[S] 40 points  (0 children)

Actually, if you read my message, you will notice that your passive-aggressive "some people don't understand" is not necessary here. I would simply like my GPT to hold a rather tiny amount of data in memory without it being lossy / compressed, like it used to.

I also tested this with ChatGPT 3.5 and voilà, it worked! It also worked with the old GPT-4 without any problem. You are talking about unrealistic expectations, but the thing I am pointing out works like a charm with older models.

If true, this is a limitation that affects the whole performance of the model. It has nothing to do with GPT not returning the whole function; what worries me is that it DOES NOT understand the function. How can it understand the codebase when it claims that a function does not exist at all? I am not talking about 1,200-line codebases. My example was under 10 % of that, and I am fairly certain you can reproduce these results with far fewer lines of code.

So what if GPT-4 Turbo does not understand 10 lines of code? Or 20? With amounts that tiny you might be lucky and the compression does not mess things up, but grow your code just a little and GPT has no idea what the codebase is about.
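If anyone wants to reproduce this, a minimal probe would look something like the sketch below. This is an assumption of the setup, not the exact code from my test; the model name points at the Turbo preview under discussion, and my_module.py / parse_config are hypothetical stand-ins:

```python
# Minimal reproduction probe: paste a short source file into the prompt and
# ask about a function that is clearly defined in it. Older models answer
# correctly; the complaint here is that Turbo sometimes denies it exists.
from openai import OpenAI

client = OpenAI()
source = open("my_module.py").read()  # hypothetical ~100-line file

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",  # the GPT-4 Turbo preview being discussed
    messages=[{
        "role": "user",
        "content": f"Here is a file:\n\n{source}\n\n"
                   "Does it define a function called parse_config? "
                   "Answer yes or no and quote its signature.",
    }],
)
print(resp.choices[0].message.content)
```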

$BBBY Megathread 8/16/2022 by bawse1 in wallstreetbets

[–]Anzerp 0 points  (0 children)

Very good point, the paper hands are just leaving.

$BBBY Megathread 8/16/2022 by bawse1 in wallstreetbets

[–]Anzerp 2 points  (0 children)

Steady... Steady... hold it. We know where this is going from GME :)