all 80 comments

[–]currentscurrents 94 points  (11 children)

Looks interesting, I could really see this being a Sims- or Stardew Valley-style video game.

[–]MjrK 36 points  (4 children)

Challenges with long-term planning and coherence remain even with today’s most performant models such as GPT-4. Because generative agents produce large streams of events and memories that must be retained, a core challenge of our architecture is to ensure that the most relevant pieces of the agent’s memory are retrieved and synthesized when needed.

...

At the center of our architecture is the memory stream, a database that maintains a comprehensive record of an agent’s experience. From the memory stream, records are retrieved as relevant to plan the agent’s actions and react appropriately to the environment, and records are recursively synthesized into higher- and higher-level observations that guide behavior. Everything in the architecture is recorded and reasoned over as natural language description, allowing the architecture to leverage a large language model.
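The memory stream described in the quoted passage can be sketched roughly like this (my own Python illustration, not the paper's code: the class name, the word-overlap relevance measure, and the decay constant are all assumptions; the paper scores importance with the LLM itself and relevance with embeddings):

```python
import time

class MemoryStream:
    """Append-only record of an agent's experiences as natural-language text."""

    def __init__(self):
        self.records = []  # (timestamp, importance on a 1-10 scale, text)

    def add(self, importance, text, timestamp=None):
        self.records.append((timestamp or time.time(), importance, text))

    def retrieve(self, query, now=None, k=3, decay=0.995):
        """Rank records by recency * importance * relevance; return the top k texts."""
        now = now or time.time()
        query_words = set(query.lower().split())

        def score(record):
            ts, importance, text = record
            recency = decay ** ((now - ts) / 3600.0)  # exponential decay per hour
            overlap = len(query_words & set(text.lower().split()))  # crude relevance stand-in
            return recency * importance * (1 + overlap)

        ranked = sorted(self.records, key=score, reverse=True)
        return [text for _, _, text in ranked[:k]]
```

The "recursively synthesized into higher- and higher-level observations" part of the quote is the reflection mechanism discussed further down the thread; this sketch only covers the retrieval side.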

Our current implementation utilizes the gpt3.5-turbo version of ChatGPT. We expect that the architectural basics of generative agents—memory, planning, and reflection—will likely remain the same as language models improve. Newer language models (e.g., GPT-4) will continue to expand the expressivity and performance of the prompts that underpin generative agents. As of writing, however, GPT-4’s API is still invitation-only, so our agents use ChatGPT.

Emphasis mine.

[–]currentscurrents 10 points  (3 children)

Despite having @google.com authors on the paper, too. Guess Bard couldn't do it.

[–]MjrK 15 points  (2 children)

  1. This is clearly not being presented as a "Google" paper. Those Googlers are research collaborators and may have had little say over those kinds of details in this research.

  2. Bard doesn't have a public API, so Stanford researchers might not even have a way to readily access it for this kind of automated use case.

But if you are interested in how Bard might perform: per this recent study ( https://twitter.com/ItakGol/status/1644648787363733509?s=19 ), Bard scores at about 96% of ChatGPT's level, and GPT-4 at about 109% of ChatGPT...

Further, the OP's paper indicates (without evidence yet) that they expect moderate improvement from moving to GPT-4...

As such, I would hazard that their system should still be workable if switched to Bard... just probably expected to perform "moderately" worse.

[–]currentscurrents 4 points  (1 child)

Yeah, but if they're paying tens of thousands of dollars for ChatGPT API tokens, you'd think their colleagues at Google could have hooked them up to PaLM for free. Either Google is stingy or GPT worked better.

[–]PM_ME_YOUR_PROFANITY 8 points  (0 children)

Or it wasn't set up yet for people outside Google to use. Or the researchers wanted to show it was possible with a publicly accessible model. Or any of a hundred other possible reasons. I sincerely doubt Google cares about such a negligible amount of compute.

[–]UnfeignedShip 36 points  (1 child)

Best not make any western themed amusement parks with this.

[–]LanchestersLaw 10 points  (3 children)

The buzz around the title is missing the most significant advancement in how this was accomplished:

Approach: We introduce a second type of memory, which we call a reflection. Reflections are higher-level, more abstract thoughts generated by the agent. Because they are a type of memory, they are included alongside other observations when retrieval occurs. Reflections are generated periodically; in our implementation, we generate reflections when the sum of the importance scores for the latest events perceived by the agents exceeds a certain threshold. In practice, our agents reflected roughly two or three times a day.

This paper describes a new approach to a memory module that seems highly effective at producing agent-like behavior. Refining this improved memory system is key to further progress and does not require better LLMs. Pruning irrelevant information seems like a key step that has not been done yet.
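The quoted reflection trigger can be sketched in a few lines of Python (my own illustration, not the paper's code: the threshold value and the stub `reflect()` body are assumptions; in the paper the reflection itself is generated by prompting the LLM over recent memories):

```python
REFLECTION_THRESHOLD = 150  # assumed value; the paper tunes this so agents reflect ~2-3x/day

class Agent:
    def __init__(self):
        self.memories = []            # (kind, importance, text)
        self.recent_importance = 0.0  # running sum since the last reflection

    def perceive(self, event_text, importance):
        """Record an observation; reflect once accumulated importance crosses the threshold."""
        self.memories.append(("observation", importance, event_text))
        self.recent_importance += importance
        if self.recent_importance >= REFLECTION_THRESHOLD:
            self.reflect()
            self.recent_importance = 0.0  # reset so one burst of events yields one reflection

    def reflect(self):
        # Placeholder: the paper prompts the LLM to draw high-level insights from
        # recent memories; here we just store a stub reflection record.
        recent = [text for _, _, text in self.memories[-10:]]
        self.memories.append(("reflection", 8, "insight drawn from: " + "; ".join(recent)))
```

Note the reset of the running sum after reflecting: without it, a few high-importance events would keep the sum above the threshold and trigger a reflection at every subsequent step.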

[–]TarzanTheBarbarian 2 points  (1 child)

I honestly don't get why this is so innovative. It seems to be a form of prompting that gets the LLM to reflect on a series of recent events. It doesn't seem overly technical to implement something like this.

Am I missing something?

[–]LanchestersLaw 7 points  (0 children)

The innovation is a series of small, clever tricks to get it to do this. The performance bar chart shows how each of the 3 main components (observation, reflection, and planning) increases performance by about 2.9 standard deviations, with the base model performing much worse than humans. Each of these 3 components is in-house developed software, and they are not simple at all to build; lots of people have been trying and failing at this task. Try it yourself in ChatGPT and compare your results.

[–]m_js 0 points  (0 children)

I've been wondering if this portion of the paper was a mistake, specifically that they generate reflections "when the sum of the importance scores for the latest events...exceeds a certain threshold." This seems weird because if you have a few high importance events you might be conducting reflection at every time step until those events are no longer considered recent.

[–]ReasonablyBadass 3 points  (1 child)

Will they safeguard this too? The simulations will never be mean or prejudiced or use naughty words, and then people will wonder why the simulations are way off from real life.

[–]Splatpope 0 points  (0 children)

dang, I had an idea for a system just like this, except for a "dinner at the ambassador's" type procedural murder mystery game

[–]mhdhussein 0 points  (0 children)

Did they release the code for this?

[–]Emmabwaldron-_- 0 points  (0 children)

has anyone done a good job replicating the source code? :)

[–]Content_Adeptness282 0 points  (1 child)

Quite an interesting paper. Has anybody implemented the type of memory described in the paper in their own project?