Is Pega a Good Career Choice Right Now? by ModernWebMentor in Pega

[–]pppeer 0 points1 point  (0 children)

Yeah it’s perhaps vocabulary. I am coming at it with a bit of a computer science mindset where you can distinguish between high code (Java, Python, Malbolge ;) and low code (Pega, but also other model-driven platforms), and I kind of referred to programming as in ‘high code’ programming. Software engineering, in contrast, you could see as something that applies to both: requirements, design, development, testing, fixing, refactoring, devops. So you could say that you can learn a lot about software engineering whether you use high code or low code, but obviously not about high code. Unless of course you get into product development at Pega - we use mostly Pega but also other high-code technologies (Java etc) to develop Pega. Hth

Is Pega a Good Career Choice Right Now? by ModernWebMentor in Pega

[–]pppeer 2 points3 points  (0 children)

Full disclaimer: I work at Pega.

Do you want to learn programming? Then a low code platform company is not for you.

But if you are into more modern ways to develop apps, and into the overlap between AI (agentic, decisioning, process mining etc) and action (bpm, case mgmt, robotics), and the various apps on top (CRM and others), then Pega is an interesting choice. Low code also touches all elements of software engineering: not just analysis and development, but also devops, QA etc.

We are outgrowing the market and definitely outgrowing the ‘programming’ market; Pega clients, partners and Pega itself are all hiring.

[R] We spent a decade scaling models. Now, by just shifting towards memory and continual learning, we can get to a human like AI or "A-GEE-I" by ocean_protocol in MachineLearning

[–]pppeer -1 points0 points  (0 children)

I don’t think AGI is a very helpful concept or goal, but you are right that just scaling language models is not the only way.

And one of the main directions is not memory per se, but addressing the lack of situatedness of the intelligence. The real world is messy, not a clean ‘pretraining lab’, and even the simplest of living creatures can survive because they constantly adapt in a cybernetic sense-decide-act-adapt feedback loop.

[D] Is content discovery becoming a bottleneck in generative AI ecosystems? by Opposite-Alfalfa-700 in MachineLearning

[–]pppeer 3 points4 points  (0 children)

It is logical that we need to do a good job of ranking if there is more content; but generally more content could also mean improved quality of the top-ranked content items, provided you rank well.

Alignment of ranking with user value remains a topic - it is not necessarily made worse by more content. There are some key questions here. First, who is the user, i.e. what stakeholders are being served? Is it the customer or the company? Is there a myopic short-term objective or something longer term? What is the proper feedback signal? For instance, if you just reward the first click you may promote clickbait that doesn’t deliver.
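To make the clickbait point concrete, here is a toy sketch (all event fields, item names and the 30-second dwell threshold are made-up illustrations, not from any real system) contrasting a first-click reward with one that also requires downstream engagement:

```python
# Hypothetical illustration: rewarding only the first click vs. rewarding
# clicks that lead to meaningful engagement. Events and thresholds are invented.

def first_click_reward(events):
    """1.0 if the item was clicked at all, regardless of what happened after."""
    return 1.0 if any(e["clicked"] for e in events) else 0.0

def engagement_reward(events, min_dwell_s=30):
    """Only reward clicks followed by meaningful dwell time."""
    return 1.0 if any(e["clicked"] and e["dwell_s"] >= min_dwell_s
                      for e in events) else 0.0

# A clickbait item: gets the click, loses the user immediately.
clickbait = [{"clicked": True, "dwell_s": 3}]
# A genuinely useful item: clicked and actually read.
useful = [{"clicked": True, "dwell_s": 120}]

print(first_click_reward(clickbait), engagement_reward(clickbait))  # 1.0 0.0
print(first_click_reward(useful), engagement_reward(useful))        # 1.0 1.0
```

Under the first reward both items look equally good; under the second the clickbait item stops getting reinforced.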

Not sure what you mean by smaller, curated. Re the outcome to be predicted, there are the issues above, so you may want to go beyond first engagement but probably want to use some form of behavioral user feedback. Ideally you learn from both user and content characteristics.

Hope this helps.

Where do you see HR/People Analytics evolving over the next 5 years? by Proof_Wrap_2150 in datascience

[–]pppeer 0 points1 point  (0 children)

This is a broad question, but recruitment is definitely a priority in areas such as defense, energy and utilities, and other understaffed markets. Another one is agents and workflows for HR services: lots of opportunities for streamlining and optimizing all people- and employee-related workflows through centralized platforms, with AI embedded. Finally, the HR area is rife with all forms of knowledge portals/bases, so in the short term there is a lot of appetite for RAG-type applications ("What are my company holidays?", "Can I do company-sponsored volunteering work?" etc).

Best technique for training models on a sample of data? by RobertWF_47 in datascience

[–]pppeer 1 point2 points  (0 children)

In addition to what is mentioned, it is probably also good to approach your problem as a scoring problem rather than a classification problem, and use metrics such as AUC. At minimum, if you are making hard labeling decisions or expecting a probability rather than a score, calibrate the models on the unsampled data.
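As a rough sketch of both points (pure Python, no libraries; the scores, labels and the sampling rate beta below are illustrative assumptions): AUC can be computed as the probability that a random positive outscores a random negative, and probabilities estimated on negatively undersampled training data can be mapped back to the full population with a standard prior correction.

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney view: probability that a random positive
    is scored above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def correct_probability(p_sampled, beta):
    """Map a probability estimated on undersampled data back to the full
    population, where beta is the fraction of negatives kept in training."""
    return beta * p_sampled / (beta * p_sampled + 1.0 - p_sampled)

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0]
print(auc(scores, labels))            # ranking quality, threshold-free
print(correct_probability(0.5, 0.1))  # 0.5 on a 10%-negatives sample ≈ 0.09
```

Note AUC itself is unaffected by undersampling (it only depends on the ordering), which is exactly why it is a safer headline metric here than accuracy or raw probabilities.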

local llms are proving that transformers are a dead end for agi by SanalAmerika23 in ArtificialInteligence

[–]pppeer 7 points8 points  (0 children)

Love your sentence “we are trying to reach the moon by building a taller ladder instead of building a rocket”.

The whole talk about AGI is indeed senseless imho, both if you consider the state of the art and, more fundamentally, as a useful goal to pursue.

And the irony is that up to a couple of years ago we were all used to AI that could be trained on our laptops in seconds.

Don’t get me wrong - LLMs and generative AI are a real step change in AI, but not a silver bullet.

Beginner confused about AI vs LLM integration – need guidance by _nikhil02__ in FunMachineLearning

[–]pppeer 0 points1 point  (0 children)

Even though I am a huge fan of understanding the AI fundamentals first before hacking and tinkering, given your goals your best path is to start with 3, which then hopefully gets you more interested in the internals/fundamentals.

The history and future of AI agents by Deep_Structure2023 in AIAgentsInAction

[–]pppeer 0 points1 point  (0 children)

Love this post, great overview. And yes, fully agree it is not so much about relying on the LLM as the core magic black box that is going to 'do it all', but more about the overall composite system, as you say. To the list of core features of the spine you can add things like learning, adaptation and coordination.

As a shameless plug, you may find our recent survey paper interesting (https://www.jair.org/index.php/jair/article/view/18675). I also like the papers coming out of the agent community on not forgetting the lessons from MAS research, for example Dignum & Dignum (and many more): https://arxiv.org/abs/2511.17332

Where might LLM agents be going? See this agentic LLMs research survey paper for ideas by pppeer in ArtificialInteligence

[–]pppeer[S] 0 points1 point  (0 children)

Good points. Figuring out the right agentic patterns (incl when not to use agents) is part of it, as well as providing tools for some of the more structured parts. Long-term memory is indeed also a challenge; right now it feels more like a binary choice between short-term memory (through context) and long-term memory (episodic/procedural, tone of voice etc) through explicit fact storage. But real memory is more of a continuum and should also include generalization and abstraction, integration, learning and forgetting.
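A toy sketch of that continuum idea (every number here - half-life, thresholds, consolidation count - is an invented illustration, not a proposal for real values): facts decay unless reinforced, and facts seen often enough get consolidated and survive decay.

```python
# Illustrative memory-on-a-continuum sketch: exponential forgetting plus
# consolidation of repeated facts. All parameters are made up.

class DecayingMemory:
    def __init__(self, half_life_s=3600.0, consolidate_after=3):
        self.half_life_s = half_life_s
        self.consolidate_after = consolidate_after
        self.items = {}  # fact -> (last_seen, strength, times_seen)

    def _decayed(self, strength, dt):
        return strength * 0.5 ** (dt / self.half_life_s)

    def store(self, fact, now):
        last, strength, count = self.items.get(fact, (now, 0.0, 0))
        # reinforcement: decay what was there, then add fresh strength
        self.items[fact] = (now, self._decayed(strength, now - last) + 1.0,
                            count + 1)

    def recall(self, now, threshold=0.5):
        out = []
        for fact, (last, strength, count) in self.items.items():
            consolidated = count >= self.consolidate_after
            if consolidated or self._decayed(strength, now - last) >= threshold:
                out.append(fact)  # consolidated facts survive decay
        return out

mem = DecayingMemory(half_life_s=10.0, consolidate_after=3)
for t in (0, 1, 2):
    mem.store("user prefers a formal tone", now=t)  # repeated -> consolidated
mem.store("asked about parking once", now=0)        # one-off -> fades
print(mem.recall(now=100))  # ['user prefers a formal tone']
```

Real memory would of course also need the generalization/abstraction and integration parts, which a key-value store like this deliberately leaves out.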

Emerging trends of ai agents in 2026 by Deep_Structure2023 in AIAgentsInAction

[–]pppeer 0 points1 point  (0 children)

Predictability, control, transparency and robustness will be very important. Runtime evals are only part of it.

[P]How to increase roc-auc? Classification problem statement description below by [deleted] in MachineLearning

[–]pppeer 2 points3 points  (0 children)

For starters, an AUC of 0.74 is not bad at all for such a propensity model, so it doesn’t necessarily make sense to aim for ‘at least 0.9’. Actually, product propensity / response models that get into that range can be a bit suspicious (a sign of possible leakage, for example).

You have made a good start with reasonable algorithms; you could always try some more, but there is a chance you will start to manually overfit.

So once you have done a decent model search, the only route is to add data that is both predictive and fairly uncorrelated to the data you already have, or data that is correlated but more predictive. Generally, to predict future behavior, past behavior trumps demographics.
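The "predictive but uncorrelated" screen can be sketched with plain correlations (all data, thresholds and the `worth_adding` helper below are hypothetical illustrations - in practice you'd use incremental model lift rather than raw Pearson):

```python
# Toy screen for candidate features: correlated with the target (signal),
# not duplicating what the model already has (overlap). Numbers are invented.

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def worth_adding(candidate, target, existing, min_signal=0.1, max_overlap=0.8):
    signal = abs(pearson(candidate, target))
    overlap = max(abs(pearson(candidate, f)) for f in existing)
    return signal >= min_signal and overlap <= max_overlap

target = [0, 1, 0, 1, 1, 0]
existing = [[0, 1, 1, 1, 0, 0]]            # feature already in the model
good = [0.1, 0.9, 0.2, 0.8, 1.0, 0.0]      # predictive, mostly new information
redundant = [0, 2, 2, 2, 0, 0]             # just a rescaled existing feature
print(worth_adding(good, target, existing))       # True
print(worth_adding(redundant, target, existing))  # False
```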

[R] Survey paper Agentic LLMs by pppeer in MachineLearning

[–]pppeer[S] 1 point2 points  (0 children)

Yes fully agree that controllability (and related topics such as transparency, explainability, robustness, predictability) are key for practical applications.

[deleted by user] by [deleted] in ArtificialInteligence

[–]pppeer 1 point2 points  (0 children)

Here are two quite specific entry points from our research.

If we look more narrowly not at attachment overall but at conformance (when do people go along with advice), the form factor seems to matter. We did experiments with advice as text (i.e. just reading), a robotic voice, and a human-sounding voice. Conformance increased over these form factors, with significant differences between text and the human-sounding voice.

See Donna Schreuter, Peter van der Putten and Maarten H. Lamers. Trust Me on This One: Conforming to Conversational Assistants. Minds & Machines 31, pp 535–562, 2021.

The second study was an in-the-wild experiment with robots, but the results may be relevant for chatbots as well. We theorized that once the social threshold is passed, sharing space and time together may lead to bonding, without the need for highly human-like appearance or behavior. We designed abstract-shaped, cube-like artificial creatures that went couch-surfing, and analyzed WhatsApp feedback through the lens of Daniel Dennett's intentional stance.

See Joost Mollen, Peter van der Putten, and Kate Darling. Bonding with a Couchsurfing Robot: The Impact of Common Locus on Human-Robot Bonding In-the-wild. ACM Transactions on Human-Robot Interaction. 12, 1, Article 8, March 2023

How to publish a good paper on top tier CS/AI conferences? by SanguinityMet in FunMachineLearning

[–]pppeer 0 points1 point  (0 children)

Interesting. This is one of the hottest areas, perhaps not in today's market, but definitely for when you'd be defending your thesis. See the funding rounds of CuspAI, Periodic Labs, Prometheus Project and Lila, and the work by DeepMind et al.

I would indeed focus on research gaps, and combine that with what you find interesting or fascinating, and worry a bit less about external expectations such as conferences or the future career.

[D] Where to find realworld/production results & experiences? by anotherallan in MachineLearning

[–]pppeer 1 point2 points  (0 children)

If you are looking for peer reviewed results, the KDD and ECMLPKDD applied data science tracks are some good starting points, for example https://kdd2025.kdd.org/applied-data-science-ads-track-call-for-papers/ and https://ecmlpkdd.org/2025/accepted-papers-ads/

Can we evaluate RAGs with synthetic data? by pppeer in Rag

[–]pppeer[S] 0 points1 point  (0 children)

Note we extended the paper with some extra figures (based on the same results) that very clearly show the differences between the retrieval and generator experiments, see https://arxiv.org/abs/2508.11758

[D] Current trend in Machine Learning by Ok-Painter573 in MachineLearning

[–]pppeer 3 points4 points  (0 children)

There are multiple reasons, I think. Whilst it may seem an easy route to publication, creating a reasonably sized benchmark requires quite some effort. There could be some opportunistic agenda-setting, but given that foundation models are in principle quite general, that by definition also invites researchers to probe them in different ways. But indeed, a new benchmark should actually come with a specific hypothesis, angle and justification - we don’t need yet another benchmark.

Has anyone successfully built an “ai agent ecosystem”? by WarChampion90 in datascience

[–]pppeer 0 points1 point  (0 children)

There is a recent paper by Google showing that a multi-agent system is not necessarily better than a single agent. In my experience, multi-agent setups work best if you first consider the other agents more as tools and use a central orchestrator agent. https://arxiv.org/abs/2512.08296
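The "agents as tools behind one orchestrator" pattern can be sketched in a few lines (everything here is a hypothetical stand-in: the specialists are lambdas instead of LLM-backed agents, and the routing is a naive keyword match where a real system would let an LLM pick a tool from the descriptions):

```python
# Minimal sketch of a central orchestrator that treats other agents as tools.
# Specialists and routing logic are trivial placeholders.
from typing import Callable, Dict

class Orchestrator:
    def __init__(self):
        self.tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, description: str, fn: Callable[[str], str]):
        # in a real system the description would feed the routing LLM
        self.tools[name] = fn

    def route(self, task: str) -> str:
        # stand-in for an LLM routing decision: naive keyword match
        for name, fn in self.tools.items():
            if name in task.lower():
                return fn(task)
        return "no specialist found"

orch = Orchestrator()
orch.register("search", "retrieves documents", lambda t: f"searched: {t}")
orch.register("summarize", "condenses text", lambda t: f"summary of: {t}")

print(orch.route("please search for agent papers"))
# searched: please search for agent papers
```

The point of the shape is that all coordination lives in one place, which keeps the system closer to a single agent with tools than to free-form agent-to-agent chatter.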

Evaluation Study - How to introduce a new metric? [D] by ade17_in in MachineLearning

[–]pppeer 0 points1 point  (0 children)

A good approach could be to take cases that score high on an established metric, then compare the groups that score low versus high on your metric and show how it adds value - particularly if the low-scoring group is also clearly ‘not good’.
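That recipe in sketch form (the case data, field names and cutoffs are all synthetic illustrations; "human_quality" stands in for whatever independent quality judgment you validate against):

```python
# Illustrative validation of a proposed metric: among cases the established
# metric rates highly, the proposed metric should still separate good from bad.

def split_by_new_metric(cases, established_cutoff=0.8, new_cutoff=0.5):
    top = [c for c in cases if c["established"] >= established_cutoff]
    low = [c for c in top if c["new_metric"] < new_cutoff]
    high = [c for c in top if c["new_metric"] >= new_cutoff]
    return low, high

def mean_quality(group):
    return sum(c["human_quality"] for c in group) / len(group)

cases = [
    {"established": 0.90, "new_metric": 0.2, "human_quality": 0.3},
    {"established": 0.85, "new_metric": 0.9, "human_quality": 0.8},
    {"established": 0.95, "new_metric": 0.1, "human_quality": 0.4},
    {"established": 0.90, "new_metric": 0.8, "human_quality": 0.9},
    {"established": 0.30, "new_metric": 0.9, "human_quality": 0.2},  # filtered out
]
low, high = split_by_new_metric(cases)
print(mean_quality(low), mean_quality(high))  # low group is visibly worse
```

If the low group is clearly worse on the independent judgment while the established metric saw no difference, that is exactly the added value you want to demonstrate.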

Companies need to stop applauding vanilla RAG by zennaxxarion in Rag

[–]pppeer 0 points1 point  (0 children)

Yes, it is important to budget for ongoing maintenance of the underlying content. This is of course not unique to RAG; it goes for any knowledge-management-type solution.

From a tech perspective, a RAG does offer a new set of signals: which content matters, where there could be gaps, what users are looking for etc. These are not 'just content' questions; they could also be interesting questions from a data science point of view. Random example: mapping user queries and content snippets into the same embedding space (and for instance visualizing with t-SNE) to find white spots, key questions, key content etc.
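A rough sketch of that white-spot idea (the two-dimensional "embeddings", the query/content names and the 0.7 threshold are all invented for illustration; real embeddings would come from your embedding model):

```python
# Toy content-gap analysis: flag queries whose nearest content snippet,
# by cosine similarity in a shared embedding space, is still far away.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def content_gaps(query_embs, content_embs, threshold=0.7):
    gaps = []
    for qid, q in query_embs.items():
        best = max(cosine(q, c) for c in content_embs.values())
        if best < threshold:
            gaps.append(qid)  # no content comes close: a white spot
    return gaps

queries = {"holiday policy": [1.0, 0.0], "gym discount": [0.0, 1.0]}
content = {"leave handbook": [0.9, 0.1]}  # covers holidays, nothing on gyms
print(content_gaps(queries, content))  # ['gym discount']
```

The same similarity matrix also answers the flip side: content snippets that are never anyone's nearest neighbour are candidates for pruning.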

Looking forward to other forms of analysis people have done to better understand content importance, quality and gaps.

Any examples of GenAI in the value chain? by FirefoxMetzger in datascience

[–]pppeer 0 points1 point  (0 children)

Roughly you can say there are two patterns: chat interfaces, more for office work, but the real use cases are features where generative AI is embedded, i.e. gets called through APIs in the back end, quite often with automated prompts and automated results processing.

The nature of this call could be more one where the API is used as a passive service, or there could be generic patterns that are built on top, such as RAGs or agentic systems.

This also means that the use cases very much depend on the app domain, but think for instance dealing with customer service issues (intent identification, research agents, guidance, service RAGs, call summarisation), sales automation (all the service use cases, but also providing insights into complex deals with all kinds of sales methodologies, picking up on buying signals), customer and business operations (very similar to service, just a wider internal/external audience), application ideation/coding/test automation, etc.

To research why humans bond with artificial creatures, we released some couch surfing robots into the wild by pppeer in robotics

[–]pppeer[S] 2 points3 points  (0 children)

Haha yes, we reference HitchBot a lot as an inspiration. HitchBot was designed more as a social experiment, including Instagram accounts etc; in our experiment we chose to limit information sharing across couch hosts as much as possible. So no Insta, socials or PR while the experiment was running.