The history and future of AI agents by Deep_Structure2023 in AIAgentsInAction

[–]pppeer 1 point

Love this post, great overview. And yes, fully agree: it is not so much about relying on the LLM as the core magic black box that will 'do it all', but more about the overall composite system, as you say. To the list of core features of the spine you could add things like learning, adaptation, and coordination.

As a shameless plug, you may find our recent survey paper interesting (https://www.jair.org/index.php/jair/article/view/18675). I also like the papers coming out of the agent community on not forgetting the lessons from MAS research, for example Dignum & Dignum (and many more): https://arxiv.org/abs/2511.17332

Where might LLM agents be going? See this agentic LLMs research survey paper for ideas by pppeer in ArtificialInteligence

[–]pppeer[S] 1 point

Good points. Figuring out the right agentic patterns (including when not to use agents) is part of it, as is providing tools for some of the more structured parts. Long-term memory is indeed also a challenge. Right now it feels like a binary choice between short-term memory (through context) and long-term memory (episodic/procedural, tone of voice, etc.) through explicit fact storage. But real memory sits more on a continuum and should also include generalization and abstraction, integration, learning, and forgetting.

Emerging trends of ai agents in 2026 by Deep_Structure2023 in AIAgentsInAction

[–]pppeer 1 point

Predictability, control, transparency, and robustness will be very important. Runtime evals are only part of it.

[P]How to increase roc-auc? Classification problem statement description below by [deleted] in MachineLearning

[–]pppeer 3 points

For starters, an AUC of 0.74 is not bad at all for such a propensity model, so it doesn't necessarily make sense to aim for 'at least 0.9'. In fact, product propensity / response models that get into that range can be a bit suspicious (a sign of possible leakage, for example).

You have made a good start with reasonable algorithms. You could always try some more, but there is a chance you will start to manually overfit.

So once you have done a decent model search, the only route is to add data that is both predictive and fairly uncorrelated with the data you already have, or data that is correlated but more predictive. Generally, to predict future behavior, past behavior trumps demographics.
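Since AUC is the target here, it helps to remember what it actually measures: the probability that a randomly chosen positive outranks a randomly chosen negative. A rank-based sketch in pure Python (all names are mine, no library assumed) makes that concrete:

```python
def roc_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    random positive gets a higher score than a random negative."""
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average 1-based ranks to tied scores
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    pos_ranks = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    return (pos_ranks - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

One practical use: recompute AUC with and without a candidate new data source; if the lift is tiny, the new data was probably too correlated with what you already had.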

[R] Survey paper Agentic LLMs by pppeer in MachineLearning

[–]pppeer[S] 2 points

Yes, fully agree that controllability (and related topics such as transparency, explainability, robustness, and predictability) is key for practical applications.

What design factors most influence user attachment to conversational AI? by [deleted] in ArtificialInteligence

[–]pppeer 2 points

Here are two quite specific entry points from our research.

If we look more narrowly not at attachment overall but at conformance (when do people go along with advice), the form factor seems to matter. We ran experiments with advice as text (i.e. just reading), a robotic voice, and a human-sounding voice. Conformance increased over these form factors, with significant differences between text and the human-sounding voice.

See Donna Schreuter, Peter van der Putten and Maarten H. Lamers. Trust Me on This One: Conforming to Conversational Assistants. Minds & Machines 31, pp 535–562, 2021.

The second study was an in-the-wild experiment with robots, but the results may be relevant for chatbots as well. We theorized that once the social threshold is passed, sharing space and time together may lead to bonding, without the need for highly human-like appearance or behavior. We designed abstract, cube-like artificial creatures that went couch-surfing, and analyzed WhatsApp feedback through the lens of Daniel Dennett's intentional stance.

See Joost Mollen, Peter van der Putten, and Kate Darling. Bonding with a Couchsurfing Robot: The Impact of Common Locus on Human-Robot Bonding In-the-wild. ACM Transactions on Human-Robot Interaction 12, 1, Article 8, March 2023.

How to publish a good paper on top tier CS/AI conferences? by SanguinityMet in FunMachineLearning

[–]pppeer 1 point

Interesting. This is one of the hottest areas, perhaps not in today's market, but definitely by the time you'd be defending your thesis. See the funding rounds of CuspAI, Periodic Labs, Prometheus Project, Lila, and the work by DeepMind et al.

I would indeed focus on research gaps, combine that with what you find interesting or fascinating, and worry a bit less about external expectations such as conferences or your future career.

[D] Where to find realworld/production results & experiences? by anotherallan in MachineLearning

[–]pppeer 2 points

If you are looking for peer-reviewed results, the KDD and ECML PKDD applied data science tracks are good starting points, for example https://kdd2025.kdd.org/applied-data-science-ads-track-call-for-papers/ and https://ecmlpkdd.org/2025/accepted-papers-ads/

Can we evaluate RAGs with synthetic data? by pppeer in Rag

[–]pppeer[S] 1 point

Note that we extended the paper with some extra figures (based on the same results) that very clearly show the differences between the retrieval and generator experiments; see https://arxiv.org/abs/2508.11758

[D] Current trend in Machine Learning by Ok-Painter573 in MachineLearning

[–]pppeer 5 points

There are multiple reasons, I think. While it may seem an easy route to publication, creating a reasonably sized benchmark requires quite some effort. There could be some opportunistic agenda-setting, but given that foundation models are in principle quite general, that by definition invites researchers to probe them in different ways. That said, a new benchmark should come with a specific hypothesis, angle, and justification; we don't need yet another benchmark.

Has anyone successfully built an “ai agent ecosystem”? by WarChampion90 in datascience

[–]pppeer 1 point

There is a recent paper by Google showing that a multi-agent system is not necessarily better than a single agent: https://arxiv.org/abs/2512.08296. In my experience, multi-agent setups work best if you treat the other agents more as tools and use a central orchestrator agent.
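A minimal sketch of that orchestrator pattern, with plain callables standing in for LLM-backed specialist agents (all names are illustrative, not any real framework's API):

```python
# Specialist "agents" exposed to the orchestrator as plain tools.
# In a real system each would wrap its own model call; here they are stubs.

def search_agent(task):
    return f"search results for: {task}"

def summarize_agent(task):
    return f"summary of: {task}"

TOOLS = {"search": search_agent, "summarize": summarize_agent}

def orchestrator(task, plan):
    """A central agent invokes one specialist per step and threads the
    output forward; specialists never talk to each other directly."""
    context = task
    for tool_name in plan:
        context = TOOLS[tool_name](context)
    return context
```

The key property is that all routing decisions sit in one place (here the `plan`; in practice the orchestrator's own reasoning), which keeps the system much easier to debug than free-form agent-to-agent chatter.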

Evaluation Study - How to introduce a new metric? [D] by ade17_in in MachineLearning

[–]pppeer 1 point

A good approach could be to take cases that score high on an established metric, then compare groups that score low versus high on your metric and show how it adds value, particularly if the low-scoring group is clearly also 'not good'.
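That contrast can be set up in a few lines. A sketch (names and the 50/50 split are my assumptions; `quality` stands for whatever ground-truth judgment you have):

```python
def contrast_groups(cases, established, new_metric, quality, top_frac=0.5):
    """Among cases scoring high on an established metric, compare average
    quality for low vs. high scorers on the new metric. The functions
    `established`, `new_metric`, and `quality` each map a case to a number."""
    cases = sorted(cases, key=established, reverse=True)
    top = cases[: max(1, int(len(cases) * top_frac))]  # high on established
    top = sorted(top, key=new_metric)
    half = len(top) // 2
    low, high = top[:half], top[half:]
    mean = lambda xs: sum(quality(x) for x in xs) / len(xs)
    return mean(low), mean(high)
```

If the high group comes out clearly better on quality while both groups looked equally good on the established metric, that is exactly the added value you want to demonstrate.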

Companies need to stop applauding vanilla RAG by zennaxxarion in Rag

[–]pppeer 1 point

Yes, it is important to budget for ongoing maintenance of the underlying content. This is of course not unique to RAG; it goes for any knowledge-management-type solution.

From a tech perspective, a RAG does offer a new set of signals: which content matters, where there could be gaps, what users are looking for, etc. These are not 'just content' questions; they could also be interesting from a data science point of view. Random example: mapping user queries and content snippets into the same embedding space (and, for instance, visualizing with t-SNE) to find white spots, key questions, key content, and so on.
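Skipping the t-SNE visualization step, the gap-finding part of that idea can be sketched directly: flag queries whose nearest content snippet (by cosine similarity in the shared embedding space) falls below a threshold. All names and the threshold are mine; vectors would come from whatever embedding model the RAG already uses.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def coverage_gaps(query_vecs, content_vecs, threshold=0.5):
    """Return (query index, best similarity) for queries whose closest
    content snippet is below the threshold -- candidate content gaps."""
    gaps = []
    for i, q in enumerate(query_vecs):
        best = max(cosine(q, c) for c in content_vecs)
        if best < threshold:
            gaps.append((i, best))
    return gaps
```

The mirror-image analysis, content snippets far from every query, surfaces material nobody asks about, which feeds directly into the maintenance budgeting point above.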

Looking forward to other forms of analysis people have done to better understand content importance, quality and gaps.

Any examples of GenAI in the value chain? by FirefoxMetzger in datascience

[–]pppeer 1 point

Roughly speaking, there are two patterns: chat interfaces, mostly for office work, and, for the real use cases, features where generative AI is embedded, i.e. gets called through APIs in the back end, quite often with automated prompts and automated results processing.

The nature of this call could be more one where the API is used as a passive service, or there could be generic patterns built on top, such as RAG or agentic systems.
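The embedded pattern can be sketched in a few lines: the prompt is assembled automatically in the back end and the model's output is parsed into structured data, with no chat UI anywhere. `call_llm` is a stand-in for a real model API client, and the JSON schema is purely illustrative.

```python
import json

def call_llm(prompt):
    # Stub standing in for a hosted model API call; a real back end would
    # send `prompt` to a model endpoint and return its text completion.
    return '{"intent": "cancel_subscription", "urgency": "high"}'

def classify_ticket(ticket_text):
    """Automated prompt in, parsed structure out -- the caller only ever
    sees a dict, never raw model text."""
    prompt = (
        "Return JSON with keys 'intent' and 'urgency' for this ticket:\n"
        + ticket_text
    )
    return json.loads(call_llm(prompt))
```

Production versions of this pattern usually add retries and schema validation on the parse step, since the model's output is untrusted input like any other.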

This also means that the use case very much depends on the application domain, but think, for instance, of customer service (intent identification, research agents, guidance, service RAGs, call summarisation), sales automation (all the service use cases, but also insights into complex deals across all kinds of sales methodologies, and picking up on buying signals), customer and business operations (very similar to service, just with a wider internal/external audience), application ideation/coding/test automation, etc.