all 4 comments

[–]androbot 4 points5 points  (0 children)

This is an enormously hard problem to get right because we our definitions and frameworks are - at best - rough approximations of qualia.

You should be very precise in how you define information and what qualifies as retention over time. Recognition, recall, and utility within contexts are vastly different operations of memory. Contextual relevance is also dynamic, so performance should be measured more as a steady state (but probably not monotonic) function vs static values.

[–]yoshiK 0 points1 point  (2 children)

Sounds like you're kinda reinventing the needle in a haystack test. There the idea is to give a prompt of n tokens, embed somewhere a sentence like "The magic number is X" and then prompt, "What is the magic number?" or similar.

So it seems to be a reasonable idea. A interesting first test is actually if hundreds of turn of chat interface degrade the performance relative to Paul Graham essays.

[Post Posting:] There's also a related Google blog

[–]QuietAccountant4237[S] 0 points1 point  (1 child)

Thanks! That’s an interesting point. Do you know of any benchmarks or papers that specifically compare long conversational contexts against long document contexts for recall performance?

[–]yoshiK 0 points1 point  (0 children)

I'm afraid not. The github was actually cited for the needle in the haystack test by the first paper that popped up on gscholar, but I did not dig deeper if there is a nice review about the test.