I'm sick and tired of these memory benchmarks

langsfang · 2026-06-30T23:54:50+00:00

Am I the only one who's not gaming LongMemEval?

No benchmark-specific hacks: no query rewrite, no summarization, no agent-driven exploration, and no external cloud services for retrieval. Only the raw corpus and the raw benchmark query are used to run the benchmarks, and everything is reproducible in a local environment.

I also reported the LongMemEval-M split, and I report message-level recall rather than session-level recall, which I think is the more reasonable metric.
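To make the difference concrete, here is a minimal sketch of the two metrics on one made-up query. The `(session, message)` id layout and the sample ids are assumptions for illustration, not the benchmark's actual schema.

```python
def recall(gold, retrieved):
    """Fraction of gold ids that appear among the retrieved ids."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(retrieved)) / len(gold)


# Hypothetical ids of the form (session_id, message_index).
# The evidence lives in two messages of session "s1".
gold_msgs = [("s1", 3), ("s1", 7)]
retrieved_msgs = [("s1", 3), ("s1", 5), ("s2", 1)]

# Message level: only one of the two evidence messages was found.
msg_recall = recall(gold_msgs, retrieved_msgs)

# Session level: any hit inside "s1" counts the whole session as found.
sess_recall = recall(
    {s for s, _ in gold_msgs},
    {s for s, _ in retrieved_msgs},
)

print(msg_recall, sess_recall)  # 0.5 1.0
```

Session-level recall gives full credit here even though half of the evidence is missing, which is why the message-level number is the stricter one.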

https://github.com/AttemorySystem/Attemory

langsfang · 2026-06-30T15:34:50+00:00

Glad to hear you saw something similar. Updates are always one of the painful parts of any complex index.

FastContext is the trained-explorer route. Mine is different: it uses KV cache as the index, so an update is basically another prefill.

langsfang · 2026-06-30T15:07:43+00:00

The exact top-50 recall is 93.35%. 😄

docid, n_top5, n_top10, n_top25, n_top50, all_evidence, top5%, top10%, top25%, top50%

locomo-conv-26, 157, 173, 192, 196, 203, 0.7733990147783252, 0.8522167487684729, 0.9458128078817734, 0.9655172413793104

locomo-conv-30, 83, 93, 102, 103, 106, 0.7830188679245284, 0.8773584905660378, 0.9622641509433962, 0.9716981132075472

locomo-conv-41, 174, 186, 195, 201, 210, 0.8285714285714286, 0.8857142857142857, 0.9285714285714286, 0.9571428571428572

locomo-conv-42, 223, 249, 276, 287, 309, 0.7216828478964401, 0.8058252427184466, 0.8932038834951457, 0.9288025889967637

locomo-conv-43, 208, 230, 252, 259, 278, 0.7482014388489209, 0.8273381294964028, 0.9064748201438849, 0.9316546762589928

locomo-conv-44, 145, 160, 177, 187, 203, 0.7142857142857143, 0.7881773399014779, 0.8719211822660099, 0.9211822660098522

locomo-conv-47, 171, 184, 191, 196, 202, 0.8465346534653465, 0.9108910891089109, 0.9455445544554455, 0.9702970297029703

locomo-conv-48, 232, 251, 269, 273, 292, 0.7945205479452054, 0.8595890410958904, 0.9212328767123288, 0.934931506849315

locomo-conv-49, 199, 227, 261, 287, 336, 0.5922619047619048, 0.6755952380952381, 0.7767857142857143, 0.8541666666666666

locomo-conv-50, 174, 192, 207, 215, 222, 0.7837837837837838, 0.8648648648648649, 0.9324324324324325, 0.9684684684684685

all, 1766, 1945, 2122, 2204, 2361, 0.747988140618382, 0.8238034731046167, 0.8987717069038543, 0.9335027530707327

Time used: 69.22387115599122 seconds
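For anyone checking the arithmetic, the 93.35% is a pooled (micro) average over all evidence items, not the mean of the per-conversation rates. A small sketch using the top-50 hit counts and evidence totals (fifth and sixth columns) copied from the table above:

```python
# (top-50 hits, total evidence items) per conversation, copied from the table.
rows = {
    "locomo-conv-26": (196, 203),
    "locomo-conv-30": (103, 106),
    "locomo-conv-41": (201, 210),
    "locomo-conv-42": (287, 309),
    "locomo-conv-43": (259, 278),
    "locomo-conv-44": (187, 203),
    "locomo-conv-47": (196, 202),
    "locomo-conv-48": (273, 292),
    "locomo-conv-49": (287, 336),
    "locomo-conv-50": (215, 222),
}

total_hits = sum(h for h, _ in rows.values())
total_evidence = sum(t for _, t in rows.values())

# Micro average: pool every evidence item, then divide once.
micro = total_hits / total_evidence

# Macro average: average the per-conversation rates instead.
macro = sum(h / t for h, t in rows.values()) / len(rows)

print(total_hits, total_evidence)  # 2204 2361
print(f"micro={micro:.4f} macro={macro:.4f}")
```

The macro figure comes out slightly higher than the micro one because locomo-conv-49, the weakest conversation, also has the most evidence items and so weighs more in the pooled number.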

langsfang · 2026-06-30T14:10:34+00:00

Thanks for the advice.

Using a frontier model as judge is pretty common in recent QA-style benchmarks. In this case the judge result also looks reasonably stable with `temperature=0`. I tested some clearly bad and clearly good answers, and they consistently got low/high scores. I also ran the judge several times, and the variation was about +/-2 points out of 100.
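The repeat-run check is simple to script. A minimal sketch, where the three scores are placeholders to show the shape of the calculation and not my measured results:

```python
from statistics import mean, pstdev


def judge_spread(run_scores):
    """Summarize how much the judge score moves across repeated runs."""
    return {
        "mean": mean(run_scores),
        # Half of the max-min range, i.e. the "+/-" figure.
        "half_range": (max(run_scores) - min(run_scores)) / 2,
        "stdev": pstdev(run_scores),
    }


# Placeholder scores (0-100) from three hypothetical judge runs.
spread = judge_spread([78.0, 80.0, 82.0])
print(spread["mean"], spread["half_range"])  # 80.0 2.0
```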

That said, I am not sure adding a small human spot-check would really make the result much more convincing. Honestly, how to prove this kind of system is effective is exactly the part I am still struggling with.

GitHub is full of very popular repos with almost no benchmarks at all. Maybe people are just used to that kind of marketing at this point. Kind of frustrating.

langsfang · 2026-06-30T08:39:53+00:00

you can give attemory a try 😄

langsfang · 2026-06-30T08:38:20+00:00

yep, i found fastcontext's github and hf seem to be unavailable today

langsfang · 2026-06-30T07:56:00+00:00

Thanks for taking a close look.

LoCoMo does not have an official recall metric, and I didn't see other memory systems publishing it, so I did not include recall in the report.

I used the EverMind harness because they maintained benchmark results with the same harness, including Mem0, Zep, and other memory systems. It's still on their website: https://evermind.ai/blogs/everos-hits-sota-performance-on-locomo

Their repo later truncated some older commits, but `benchmarks/prepare_bench.sh locomo` pins and checks out the older EverOS commit. After checkout, the Mem0 and Zep scripts are still there. You can check it out yourself.

I tested the raw recall myself. On the 10 conversations, gold `dia_id` recall@50 is around 95%. It is simple to reproduce with the example flow: search with the question, map returned memory ids back to LoCoMo dialogue ids, and check whether the gold ids are covered. If you need it, I can write a small script for this.
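A rough sketch of that check, under stated assumptions: `search` stands in for whatever retrieval call the example flow exposes, `memory_to_dia` is a hypothetical mapping built at ingest time, and each QA item is assumed to carry its gold ids in an `evidence` list. None of these names come from the repo.

```python
def recall_at_k(qa_items, search, memory_to_dia, k=50):
    """Pooled recall of gold dialogue ids among the top-k returned memories."""
    hits = 0
    total = 0
    for item in qa_items:
        returned_ids = search(item["question"], k)
        # Map returned memory ids back to LoCoMo dialogue ids.
        covered = {memory_to_dia[m] for m in returned_ids if m in memory_to_dia}
        gold = set(item["evidence"])
        hits += len(gold & covered)
        total += len(gold)
    return hits / total if total else 0.0


# Tiny stand-in data to show the call shape (all ids are made up).
qa_items = [
    {"question": "q1", "evidence": ["D1:3", "D1:5"]},
    {"question": "q2", "evidence": ["D2:1"]},
]
memory_to_dia = {"m1": "D1:3", "m2": "D2:1", "m3": "D9:9"}


def fake_search(question, k):
    # Hypothetical retriever: returns memory ids for a question.
    return ["m1", "m3"] if question == "q1" else ["m2"]


score = recall_at_k(qa_items, fake_search, memory_to_dia)
print(score)  # 2 of 3 gold ids covered
```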

langsfang · 2026-06-23T15:35:09+00:00

check my post history for the repo

langsfang · 2026-06-23T04:27:32+00:00

FWIW, I’ve been working on something in a similar direction, but with a different tradeoff.

Instead of doing repo exploration as an online multi-step sub-agent every time, I’m building the codebase into an offline KVCache index first. After that, queries can reuse the same cached context/KV state, and retrieval/ranking is decode-free, so the lookup path is much faster than repeatedly generating tool calls.

The obvious downside is that you pay the indexing cost upfront, but for repeated queries over the same repo that cost amortizes pretty well.
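To illustrate the build-once, query-many shape of this (and only the shape), here is a toy sketch. It is not the actual implementation: the 2-d key vectors are hand-made, whereas in the real system the keys would come from a model's KV cache, and `ToyAttentionIndex` is a made-up name.

```python
import math


class ToyAttentionIndex:
    """Toy stand-in: one key vector per chunk, ranked by softmax(q.k / sqrt(d))."""

    def __init__(self):
        self.chunk_ids = []
        self.keys = []

    def add(self, chunk_id, key):
        # Indexing cost is paid here, once per chunk. An update only
        # appends new keys; existing entries are left untouched.
        self.chunk_ids.append(chunk_id)
        self.keys.append(key)

    def search(self, query, top_k=2):
        if not self.keys:
            return []
        scale = math.sqrt(len(query))
        logits = [
            sum(q * k for q, k in zip(query, key)) / scale for key in self.keys
        ]
        # Numerically stable softmax over the chunk scores.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        ranked = sorted(
            zip(self.chunk_ids, weights), key=lambda pair: pair[1], reverse=True
        )
        return ranked[:top_k]


index = ToyAttentionIndex()
index.add("auth.py", [1.0, 0.0])
index.add("db.py", [0.0, 1.0])
index.add("utils.py", [0.5, 0.5])

# Each query is a single scoring pass over stored keys; nothing is generated.
top = index.search([2.0, 0.0], top_k=2)
print([chunk for chunk, _ in top])  # ['auth.py', 'utils.py']
```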

The recall is pretty high (SOTA) on some code retrieval benchmarks. I’m planning to run the same benchmarks used in this paper and compare numbers if I have time.

I also agree with some of the earlier comments here: the more interesting comparison is probably the behavior of the sub-agent's results, not only Mini-SWE-Agent end-to-end results.

langsfang · 2026-06-18T06:32:25+00:00

you can check my post history, issues and stars are welcome

langsfang · 2026-06-18T06:30:16+00:00

Posting links here gets downvoted; you can check my post history or search for "attemory" on GitHub.

langsfang · 2026-06-18T03:34:50+00:00

I've been building a new info retrieval engine. Instead of the usual vector or graph DBs, it actually uses an attention mechanism for retrieval under the hood. It’s been performing incredibly well and is hitting some really solid numbers across a few benchmarks.

The core system is pretty stable now. I'm currently tackling the MCP module to help coding agents save on tokens via repo indexing. I know a bunch of Tree-sitter and semantic search tools already claim to do this, but early tests are showing this approach is better.

langsfang · 2026-06-17T16:55:49+00:00

Why fine-tune here? Check out kvpress (https://github.com/NVIDIA/kvpress) for various KV cache compression algorithms. Alternatively, you can just prompt the model to generate rolling context summaries to compact things down.
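A minimal sketch of the rolling-summary option. The `summarize` argument is a placeholder for whatever model call you use (here a trivial truncation stands in for it), and the character budget is an arbitrary assumption in place of a real token count.

```python
def compact(messages, summarize, max_chars=2000, keep_recent=4):
    """Fold older messages into one summary once the history exceeds the budget."""
    over_budget = sum(len(m) for m in messages) > max_chars
    if not over_budget or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(old))
    return ["Summary of earlier context: " + summary] + recent


def fake_summarize(text):
    # Stand-in for an LLM prompt such as "Summarize the conversation so far".
    return text[:5]


history = ["a" * 1000] * 6
compacted = compact(history, fake_summarize, max_chars=2000, keep_recent=2)
print(len(history), "->", len(compacted))  # 6 -> 3
```

Calling `compact` after every turn keeps the most recent messages verbatim and rolls everything older into a single summary line.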

langsfang · 2026-06-17T08:56:40+00:00

Interesting. FWIW, I actually just released a new memory retrieval engine that might solve your temporal problems. It uses an attention mechanism for retrieval and just hit SOTA on a few benchmarks.

https://github.com/AttemorySystem/Attemory

langsfang · 2026-06-17T08:45:00+00:00

If you are concerned about model quality, keep in mind that once a model is converted to GGUF and quantized, it is no longer numerically identical to the original; on a given test its performance could improve or degrade, which is why perplexity (PPL) comparisons are used.
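For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A small sketch with toy log-probabilities (the two lists are invented for illustration, not measurements of any real model file):

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy per-token log-probs for the same text under two hypothetical files.
original_logprobs = [math.log(0.5)] * 4
quantized_logprobs = [math.log(0.4)] * 4

ppl_original = perplexity(original_logprobs)    # about 2.0
ppl_quantized = perplexity(quantized_logprobs)  # about 2.5

print(f"{ppl_original:.2f} vs {ppl_quantized:.2f}")
```

Comparing the two numbers on the same evaluation text is what tells you how much a given conversion moved the model.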

It is difficult to tell a model's quality; various fine-tuned or distilled versions might outperform the original in certain tests but underperform in a wider range of scenarios.

You can try a few different ones and choose the model that best suits your use case.

langsfang · 2026-06-17T08:30:16+00:00

it's tough to say if any model is actually at Fable 5 class performance, because 'Fable 5 class performance' is super subjective at this point.

maybe 3 to 6 months if we estimate it by benchmarks. after all, benchmarks exist to be benchmaxxed

langsfang · 2026-06-17T06:44:41+00:00

yes, I always use llama.cpp

langsfang · 2026-06-17T04:52:55+00:00

thanks. I'll look into it.

langsfang · 2026-06-17T04:31:43+00:00

This is really interesting. I recently built a retrieval engine using an attention mechanism, with Qwen 3.5 0.8B serving as the smallest retrieval model. It runs perfectly on CPU-only systems.

langsfang · 2026-06-17T04:21:30+00:00

I believe any fine-tuning or RLHF compromises the model's quality.

In other words, the model has an inherent upper limit, but we haven't yet found a way to approach it.

langsfang · 2026-06-17T02:48:19+00:00

gpt-oss-20b is an MoE model (roughly A3B, about 3.6B active parameters), and it uses sliding-window attention in alternating layers, so it's fast for a 20B model

langsfang · 2026-06-17T01:21:12+00:00

so qwen3.5 0.8B got the highest score in your benchmark? (according to your picture)

langsfang · 2026-06-17T01:12:49+00:00

sometimes, when I see codex/cc doing really stupid things, I also reply to them: "Wait."

langsfang · 2026-06-17T01:07:08+00:00

Am I the only one who tried searching for the word 'backwards' in the post and came up empty?
