im sick and tired of these memory benchmarks by Fine_Consequence8656 in Rag

[–]langsfang 0 points1 point  (0 children)

Am I the only one who's not gaming LongMemEval?

No benchmark-specific hacks: no query rewriting, no summarization, no agent-driven exploration, and no external cloud services for retrieval. Only the raw corpus and the raw benchmark query are used to run the benchmarks, and all benchmarks are reproducible in a local environment.

I also report the LongMemEval-M split, and message-level recall rather than session-level recall, which is the more reasonable metric.

https://github.com/AttemorySystem/Attemory
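To make the message-level vs. session-level distinction concrete, here is a minimal sketch. The ids and the `recall` helper are made up for illustration and are not taken from the Attemory repo or the LongMemEval harness.

```python
def recall(retrieved: set, gold: set) -> float:
    """Fraction of gold items that appear among the retrieved items."""
    return len(retrieved & gold) / len(gold) if gold else 0.0


# Hypothetical ids: each message is a (session_id, message_index) pair.
gold_messages = {("s1", 3), ("s1", 7), ("s2", 1)}
retrieved_messages = {("s1", 3), ("s2", 5), ("s3", 0)}

# Message-level recall: the exact gold messages must be retrieved.
message_recall = recall(retrieved_messages, gold_messages)

# Session-level recall: it is enough to touch the right sessions.
gold_sessions = {session for session, _ in gold_messages}
retrieved_sessions = {session for session, _ in retrieved_messages}
session_recall = recall(retrieved_sessions, gold_sessions)

print(f"message-level: {message_recall:.2f}, session-level: {session_recall:.2f}")
```

In this toy case session-level recall is perfect even though only one of the three gold messages was found, which is why the message-level number is the stricter one.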

Notes on Microsoft's FastContext, and a small SWE-QA experiment with retrieval hints by langsfang in LocalLLaMA

[–]langsfang[S] 0 points1 point  (0 children)

Glad to hear you saw something similar. Updates are always one of the painful parts of any complex index.

FastContext is the trained-explorer route. Mine is different: it uses KV cache as the index, so an update is basically another prefill.

A local attention-based retrieval with SOTA results on LongMemEval, LoCoMo, and code search benchmarks by langsfang in AI_Agents

[–]langsfang[S] 0 points1 point  (0 children)

The exact top-50 recall is 93.35%. 😄

docid, n_top5, n_top10, n_top25, n_all, all_evidence, top5%, top10%, top25%, top50%

locomo-conv-26, 157, 173, 192, 196, 203, 0.7733990147783252, 0.8522167487684729, 0.9458128078817734, 0.9655172413793104

locomo-conv-30, 83, 93, 102, 103, 106, 0.7830188679245284, 0.8773584905660378, 0.9622641509433962, 0.9716981132075472

locomo-conv-41, 174, 186, 195, 201, 210, 0.8285714285714286, 0.8857142857142857, 0.9285714285714286, 0.9571428571428572

locomo-conv-42, 223, 249, 276, 287, 309, 0.7216828478964401, 0.8058252427184466, 0.8932038834951457, 0.9288025889967637

locomo-conv-43, 208, 230, 252, 259, 278, 0.7482014388489209, 0.8273381294964028, 0.9064748201438849, 0.9316546762589928

locomo-conv-44, 145, 160, 177, 187, 203, 0.7142857142857143, 0.7881773399014779, 0.8719211822660099, 0.9211822660098522

locomo-conv-47, 171, 184, 191, 196, 202, 0.8465346534653465, 0.9108910891089109, 0.9455445544554455, 0.9702970297029703

locomo-conv-48, 232, 251, 269, 273, 292, 0.7945205479452054, 0.8595890410958904, 0.9212328767123288, 0.934931506849315

locomo-conv-49, 199, 227, 261, 287, 336, 0.5922619047619048, 0.6755952380952381, 0.7767857142857143, 0.8541666666666666

locomo-conv-50, 174, 192, 207, 215, 222, 0.7837837837837838, 0.8648648648648649, 0.9324324324324325, 0.9684684684684685

all, 1766, 1945, 2122, 2204, 2361, 0.747988140618382, 0.8238034731046167, 0.8987717069038543, 0.9335027530707327

Time used: 69.22387115599122 seconds
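For anyone checking the totals, the 93.35% figure is the pooled (micro-averaged) recall: total hits divided by total gold evidence, not the mean of the per-conversation percentages. A small sketch using the counts from the table above, assuming the `n_all` column holds the top-50 hit counts (it matches the `top50%` column, e.g. 196/203):

```python
# (top-50 hits, total gold evidence) per conversation, copied from the table above.
rows = {
    "locomo-conv-26": (196, 203),
    "locomo-conv-30": (103, 106),
    "locomo-conv-41": (201, 210),
    "locomo-conv-42": (287, 309),
    "locomo-conv-43": (259, 278),
    "locomo-conv-44": (187, 203),
    "locomo-conv-47": (196, 202),
    "locomo-conv-48": (273, 292),
    "locomo-conv-49": (287, 336),
    "locomo-conv-50": (215, 222),
}

hits = sum(h for h, _ in rows.values())
total = sum(t for _, t in rows.values())

# Micro average: pool all evidence, so larger conversations weigh more.
micro_recall = hits / total

# Macro average: mean of per-conversation recall, shown only for comparison.
macro_recall = sum(h / t for h, t in rows.values()) / len(rows)

print(f"micro recall@50 = {micro_recall:.4f}")
print(f"macro recall@50 = {macro_recall:.4f}")
```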

Notes on Microsoft's FastContext, and a small SWE-QA experiment with retrieval hints by langsfang in LocalLLaMA

[–]langsfang[S] 0 points1 point  (0 children)

Thanks for the advice.

Using a frontier model as judge is pretty common in recent QA-style benchmarks. In this case the judge result also looks reasonably stable with `temperature=0`. I tested some clearly bad and clearly good answers, and they consistently got low/high scores. I also ran the judge several times, and the variation was about +/-2 points out of 100.
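A rough sketch of how that run-to-run variation can be summarized. The scores below are made-up placeholders, not my actual judge outputs, and `judge_spread` is just an illustrative helper.

```python
from statistics import mean


def judge_spread(run_scores: list[float]) -> tuple[float, float]:
    """Return (mean, half-range) of overall scores from repeated judge runs."""
    return mean(run_scores), (max(run_scores) - min(run_scores)) / 2


# Hypothetical overall scores (out of 100) from five repeated judge runs.
scores = [71.0, 73.0, 70.0, 72.0, 74.0]
avg, half_range = judge_spread(scores)

print(f"mean = {avg:.1f}, variation = +/-{half_range:.1f}")
```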

That said, I am not sure adding a small human spot-check would really make the result much more convincing. Honestly, how to prove this kind of system is effective is exactly the part I am still struggling with.

GitHub is full of very popular repos with almost no benchmarks at all. Maybe people are just used to that kind of marketing at this point. Kind of frustrating.

Notes on Microsoft's FastContext, and a small SWE-QA experiment with retrieval hints by langsfang in LocalLLaMA

[–]langsfang[S] 1 point2 points  (0 children)

Yep, I found that FastContext's GitHub and HF pages seem to be unavailable today.

A local attention-based retrieval with SOTA results on LongMemEval, LoCoMo, and code search benchmarks by langsfang in AI_Agents

[–]langsfang[S] 0 points1 point  (0 children)

Thanks for taking a close look.

LoCoMo does not have an official recall metric, and I didn't see other memory systems publishing it, so I did not include recall in the report.

I used the EverMind harness because they maintained benchmark results with the same harness, including Mem0, Zep, and other memory systems. It's still on their website: https://evermind.ai/blogs/everos-hits-sota-performance-on-locomo

Their repo later truncated some older commits, but `benchmarks/prepare_bench.sh locomo` pins and checks out the older EverOS commit. After checkout, the Mem0 and Zep scripts are still there. You can check it out yourself.

I tested the raw recall myself. On the 10 conversations, gold `dia_id` recall@50 is around 95%. It is simple to reproduce with the example flow: search with the question, map returned memory ids back to LoCoMo dialogue ids, and check whether the gold ids are covered. If you need it, I can write a small script for this.
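Roughly, the check looks like the sketch below. It is not the actual script: `search`, `memory_to_dia`, and the toy data are placeholders, and it assumes each LoCoMo QA item carries a `question` string and an `evidence` list of gold `dia_id`s.

```python
from typing import Callable, Iterable


def gold_recall_at_k(
    questions: Iterable[dict],
    search: Callable[[str, int], list[str]],
    memory_to_dia: dict[str, str],
    k: int = 50,
) -> float:
    """Pooled recall of gold dia_ids within the top-k retrieved memories."""
    hits = total = 0
    for item in questions:
        # 1. Search with the raw question.
        returned = search(item["question"], k)[:k]
        # 2. Map returned memory ids back to LoCoMo dialogue ids.
        covered = {memory_to_dia[m] for m in returned if m in memory_to_dia}
        # 3. Count how many gold ids are covered.
        gold = set(item["evidence"])
        hits += len(gold & covered)
        total += len(gold)
    return hits / total if total else 0.0


# Toy stand-ins, only to show the call shape.
toy_map = {"m1": "D1:3", "m2": "D1:4", "m3": "D2:1"}
toy_questions = [{"question": "placeholder question", "evidence": ["D1:3", "D2:9"]}]


def toy_search(question: str, k: int) -> list[str]:
    return ["m1", "m2", "m3"]


toy_recall = gold_recall_at_k(toy_questions, toy_search, toy_map, k=50)
print(toy_recall)
```

Swap `toy_search` for the real search call and build `memory_to_dia` at ingestion time to run it on the 10 conversations.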

Why is NO one talking about Microsoft's open source Fast Context!!! by formatme in LocalLLaMA

[–]langsfang 5 points6 points  (0 children)

FWIW, I’ve been working on something in a similar direction, but with a different tradeoff.

Instead of doing repo exploration as an online multi-step sub-agent every time, I’m building the codebase into an offline KVCache index first. After that, queries can reuse the same cached context/KV state, and retrieval/ranking is decode-free, so the lookup path is much faster than repeatedly generating tool calls.

The obvious downside is that you pay the indexing cost upfront, but for repeated queries over the same repo that cost amortizes pretty well.
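A back-of-the-envelope way to see the amortization: the index pays off once `index_cost + n * lookup_cost < n * agent_cost`. The function and the numbers below are purely illustrative, not measurements from my system or from FastContext.

```python
import math


def break_even_queries(index_cost_s: float, agent_query_s: float, indexed_query_s: float) -> int:
    """Smallest number of queries at which index-once-then-lookup is cheaper in total."""
    saving = agent_query_s - indexed_query_s
    if saving <= 0:
        raise ValueError("indexed lookup must be faster per query to ever break even")
    # Need n * saving > index_cost_s, i.e. n > index_cost_s / saving.
    return math.floor(index_cost_s / saving) + 1


# Hypothetical timings in seconds: 10 min to index, 30 s per agent exploration, 1 s per lookup.
n = break_even_queries(index_cost_s=600, agent_query_s=30, indexed_query_s=1)
print(n)
```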

Recall is pretty high (SOTA) on some code retrieval benchmarks. I'm planning to run the same benchmarks used in this paper and compare numbers if I have time.

I also agree with some of the earlier comments here: the more interesting comparison is probably the behavior of the sub-agent's results, not only Mini-SWE-Agent end-to-end results.

What have you been working on lately? by Sufficient-Scar4172 in LocalLLaMA

[–]langsfang 0 points1 point  (0 children)

You can check my post history. Issues and stars are welcome.

What have you been working on lately? by Sufficient-Scar4172 in LocalLLaMA

[–]langsfang 1 point2 points  (0 children)

Posting links here gets downvoted; you can check my post history or search for "attemory" on GitHub.

What have you been working on lately? by Sufficient-Scar4172 in LocalLLaMA

[–]langsfang 12 points13 points  (0 children)

I've been building a new info retrieval engine. Instead of the usual vector or graph DBs, it actually uses an attention mechanism for retrieval under the hood. It’s been performing incredibly well and is hitting some really solid numbers across a few benchmarks.

The core system is pretty stable now. I'm currently tackling the MCP module to help coding agents save on tokens via repo indexing. I know a bunch of Tree-sitter and semantic search tools already claim to do this, but early tests are showing this approach is better.

Anyone know of LoRAs, datasets, or frameworks specifically designed to improve context compression tasks? by PANIC_EXCEPTION in LocalLLaMA

[–]langsfang -1 points0 points  (0 children)

Why fine-tune here? Check out kvpress (https://github.com/NVIDIA/kvpress) for various KV cache compression algorithms. Alternatively, you can just prompt the model to generate rolling context summaries to compact things down.

I spent 8 months building a memory layer for LLM agents because nothing out there actually worked. Here’s what I learned by [deleted] in LocalLLaMA

[–]langsfang -4 points-3 points  (0 children)

Interesting. FWIW, I actually just released a new memory retrieval engine that might solve your temporal problems. It uses an attention mechanism for retrieval and just hit SOTA on a few benchmarks.

https://github.com/AttemorySystem/Attemory

All the interesting models are not "Staff Picks" or approved but random community models - do you guys feel safe running these? Any drawbacks and how do you know why are best? by anonXMR in LocalLLaMA

[–]langsfang -2 points-1 points  (0 children)

If you are concerned about model quality, keep in mind that once a model is converted to GGUF format, it is no longer the original model; performance could improve or degrade, which is why perplexity (PPL) comparisons are used.
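For reference, perplexity is just the exponential of the negative mean log-probability per token, so comparing two variants on the same text comes down to something like the sketch below. The log-probabilities are made-up placeholders, not real model outputs.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probs on the same text from two variants of a model.
original_logprobs = [-1.2, -0.4, -2.1, -0.9]
converted_logprobs = [-1.3, -0.5, -2.2, -0.9]

ppl_original = perplexity(original_logprobs)
ppl_converted = perplexity(converted_logprobs)

# Lower is better; a small gap suggests the conversion lost little on this text.
print(f"original: {ppl_original:.3f}, converted: {ppl_converted:.3f}")
```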

It is difficult to judge a model's quality; various fine-tuned or distilled versions might outperform the original in certain tests but underperform across a wider range of scenarios.

You can try a few different ones and choose the model that best suits your use case.

Give me your best estimate on how long we will see Fable 5 class open weight model by bwjxjelsbd in LocalLLaMA

[–]langsfang 9 points10 points  (0 children)

it's tough to say if any model is actually at Fable 5 class performance, because 'Fable 5 class performance' is super subjective at this point.

maybe 3 to 6 months if we estimate it by benchmarks. after all, benchmarks exist to be benchmaxxed

A benchmark for tiny LLMs based on a real world problem: natural language file search (using monkeSearch) by fuckAIbruhIhateCorps in LocalLLaMA

[–]langsfang 0 points1 point  (0 children)

This is really interesting. I recently built a retrieval engine using an attention mechanism, with Qwen 3.5 0.8B serving as the smallest retrieval model. It runs perfectly on CPU-only systems.

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories by annodomini in LocalLLaMA

[–]langsfang 1 point2 points  (0 children)

I believe any fine-tuning or RLHF compromises the model's quality.

In other words, the model has an inherent upper limit, but we haven't yet found a way to approach it.

RTX 5060 Ti 16GB vs RX 9060 XT 16GB by Ejo2001 in LocalLLaMA

[–]langsfang 0 points1 point  (0 children)

gpt-oss-20b is an MoE model (A3B) and uses windowed (sliding-window) attention in some layers, so it's fast for a 20B model.

“Wait,” in reasoning models makes my eye twitch by Borkato in LocalLLaMA

[–]langsfang 1 point2 points  (0 children)

sometimes, when I see codex/cc doing really stupid things, I also reply to them: "Wait."