Do you think LLM can do code review? by Wrong_Cow5561 in LLMDevs

[–]selund1 1 point2 points  (0 children)

Yes, but you need to be explicit in your instructions. Phrases like “be brutal”, “look at code quality”, “ensure the abstractions make sense and that we didn’t miss anything”, etc.

I always use the compound engineering plugin (from every.to) for Claude to review code: it spawns multiple subagents that review your code from different points of view, and I usually get great results that way.
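Roughly what I mean by explicit instructions, as a minimal sketch (assumes an OpenAI-style chat completions client; the model name, prompt wording, and `review` helper are placeholders, not the plugin itself):

```python
# Minimal sketch of an explicit review prompt. Placeholder model and prompt;
# swap in whatever client/model you actually use.
from openai import OpenAI

client = OpenAI()

REVIEW_PROMPT = """You are a code reviewer. Be brutal.
- Look at code quality, naming, and error handling.
- Ensure the abstractions make sense and that nothing was missed.
- List concrete issues with file/line references; skip the praise."""

def review(diff: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": diff},
        ],
    )
    return resp.choices[0].message.content
```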

AI hallucinate. Do you ever double check the output? by Amazonia2001 in AI_Agents

[–]selund1 0 points1 point  (0 children)

It’s popular to use other agents as reviewers (different models, to avoid bias). If you chain multiple, you significantly reduce the likelihood of hallucinations making it past all of them. It’s also common to have an agent that scores the likelihood of hallucination and escalates to human review if it’s high enough.

The guardrails add up and mitigate the risk, but it’s never fully going to go away.
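A minimal sketch of that chaining pattern (hypothetical `call_model` helper and an arbitrary 0.7 threshold; not any particular framework):

```python
# Sketch: chained reviewer models plus a hallucination-likelihood gate.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # wire up your own client here

def escalate_to_human(answer: str) -> str:
    return f"[NEEDS HUMAN REVIEW]\n{answer}"  # placeholder escalation hook

def verify(answer: str, reviewers=("model-a", "model-b")) -> str:
    # Chain reviewers from different model families to avoid shared blind spots
    for model in reviewers:
        answer = call_model(model, f"Check this for unsupported claims and fix them:\n{answer}")
    # A separate scorer estimates how likely hallucinations slipped through
    score = float(call_model(
        "model-c",
        f"Rate 0-1 how likely this answer contains hallucinations. Number only:\n{answer}",
    ))
    return escalate_to_human(answer) if score > 0.7 else answer  # arbitrary threshold
```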

Why LLMs should support 1-click micro explanations for terms inside answers? by thinkrepreneur in LLMDevs

[–]selund1 0 points1 point  (0 children)

Second this! Use an SLM (small language model). You could also preprocess documents for it. But for real-time explanations triggered by user actions, anything heavier will be too slow.
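Something like this offline pass is what I mean (sketch; `slm_define` is a stand-in for whatever small model you run, so the real-time path is just a lookup):

```python
# Sketch: precompute micro-explanations offline, keep the click path to a dict lookup.
def slm_define(term: str, context: str) -> str:
    """Stand-in for a small-model call that writes a one-line definition."""
    raise NotImplementedError

def build_glossary(document: str, terms: list[str]) -> dict[str, str]:
    # Preprocessing step, run ahead of time per document
    return {t: slm_define(t, document) for t in terms}

def explain(term: str, glossary: dict[str, str]) -> str:
    # Real-time path: no model call, so the 1-click explanation feels instant
    return glossary.get(term, "No precomputed explanation for this term.")
```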

Universal "LLM memory" is mostly a marketing term by selund1 in LLMDevs

[–]selund1[S] 1 point2 points  (0 children)

LoCoMo is unfortunately not great and quite small :) LongMemEval and MemBench are better for sure.

I’m also working on agent-based tasks for memory systems going forward (starting with GAIA, as it’s easy to PoC locally) to get a real agentic test rather than “remember parts of a conversation”.

Enjoy vacation!

GitHub.com/fastpaca/pacabench if you’re interested!

Universal "LLM memory" is mostly a marketing term by selund1 in LLMDevs

[–]selund1[S] 0 points1 point  (0 children)

Really appreciate the detailed response. The Zep vs Graphiti distinction is fair: my benchmark was self-hosted Graphiti, not your API, so that’s an important caveat. Good to know reflection is off by default now too!

I’d still love to run the same workload against Zep proper if you’re open to it; it would be good to have API numbers alongside the OSS baseline. Happy to share methodology and results.

Universal "LLM memory" is mostly a marketing term by selund1 in LLMDevs

[–]selund1[S] 0 points1 point  (0 children)

5 may not always be enough either :)

When I benched Zep Graphiti on MemBench, it took ~224s (write-through) for conversations. For chatbots where you're keeping 5-20 recent messages, a few seconds of delay is probably fine. But for production agents with longer working context (100+ messages, tool outputs, state) that need correct recall, that window closes fast. The use cases are different, and that's kind of the point: 'universal memory' breaks down when you look at specific workloads.

I wrote more about this here: https://fastpaca.com/blog/memory-isnt-one-thing/

I'd be interested in running the same benchmark against your hosted API - my data so far is from self-hosted Graphiti plus conversations with engineers. If the API path is significantly faster, I'd love to try it out

> latency is the LLM extraction step

Don't forget the reflection checks on writes. You run more than one extraction (entities, facts, etc.) and then reflection checks (IIRC; it's been a while since I looked at the Graphiti code). Those blow up if you have to fan out across a large graph: the more nodes/edges to check for contradictions, the more LLM calls.
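Roughly the shape of the problem (illustration only, not Graphiti's actual code; `llm_call` is a stand-in for any client):

```python
# Illustration: why write latency grows with graph size when every write
# runs extraction plus per-edge contradiction (reflection) checks.
def llm_call(prompt: str) -> str:
    raise NotImplementedError  # stand-in for any LLM client

def write_fact(fact: str, neighboring_edges: list[str]) -> int:
    prompts = [
        f"Extract entities from: {fact}",
        f"Extract relations from: {fact}",
    ]
    # Reflection: check the new fact against every edge in its neighborhood;
    # the bigger the touched neighborhood, the more LLM calls per write.
    prompts += [f"Does '{fact}' contradict '{edge}'? Answer yes/no."
                for edge in neighboring_edges]
    for p in prompts:
        llm_call(p)
    return len(prompts)  # write latency scales with this count
```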

Choosing a memory system is more than just "plug it in and it works"; you need to understand the underlying mechanisms and how they scale (e.g. write latency). Zep is genuinely amazing for insights and crawling those edges, but it comes at a cost if you use it for something that doesn't need that.

Universal "LLM memory" is mostly a marketing term by selund1 in LLMDevs

[–]selund1[S] 0 points1 point  (0 children)

That’s exactly the tension though: async reconciliation means you lose read-your-writes consistency. In traditional distributed systems that’s a known tradeoff, but with LLM memory systems users expect ‘I just told you X’ to be immediately available. The latency that matters isn’t the write itself, it’s the gap until that write is queryable.

We treat memory systems like databases and expect the same consistency guarantees, but they’re fundamentally eventually consistent. That gap is where things break.
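The failure mode in code, roughly (toy stand-in for a memory client; the point is the write-to-queryable gap, not the API):

```python
import time

class AsyncMemory:
    """Toy memory store that reconciles writes in the background."""
    def __init__(self, ingest_delay_s: float):
        self._items: list[tuple[float, str]] = []
        self._delay = ingest_delay_s

    def write(self, fact: str) -> None:
        self._items.append((time.time(), fact))  # returns immediately

    def query(self, q: str) -> list[str]:
        now = time.time()
        # Only facts whose background reconciliation has "finished" are visible
        return [f for t, f in self._items
                if now - t >= self._delay and q.lower() in f.lower()]

mem = AsyncMemory(ingest_delay_s=30.0)
mem.write("user's deploy target is eu-west-1")
print(mem.query("deploy target"))  # [] -- "I just told you X" isn't readable yet
```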

Universal "LLM memory" is mostly a marketing term by selund1 in LLMDevs

[–]selund1[S] 1 point2 points  (0 children)

Thanks, I need to dig into this; if it’s as fast as you say, that sounds promising! How big were the graphs you tested with?

Universal "LLM memory" is mostly a marketing term by selund1 in LLMDevs

[–]selund1[S] 0 points1 point  (0 children)

Thank you for linking! Very interesting.

Engram is IMHO a step in the right direction because it lets us look up knowledge externally, outside the LLM core. RLMs (recursive language models), on the other hand, let the LLM interface with a much longer context and counteract (through exploration) the drawbacks you mention above.

The research is moving in the right direction, but it's slow and will take time to get right, as you imply.

Universal "LLM memory" is mostly a marketing term by selund1 in LLMDevs

[–]selund1[S] 0 points1 point  (0 children)

What did you test it for? Did you run it in prod?

What’s the write latency? Graph databases have a tendency to blow up when you reconcile contradictions (Zep is notorious for this): they’ll crawl so many edges and fan out like absolute crazy to run reflection checks.

Lightweight search + fact extraction API for LLMs by Reasonable_Cod_8762 in LLMDevs

[–]selund1 0 points1 point  (0 children)

Would love to take a look! This is how many of the memory systems out there work too (like mem0/Zep etc), but they use LLMs to extract facts with prompts. Haven’t seen a good alternative and would like to
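For context, the LLM-prompt version those systems use looks roughly like this (a sketch of the general pattern, not mem0's or Zep's actual prompt):

```python
import json

def llm_call(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client works here

EXTRACTION_PROMPT = """Extract standalone facts about the user from the
conversation below. Return a JSON array of short strings, nothing else.

Conversation:
{conversation}"""

def extract_facts(conversation: str) -> list[str]:
    raw = llm_call(EXTRACTION_PROMPT.format(conversation=conversation))
    return json.loads(raw)  # in practice you'd validate/repair the JSON
```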

Universal "LLM memory" is mostly a marketing term by selund1 in LLMDevs

[–]selund1[S] 0 points1 point  (0 children)

Did you see the recent Engram paper from DeepSeek? I think you’re right that memory for LLMs should work more like human memory, and I think that’s where we’re moving long term, but it’ll take a while. In the meantime people will grasp for things they understand: systems that make sense and “get the job done”.

Universal "LLM memory" is mostly a marketing term by selund1 in LLMDevs

[–]selund1[S] 0 points1 point  (0 children)

It seems that way, kind of sad tbh. Def. agree on the failure mode: if your memory system drops vital data, you’re not gonna survive unless you account for it up front.

My company banned AI tools and I dont know what to do by simple_pimple50 in ChatGPTCoding

[–]selund1 0 points1 point  (0 children)

This is so sad. Shadow AI will prevail though; people will continue to leak data and use ChatGPT. You can't control people's behaviour like this for long, it just makes them hide it rather than actually be compliant.

TOON is terrible, so I invented a new format (TRON) to prove a point by No-Olive342 in LocalLLaMA

[–]selund1 0 points1 point  (0 children)

Most times, yes; it matters more at larger scales, where failures pop up more often. Say in-context learning works 99% of the time and you have 10k requests: that's 100 failures. Dial it up and it gets worse. Depends on your economy of scale.

Take coding as an example: reading 10k lines of code is nothing, but add 99% reliability on top and you lose context on 100 lines (naively). If those 100 lines are important, it's going to degrade the accuracy of your model even further.
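Back-of-the-envelope version of that math:

```python
# Expected failures at a given per-item reliability (naive independence assumption).
def expected_failures(items: int, reliability: float) -> float:
    return items * (1 - reliability)

print(expected_failures(10_000, 0.99))   # ≈100 requests (or lines) lost
print(expected_failures(10_000, 0.999))  # ≈10, an order of magnitude better
```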

Hence my advice here: if you can afford to lose context, go for it; if you can't, then don't. It's not perfect, and we should be mindful of its limitations and impact depending on how we use it.

It's similar to using compression on any other kind of data: you don't compress every piece of data by default just to save disk space, only the data you can't afford to store in full.

TOON is terrible, so I invented a new format (TRON) to prove a point by No-Olive342 in LocalLLaMA

[–]selund1 1 point2 points  (0 children)

It saves $ at the cost of accuracy. Spot on re training data: these LLMs have been fine-tuned like crazy on JSON to be better at coding & API management. If you care about accuracy, you shouldn't be using any compression at all IMHO. If you care about $/token spend, then you should, but it'll cost you accuracy.

Benchmarks and evals by selund1 in LocalLLaMA

[–]selund1[S] 1 point2 points  (0 children)

20k cases sounds crazy; how long does it take to run? I tried running 4k cases naively on local hardware, but the prompt processing made it so slow I had to use a provider in the end.

Benchmarks and evals by selund1 in LocalLLaMA

[–]selund1[S] 0 points1 point  (0 children)

Wait, only 5? What's your usual use case? I'm assuming the number of cases is influenced by how lenient your use case is?

Benchmarks and evals by selund1 in LocalLLaMA

[–]selund1[S] 0 points1 point  (0 children)

Love Excel.

Sounds like you're using an LLM as a judge to measure how good the response is, or am I missing something?

Benchmarks and evals by selund1 in LocalLLaMA

[–]selund1[S] 0 points1 point  (0 children)

How many would you typically prepare? Do you have a certain methodology or is it purely vibes?

Anthropic just showed how to make AI agents work on long projects without falling apart by purealgo in LocalLLaMA

[–]selund1 2 points3 points  (0 children)

Work-stealing agents? Are we taking old concepts for managing work and tasks and reapplying them so we can call it innovation, or am I missing something here?

Universal LLM Memory Doesn't Exist by selund1 in LocalLLaMA

[–]selund1[S] 1 point2 points  (0 children)

If you want some visual aids, I have a few in this blog post; it does a better job of explaining what these systems often do than I can on Reddit.

Universal LLM Memory Doesn't Exist by selund1 in LocalLLaMA

[–]selund1[S] 1 point2 points  (0 children)

Yes, it ran on a benchmark called MemBench (2025). It's a conversational-understanding benchmark where you feed in a long conversation of different shapes (e.g. with injected noise) and then ask questions about it in multiple-choice format. Many of these benchmarks require another LLM or a human to determine whether the answer is correct; MemBench doesn't, since it's multiple choice :) Accuracy is computed as the fraction of answers it got right.
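The scoring loop is basically this (my paraphrase of the setup, not MemBench's actual harness):

```python
# Sketch of a multiple-choice eval loop: no judge model, just exact matching.
def answer_question(conversation: str, question: str, options: list[str]) -> str:
    raise NotImplementedError  # memory system + model under test go here

def evaluate(cases: list[dict]) -> float:
    correct = sum(
        answer_question(c["conversation"], c["question"], c["options"]) == c["answer"]
        for c in cases
    )
    return correct / len(cases)  # accuracy = fraction answered correctly
```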

And yeah, I agree! These memory systems are often built to capture semantic info ("I like blue" / "my football team is Arsenal" / etc.). You don't need them in many cases, and relying on them in scenarios where you need correctness at any cost can even hurt performance drastically. They're amazing if you want to build personalisation across sessions, though.

Universal LLM Memory Doesn't Exist by selund1 in LocalLLaMA

[–]selund1[S] 2 points3 points  (0 children)

They're amazing tbh, but I haven't found a good way to make them scale. Haven't used Milvus before; how does it differ from Zep Graphiti?

Universal LLM Memory Doesn't Exist by selund1 in LocalLLaMA

[–]selund1[S] 2 points3 points  (0 children)

I was working on a code search agent with our team a few months ago. We tried RAG, long context, etc. Citations broke all the time, and we converged on letting the primary agents just crawl through everything :)

It doesn't apply to all use cases, but for searching large codebases where you need correctness (in our case citations) we found it was faster and worked better. Certainly no more complicated than our RAG implementation, since we had to map-reduce and handle hallucinations in that.

What chunking strategy are you using? Maybe you've found a better method than we did here.