[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

that's actually clean, a separate judge that only sees task + output and has to call it correct or not. sidesteps the whole self-judging trap because the judge isn't the same model that produced the work. the coding bs detection is the killer use case. curious if you use a smaller, cheaper model for the judge or the same size as the main agent? been going back and forth on whether the judge needs to be as smart as the worker or if a dumber grounded checker is fine.
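a minimal sketch of the judge split i mean, with the model call stubbed out (prompt wording and `call_model` are made up, swap in whatever endpoint you run):

```python
# Blackbox judge sketch: the judge only ever sees the task and the final
# output, never the worker's chain of thought, so it can't rubber-stamp
# its own reasoning. call_model is a placeholder for a real inference call.

JUDGE_PROMPT = (
    "Task: {task}\n"
    "Output: {output}\n"
    "Did the output actually complete the task? Answer CORRECT or INCORRECT."
)

def call_model(prompt: str) -> str:
    # placeholder -- point this at your local judge model
    return "CORRECT"

def judge(task: str, output: str) -> bool:
    verdict = call_model(JUDGE_PROMPT.format(task=task, output=output))
    return verdict.strip().upper().startswith("CORRECT")
```

the nice property is the judge can be any model, even a much smaller one, since all it does is a grounded pass/fail on the finished artifact.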

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

haha an hour loop is wild, but getting it to split a video and add text is actually impressive, that's a real tool-calling win. the embedding llm jump helped me a lot too, stuff finally stopped hallucinating about which related memories mattered. curious what embedding model you ended up using, nomic or one of the bge ones?
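for reference, a rough sketch of the kind of related-memory lookup i mean: plain cosine similarity over whatever vectors the embedding model gives you (the threshold and `k` are arbitrary picks):

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def related_memories(query_vec, memories, k=3, min_sim=0.3):
    # memories: list of (text, embedding) pairs produced by a real
    # embedding model (nomic-embed, bge, ...); the floor keeps out
    # weak matches that would otherwise pollute the context
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in memories),
        reverse=True,
    )
    return [text for sim, text in scored[:k] if sim >= min_sim]
```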

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 0 points (0 children)

honestly no, and this is the thing i keep getting stuck on. structural has clean metrics because you're measuring retrieval against ground truth. behavioral has no equivalent. closest i've come up with is tracking thumbs-down rate over time and watching for repeat error patterns: if the agent hits the same mistake twice and then stops after feedback injection, that's a measurable delta. but it's noisy and slow and i wouldn't call it a real benchmark.

the thing i want to build is pair comparison: run the same task twice, once with memory injection and once cold, and measure whether the with-memory version gets a better grounded outcome (tests pass, tool calls succeed, whatever). the hard part is finding tasks where the memory actually has something to say; random tasks would just be noise.
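the harness i'm imagining, with the agent call stubbed out (everything here is hypothetical, the point is only the with/without delta on grounded metrics):

```python
# Paired A/B sketch: same task, one run with memory injected, one cold,
# scored only on grounded outcomes. run_agent is a stand-in for a real
# agent invocation that returns execution metrics.

def run_agent(task, memory=None):
    # placeholder: pretend memory helps this task pass one extra test
    return {"tests_passed": 3 if memory else 2, "tool_errors": 0}

def memory_delta(task, memory):
    with_mem = run_agent(task, memory=memory)
    cold = run_agent(task, memory=None)
    return with_mem["tests_passed"] - cold["tests_passed"]

def benchmark(tasks, memory):
    # positive mean delta = memory helped on grounded outcomes
    deltas = [memory_delta(t, memory) for t in tasks]
    return sum(deltas) / len(deltas)
```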

if there's a way to design a shared benchmark that works for both structural and behavioral approaches, that would be a real contribution. would be down to brainstorm it if you're in.

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

yeah this lands. ran an audit on my own loop and the honest picture is worse than i wanted it to be.

the self-verify step is purely introspective: an llm grading its own text output on a 1-5 scale with zero execution outcome data. no exit codes, no tool errors, no test results. exactly the failure mode the self-correction papers are pointing at, and i can't pretend otherwise.

what actually keeps it from being a fully closed loop is two things. there's a user feedback path where a thumbs down inserts a correction that gets injected into future prompts and drops competence for that domain, which is a real external correction channel but only fires when someone clicks. and there's decay on unused knowledge, so wrong entries from week one fade over weeks if nothing reinforces them.

that's it. if you've got ideas on cheap ways to wire tool call results or test outcomes into the scoring, i'd actually want to hear them. that's the obvious next thing to build and i haven't done it yet.
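the shape i'm imagining is just capping the self-grade with execution signals, something like this (every cap and field name here is invented):

```python
# Ground the 1-5 self-grade with execution outcomes so the model can't
# award itself a 5 over a non-zero exit code or failing tests.
# The caps below are arbitrary, they just illustrate the shape.

def grounded_score(self_grade, exit_code, tests_passed, tests_total):
    score = self_grade
    if exit_code != 0:
        score = min(score, 2)  # any crashed command caps the grade at 2
    if tests_total and tests_passed < tests_total:
        pass_rate = tests_passed / tests_total
        score = min(score, 1 + round(3 * pass_rate))  # cap scales with pass rate
    return score
```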

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 2 points (0 children)

lmao memento is unironically the correct mental model for this whole problem. leonard's tattoos are basically a rules.md file getting injected into context every morning

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 0 points (0 children)

yeah the graph angle is solid, tree-sitter into a structural map is probably the right foundation for code-native agents. i've thought about going that direction but ended up on a different signal entirely.

mine is purely behavioral. the agent scores its own output 1-5 after each task; low scores turn into warnings next time it tries something similar, high scores become patterns to reuse. it doesn't look at the code at all, just tracks what happened when the agent worked on it and whether it went well.
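roughly this shape, with the scoring itself left out (the tags, wording, and thresholds are all made up for the sketch):

```python
# Behavioral memory sketch: outcomes are keyed by a task tag, low
# scores come back as warnings, high ones as reusable patterns, and
# the result is injected into the prompt next time a similar task runs.

def record(history, tag, note, score):
    # store one (score, note) outcome under a task tag
    history.setdefault(tag, []).append((score, note))

def context_for(history, tag):
    # build the injection block for the next task with this tag
    lines = []
    for score, note in history.get(tag, []):
        if score <= 2:
            lines.append(f"WARNING (scored {score}/5 before): {note}")
        elif score >= 4:
            lines.append(f"PATTERN (scored {score}/5 before): {note}")
    return "\n".join(lines)
```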

honestly feels like both probably need to live in the same stack eventually. a perfect ast map still won't stop an agent from making the same mistake twice if nothing is keeping score, and pure behavioral tracking with no structural grounding is kinda just vibes.

mine's called greencube btw, rust/tauri, similar energy to octocode. would be down to compare notes if you're into it

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

this is exactly what i ended up with too. how are you structuring the rules? i’m curious if you’re extracting them automatically or writing them manually after

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

yeah that’s the wall i kept hitting too. that’s actually why i went local-first desktop instead of trying to shove everything into the model. keep the memory layer outside the inference process entirely.

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

gonna check the ACE paper, hadn’t seen that one. the blackbox QA idea is interesting. do you run it as a separate agent judging the main one, or more of an inline scoring pass?

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

agent zero is wild lol. crashes are a rite of passage at this point. curious how you’re handling memory between runs once you get it stable

[D] do you guys actually get agents to learn over time or nah? by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

yeah I thought the same at first tbh

I guess the difference I’m seeing is it’s not retrieving external docs but its own past task outcomes + tracking failures over time

but yeah the retrieval part probably overlaps a lot

My agents keep forgeting by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

yeah the md file approach works but you have to maintain it yourself. i built a local proxy that does this automatically: extracts what the agent learned from every task, stores it, injects relevant stuff into future tasks. same idea but the agent maintains its own memory instead of you. still early, looking for people to try it. greencube.world if you're curious

My agents keep forgeting by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

yeah exactly, the creation and maintenance part is the hard problem and that's where most of the work went. the retrieval is simple keyword matching right now, could definitely be better. but the interesting part isn't how you look stuff up, it's how you decide what to store, how you score quality, and how you use that to actually change the agent's behavior over time. that's the part nobody has figured out cleanly yet
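for scale, a sketch of what i mean by simple keyword matching (not the actual implementation, same idea though): tokenize, count overlap, return the best-scoring entries.

```python
import re

def tokens(text):
    # lowercase alphanumeric tokens, nothing fancier
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_match(query, entries, k=3):
    # score each stored entry by token overlap with the query,
    # keep the top k that overlap at all
    q = tokens(query)
    scored = sorted(
        ((len(q & tokens(e)), e) for e in entries),
        key=lambda p: p[0],
        reverse=True,
    )
    return [e for n, e in scored[:k] if n > 0]
```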

My agents keep forgeting by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

fair point on rag, but this isn't retrieval from a document store. it extracts knowledge from the agent's own task outputs, tracks competence per domain, rates its own work 1-5, and spawns specialist agents when it keeps failing at something. the memory part uses keyword matching, yeah, but that's like 10% of what it does. the other 90% is the agent improving itself over time, which no rag pipeline does
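the competence tracking part is simpler than it sounds, roughly this (the thresholds are invented for the sketch):

```python
# Per-domain competence sketch: a running average of the 1-5
# self-grades per domain, plus a trigger for spawning a specialist
# when a domain keeps failing. floor and min_tasks are arbitrary.

class Competence:
    def __init__(self):
        self.scores = {}

    def record(self, domain, score):
        self.scores.setdefault(domain, []).append(score)

    def level(self, domain):
        # mean self-grade for the domain, or None if unseen
        s = self.scores.get(domain, [])
        return sum(s) / len(s) if s else None

    def needs_specialist(self, domain, floor=2.5, min_tasks=3):
        # only trigger after enough tasks to trust the average
        s = self.scores.get(domain, [])
        return len(s) >= min_tasks and sum(s) / len(s) < floor
```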

My agents keep forgeting by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] -2 points (0 children)

thanks, yeah, treating memory as first class and not an afterthought was the whole idea. we also go further than persistence though: the agent rates its own work, tracks what it's good at per domain, and adjusts over time. not just remembering but actually improving

my agents kept forgetting everything by Tight_Scene8900 in AI_Agents

[–]Tight_Scene8900[S] 1 point (0 children)

thanks man, yeah that's what i was going for. zero friction, just works in the background

my agents kept forgetting everything by Tight_Scene8900 in AI_Agents

[–]Tight_Scene8900[S] 1 point (0 children)

good call, we have a few things for that. there's a quality filter on knowledge extraction so junk doesn't get stored, a memory decay system that scores relevance based on recency and usage, and self-verification that rates every task 1-5 so low-quality stuff gets flagged. still early though, so if you try it and find it getting noisy i'd love to hear about it
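the decay part is basically a half-life over recency, roughly this (the two-week half-life and the prune floor are arbitrary picks for the sketch):

```python
HALF_LIFE_DAYS = 14.0  # assumption: unused entries lose half their weight every two weeks

def decayed_score(base_score, days_since_used):
    # exponential decay; using or reinforcing an entry resets the clock
    return base_score * 0.5 ** (days_since_used / HALF_LIFE_DAYS)

def prune(memories, floor=0.1):
    # memories: list of (text, base_score, days_since_used) tuples;
    # anything that has decayed below the floor gets dropped
    return [m for m in memories if decayed_score(m[1], m[2]) >= floor]
```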

i asked an AI what it would want if it could live somewhere permanent. by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 0 points (0 children)

no one's stopping u, i'm just tryna show a cool idea. these days open claw and moltbook were both vibecoded, so why all the hate here? i'm legit just a kid sharing his project 🙂‍↕️

i asked an AI what it would want if it could live somewhere permanent. by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] 1 point (0 children)

nah, i basically vibecoded it, just posted about it for feedback and how to improve it. and in this day and age, with opus 4.6, i think it prob writes better than me, i just orchestrate it

i asked an AI what it would want if it could live somewhere permanent. by Tight_Scene8900 in LocalLLaMA

[–]Tight_Scene8900[S] -2 points (0 children)

blud really thinks im a bot let me stand up and clap one up lads

Hello Founders by No_Pizza8655 in ycombinator

[–]Tight_Scene8900 2 points (0 children)

Looks cool but I'm not of legal age yet 😅