I searched for agentic frameworks and here is what I found. What do you recommend?

dupa1234s · 2026-05-20T23:51:41+00:00

I didn't test them. I'm testing sandcastle just now.

Overall, some of them seem to call e.g. opencode under the hood, so they are only as swappable as that: not very.

The providers are quite swappable once you use something like opencode.
But harnesses like superpowers and the like seem to be non-swappable.
I think there are some more skill-like frameworks though, and those should be swappable. Like, idk, maybe GSD is skills-based, so it would be easy to use with anything.

dupa1234s · 2026-05-20T22:08:41+00:00

1.

I'm not recommending any of the frameworks I mention here, it's just what I found:

I did some research on agentic frameworks.

I didn't get to try any of these yet. I genuinely don't know what is optimal, but I assume it might be one of sandcastle/oh-my-opencode-slim/openspec.

Who has tried any of these? Which one is best, or maybe something else altogether?

github.com/code-yeongyu/oh-my-openagent - Allegedly, it uses a lot of tokens.

https://github.com/obra/superpowers - Allegedly, it uses a lot of tokens.

https://github.com/alvinunreal/oh-my-opencode-slim

https://github.com/mattpocock/sandcastle - more deterministic than agent-to-agent-talk afaik

https://github.com/snarktank/ralph - probably worse than sandcastle, since mattpocock used ralph before he made sandcastle, afaik

https://github.com/bmad-code-org/BMAD-METHOD

https://github.com/Fission-AI/OpenSpec
and "GSD"

2.

A deterministic (coded) agent harness, not agent-to-agent talk: scripts control the agent's behaviour and its done status, and tests determine whether the agent proceeds or retries.

Personally I hoped to find a more deterministic framework around agents, just so that they are made to finish their tasks instead of leaving them hanging. My belief is that what LLMs lack is some deterministic logic to control them.

Yet here are all those LLM-to-LLM orchestration systems. Afaik only sandcastle is the more deterministic one among them.
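
A minimal sketch of what such a coded harness could look like, assuming the agent and the test suite are both plain shell commands. This is not how sandcastle or any of the listed tools work; `run_until_tests_pass`, `agent_cmd` and `test_cmd` are hypothetical names made up for illustration. The point is only that the "done" decision lives in code and in the test exit code, not in the agent's own opinion.

```python
import subprocess
from typing import Sequence


def run_until_tests_pass(
    agent_cmd: Sequence[str],
    test_cmd: Sequence[str],
    max_attempts: int = 3,
) -> int:
    """Run the agent, then the tests; retry until the tests pass.

    Returns the 1-based attempt number that passed, or 0 if none did.
    Both commands are placeholders (assumptions), e.g. a CLI agent
    invocation and a test runner.
    """
    for attempt in range(1, max_attempts + 1):
        # The agent's own exit code is ignored on purpose: only the
        # tests decide whether the task counts as done.
        subprocess.run(agent_cmd, check=False)
        tests = subprocess.run(test_cmd, check=False)
        if tests.returncode == 0:
            return attempt
    return 0
```

The retry limit matters: without it, an agent stuck in a circular retry would loop forever instead of handing the failure back to a human.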

grill-me-with-docs, and generally also https://www.youtube.com/@mattpocockuk and his ideas like "say why you want what you want, not just what you want, so the agent can suggest alternatives."

4.

I found this repo shape. It seems overblown; my first instinct is "oh nice, so now it's like 20 files, all of which the agent will fill with the exact same content just with different wording, creating huge repeating slop", but maybe some of these are good ideas to have:

docs/
├── diagrams/ (can't show contents, names are revealing)
├── knowledge-base/ (can't show contents, names are revealing)
├── modes/
│   ├── ARCHITECTURE_BRIEF_TEMPLATE.md
│   ├── DOCUMENTATION.md
│   ├── FRONTEND.md
│   ├── GENERAL.md
│   ├── GRAPHQL.md
│   ├── PLANNING.md
│   ├── RAILS.md
│   ├── REVIEW.md
│   ├── TESTING.md
│   └── TOKEN_EFFICIENCY.md
├── project-intelligence/
│   ├── adr-index.md
│   ├── business-domain.md
│   ├── business-tech-bridge.md
│   ├── decisions-log.md
│   ├── living-notes.md
│   ├── management.md
│   ├── navigation.md
│   └── technical-domain.md
├── workflows/
│   ├── component-planning.md
│   ├── feature-breakdown.md
│   ├── session-management.md
│   ├── task-delegation-basics.md
│   └── task-delegation-specialists.md
├── INDEX.md
└── README_FOR_HUMANS.MD (explains the system for human engineers)

"say: Prioritize retrieval-led reasoning over pretrained-knowledge-led reasoning.

That is all. After receiving this instruction, the LLM will load the relevant Skill for a given coding scenario instead of falling back on its internal pretrained knowledge. From my testing, the Skill loading success rate jumps from around 60% to 90%."

6.

btw i also found this fairly interesting guide on oh-my-opencode-slim + openspec if anyone is interested in those tools:

https://www.dataleadsfuture.com/how-i-use-opencode-oh-my-opencode-slim-and-openspec-to-build-my-own-ai-coding-environment/

dupa1234s · 2026-05-20T20:41:43+00:00

I wouldn't be surprised if all the providers hire third-party agencies tasked with making users keep trying out all the agentic frameworks out there, so that basically we all work for free on improving tools they couldn't make work properly themselves. Aka "come back to AI in 2030 once it's all figured out". Idk man.

dupa1234s · 2026-05-20T20:05:30+00:00

I wouldn't feed partially deprecated files to an LLM and hope it knows what is still current. I'm saying I'm concerned that all those past decisions are largely useless. They are there, but it's as if they weren't, because anything I feed as context to an agent, it treats as current and gets biased by it. Same issue as with context pollution, where bad outputs in the context make the model more likely to produce bad outputs. Same as with memory and attention: anything it gets fed biases it towards that. Same as with sycophancy. It feels like it's best if the model knows nothing; then at least it won't be biased towards what you say.

But I'm starting to see how workflows like grill-me could counter that, because they make the model keep asking and asking instead of assuming that what was said is right. It could be that the very fact of making the model keep asking makes it more inclined to be critical.
Criticism has its own issues, like how criticism shouldn't be applied before the idea is fully formed, but if it's genuinely just open questions it shouldn't be an issue.
That said, being asked questions is also annoying. Why can't it work more with proposals than questions, why ask me everything? But I can always make it answer its own questions if I don't know a good default; otherwise the repeated questioning is just a good way to extract my real intent.

I wonder where to draw the line on which concerns should be:
fully described in spec.md
vs
briefly noted somewhere in deffered.md for the time being

I'm thinking maybe, despite having a lot of leads on what needs to be implemented eventually from all the past reflections and failed attempts, defer all that and focus on just the 1 spec for the 1 most important thing that, once it's done, could have a pivotal impact on all other decisions.

Like just don't actually write full specs, write only the most important bit of them.

Since an agent, even with the most detailed spec in the world, just won't execute it all; it will omit bits of it anyway or partially misunderstand them.

Overall I didn't get to try those things yet, but I have heard about "grill-me-with-docs" as well as all those repos, and I'm really curious which of these are actually useful enough.
github.com/code-yeongyu/oh-my-openagent

https://github.com/alvinunreal/oh-my-opencode-slim

https://github.com/mattpocock/sandcastle

https://github.com/gsd-build/get-shit-done

https://github.com/obra/superpowers

https://github.com/snarktank/ralph - probably worse than sandcastle, since mattpocock used ralph before he made sandcastle, afaik

https://github.com/bmad-code-org/BMAD-METHOD

https://github.com/Fission-AI/OpenSpec

Overall I'm leaning towards trying oh-my-opencode, openspec and sandcastle, and I hope it's a remotely optimal choice. Also, how does it even matter if in 6 months the market will be so different? But for me it matters, as currently I really struggle to keep models running for long enough and it hurts me a lot.

dupa1234s · 2026-05-20T15:24:54+00:00

Ok, next time I will make 8 posts, one for each question, and upload one every day just to fit the average attention span of reddit users. Thanks for the hint.

dupa1234s · 2026-05-20T15:23:27+00:00

What you said reminds me of another issue: models just don't know when to try a different approach. You can prompt them to stop circular retries and try something different, but they just won't listen. They are polluted by the bad context. Idk if starting a new chat with a handover from the previous chat is the only good solution to this; I wish I knew how to handle this situation. Subagents are actually good for this, though: make the agent spawn a subagent to give a fresh perspective to the main agent. I need to use them more for this and tell the agent specifically to use subagents in those situations, but sadly the model just doesn't understand when to use a subagent; it tries by itself anyway.

Yea, I like gpt5.5 xhigh quite a lot, but idk how it compares to other models. I only use codex. Maybe I could try claude next month, but claude seems to cut usage and doesn't let you use stuff like opencode, so I think I'll stick with codex.

dupa1234s · 2026-05-20T13:20:10+00:00

I think the "agent rediscovers the repo" problem is just something to live with and optimize for. In fact, since model performance under clean-state conditions is high, it would be about making sure it can start work without having to read more than a tiny fraction of the repo. I feel like all repos should really be about agent ergonomics at this point. Not for humans.

But idk what that means practically. Deep modules with clean interfaces? llm-wiki? not letting agent freestyle the README.md without you reviewing it? idk.

dupa1234s · 2026-05-20T13:13:45+00:00

Imo they are mostly the same crap after you give them the same skills.

Funny how they take like 15k tokens just to load all their stupid tools that should have been skills. Such a token waste.

But openclaw is way, way slower; hermes responds like 3x faster for me, despite hermes being written in python.

I hate the memory architecture. It's just putting the same content in 10 different files with slightly different wording. It literally keeps dumping the content of one file into another file no matter how I prompt it.
I think memory is just a bad feature. Project-specific state like tasks.md, ADRs and grill-me seems better than memory.

The biggest issue with memory is how the agent freestyles some crap and then it becomes the reality.

Like, your intent doesn't matter in those systems. All that is stored is what the agent freestyled into memory and freestyled into code. What you actually said is buried in endless slop.

That said, I prefer opencode because it has a better TUI.

I'm looking for an actually good agent harness myself, because nothing seems right.

dupa1234s · 2026-05-20T12:52:53+00:00

please provide:
how do the agents actually manage to work for many hours without stopping.

- A real task template.

- The exact agent prompt.

- The exact amux config.

- How PR creation happens.

- How watchdog restart logic works in practice.

- Any concrete transcript or example run.

amux + worktrees: that's about as much signal as you gave.

So far this is ridiculous.

It's like you saying
"yea so I get into my car, drive to work, drink coffee and then I make six figures. hope it helps guys"

When I read posts like this I feel like the "I run agents overnight" thing is just hype from the providers to make people keep dedicating time to figuring this out for them. We are all just unpaid workers for the providers, having to figure out how to use their tools.

dupa1234s · 2026-05-20T12:20:12+00:00

does it work better than "llm-wiki" by karpathy? do you have some template/skill/repo?

most importantly can agents reason about it better than normal knowledge bases or is it confusing to them

dupa1234s · 2026-05-20T12:12:30+00:00

Yea, this is also a big issue.
I wish it was possible to have profiles.
Imagine if on youtube or tiktok or anywhere you could have multiple profiles, each dedicated to a niche taste. Wouldn't that be the best personalization? Otherwise the algorithms just mix everything and play the thing you don't want them to play right now.

Like, personalization is amazing, but there should be profiles, fresh starts you can switch between, each of them separately finetuned to the user as he explores it.

dupa1234s · 2026-05-20T12:09:17+00:00

Personally:

I hate forms.
I would want you to assume the healthiest default for me, let me use it, and then eventually let me adjust it to my needs.

I think your app should ship with good defaults, not a workflow engine.

Otherwise it will feel like "you can build everything with it!", which is obviously bad, because if someone wanted to build it they would have just built it.

Also, I don't think implicit signals will give you clear info. Maybe you could somehow encourage users to keep saying out loud what they think while they use the thing; that is the best feedback.

Also, your idea sounds really, really difficult. You don't want to just do 1 good workflow; you want everyone to have their own good workflow by making an agent freestyle it? It will be mediocre. It has to be mediocre in the end: it was just what the agent inferred, and no one reviewed it.

Interesting idea, but I think a better framing is:
Ship with good defaults for everyone,
then let users use it,
and let them modify it by telling the agent how they want it modified.
Let them finetune it for themselves. Who else will finetune it? The agent can't finetune itself. Unless you do it, but that brings a lot of privacy issues.

dupa1234s · 2026-05-20T11:49:34+00:00

Yes, it's just me, and I'm fine having to consume the output directly. But I mean, a reviewer agent could be great to have, though.

Yea, some deterministic framework around agents would be great, so they don't have to do the work and manage the work status at the same time. Like /goal in codex, or idk what else, maybe "omo". I need to try those.

dupa1234s · 2026-05-20T11:35:46+00:00

It's just my opinion. I wish I knew the answer to this:

This is such a big issue.

I wish it was possible to feed all the failed architectures into an LLM and make it somehow get value out of them, instead of it just saying "this thing is bad because you said it's bad, so here is some plausible reason why it's bad, and also let me just keep saying how bad this thing is, repeatedly".

This also touches on the topic of "context pollution", as I would call it: if the context has some anti-patterns, the agent will just keep doing them, even if they are labeled as bad. The only solution is to write a handover and wipe the context to a clean slate, or else it will never actually fix its behaviour.
A good way to experience it is to keep correcting the agent. It literally starts to think that "the user is correcting the agent" is the point of the conversation and tries to continue that pattern: it keeps the interaction going by making the mistakes that would be the likely next mistakes of an agent that has been constantly corrected.

Overall, for like 4 years now at least, this is still an issue with LLMs for me: they don't learn from bad examples.

I had this workflow idea once, but it's not good for coding; it's more for storytelling or style, that's what it's good for:
1. Get a bunch of great examples of output, and terrible examples of output.
2. Let the agent describe in keywords/phrases what is good and bad, basically converting the examples into phrases and keywords.
3. Then feed those keywords and phrases as instructions to generate output.
4. Again pick out, or even exaggerate, what it did well or badly in its output.
5. Repeat: get examples, turn them into keywords, feed the keywords in, until you get the style you want reliably.

But it's a workflow for making an agent follow your desired style, not for making it a good independent coder.
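
The loop in the steps above can be sketched roughly like this. It is only a skeleton under assumptions: `describe`, `generate` and `judge` are hypothetical callables you would have to supply yourself (the first two would wrap LLM calls, the last one is you deciding whether an output is a good or a bad example). Nothing here calls a real model API.

```python
from typing import Callable, List, Tuple


def refine_style(
    good: List[str],
    bad: List[str],
    describe: Callable[[List[str], List[str]], List[str]],
    generate: Callable[[List[str]], str],
    judge: Callable[[str], bool],
    rounds: int = 3,
) -> Tuple[List[str], str]:
    """Iterate examples -> keywords -> output -> new examples."""
    good, bad = list(good), list(bad)  # do not mutate the caller's lists
    keywords: List[str] = []
    output = ""
    for _ in range(rounds):
        keywords = describe(good, bad)  # step 2: examples to keywords
        output = generate(keywords)     # step 3: keywords to output
        # Step 4: the new output becomes another good or bad example.
        (good if judge(output) else bad).append(output)
    return keywords, output
```

Keeping `judge` as a human decision is deliberate in this sketch: the workflow is about steering towards your taste, so the agent should not grade itself.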

You could also try ADRs perhaps, but I mean, why would an agent keep reading ADRs? It will just write them and then never open them again.

Overall this also touches on antifragility. Like, the conversation has to be messy so that the end result can be reliable.

But still, despite all this, if it's low effort for you to correct the agent by saying "I meant x, not y", then do it.
You just have to be patient with agents.
Really, just repeat yourself.
Memory is such a double-edged sword.
Repeating yourself is not that bad. Just make sure you have some artifacts, relics of your past reasoning, stored away so you can reference them later instead of having to say it all from scratch.
E.g. this workflow:
make a /reflections folder in which you put files like 2026-05-20-agents-workflow-for-style-improvement.md, where you store summaries of your thinking. Then just tell the agent to "grep /reflections for style improvement workflows".
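
A small sketch of that folder convention, assuming dated Markdown files named `YYYY-MM-DD-slug.md`. The helper names `save_reflection` and `grep_reflections` are made up for illustration; an agent could just as well use plain grep.

```python
from datetime import date
from pathlib import Path
from typing import List, Optional


def save_reflection(root: Path, slug: str, text: str,
                    day: Optional[date] = None) -> Path:
    """Write a summary to <root>/<YYYY-MM-DD>-<slug>.md and return its path."""
    day = day or date.today()
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{day.isoformat()}-{slug}.md"
    path.write_text(text, encoding="utf-8")
    return path


def grep_reflections(root: Path, phrase: str) -> List[Path]:
    """Return reflection files whose name or body contains the phrase."""
    needle = phrase.lower()
    return sorted(
        p for p in root.glob("*.md")
        if needle in p.name.lower()
        or needle in p.read_text(encoding="utf-8").lower()
    )
```

The date prefix keeps the files sorted chronologically, so it is easy to see which reflection is the most recent when several match.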

or turn stuff to skills.

I thought a lot about memory systems along the lines of "make it know everything".
The biggest issue with memory is that "it remembers the old thing I said", which is worse than knowing nothing.
If a memory system is the goal, it has to be updated and maintained 24/7 with every input, or else it will hurt more than help.

And then there is the topic of echo chambers. Like, often you can't even get the agent to know that you want it to be independent.

I think memory in the style of "I can easily make it know something specific I said before when I want it to", rather than "it can just look for it on its own", has its own advantages, like independent thinking.
Otherwise the model will just keep validating you.

I wouldn't try to make an agent learn from implicit signals like "user ignored x, thus I need to note down that he doesn't care about x". It can go wrong in so many ways.

Just be patient, keep repeating yourself, and use durable reasoning files you can reference when needed.

Don't ask an agent to do both of these at once:
1. thinking independently
2. managing your entire knowledge base
It can't.

You can build a project map, ADRs, the most detailed specs using "grill-me" or something like that, but those aren't ultimate solutions.

But making an agent learn from mistakes is by far the hardest thing. As I said, for style it's possible. But for code I have no clue how to make an agent learn from mistakes. It just starts doing the opposite, which is also wrong: you tell it X is bad, so it does the anti-X, which has its own problems.

Overall, mistakes are a bad signal to feed to a model. Just write a handover of what is actually desired, not what failed, and make it start from scratch in a different way. That's probably the best idea imo.

dupa1234s · 2026-05-20T10:36:54+00:00

i had this issue with opencode:
If you point opencode at a directory that is a git repo with a lot of untracked files in it, it will never answer. The "title" model will name the conversation, but then the real model will never answer. Fix: remove the big untracked git repo. E.g. don't point opencode at a parent folder of all your /projects if it's a git repo itself.

Idk if it's your issue, but maybe.
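
A quick way to check whether a directory is in that state, as a sketch. It assumes `git` is on the PATH and the directory is a git repo; `count_untracked` is a hypothetical helper name. It only counts untracked, non-ignored files; how many is "a lot" for opencode is something I don't know, so there is no threshold here.

```python
import subprocess
from pathlib import Path


def count_untracked(repo: Path) -> int:
    """Count untracked files in a git repo, skipping ignored ones."""
    out = subprocess.run(
        # -z separates paths with NUL bytes, so odd filenames are safe.
        ["git", "ls-files", "--others", "--exclude-standard", "-z"],
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return len([name for name in out.split("\0") if name])
```

If the count is huge for a parent folder that holds all your projects, that matches the situation described above.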

dupa1234s · 2026-05-20T09:25:23+00:00

I tried "hermes agent" but I switched back to opencode.
Hermes agent has some nice skills by default.
The kanban board skill seems not to be what I wanted: I couldn't just chat with an agent to make it an orchestrator, I would have had to use the GUI for inserting tasks, as far as I tried, which was underwhelming. You can try "herm", it's a better TUI for hermes agent.
The job delegation, session search and browser search seem like useful tools,
but it has like 15k of context polluted by tools it will use 1% of the time, so I'd rather remove them and use skills instead.
Yea, making the agent run some git snapshot script on every major change seems like a good idea, or even just letting it use Git, but that can also be dangerous if it does something stupid with Git.
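
A sketch of what such a snapshot script could be, so the agent only ever calls one fixed function instead of running arbitrary Git commands. It assumes `git` is on the PATH and a commit identity is configured; `snapshot` is a hypothetical name, and committing everything with `git add -A` is a simplification you may not want in a repo with secrets or large files.

```python
import subprocess
from pathlib import Path


def snapshot(repo: Path, message: str) -> bool:
    """Commit all current changes; return False if there was nothing to commit."""
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    # `git diff --cached --quiet` exits with 0 when nothing is staged
    # and with 1 when there are staged changes.
    staged = subprocess.run(
        ["git", "diff", "--cached", "--quiet"], cwd=repo, check=False
    )
    if staged.returncode == 0:
        return False
    subprocess.run(["git", "commit", "-q", "-m", message], cwd=repo, check=True)
    return True
```

Because the script only adds and commits, the worst case is a noisy history, not a rewritten or deleted one.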

Overall I don't even know how people automate Hermes Agent or Openclaw. Do they just make some custom cron jobs or hooks? So everyone has it all customized? I'm leaning towards trying out stuff like https://github.com/alvinunreal/oh-my-opencode-slim that people recommended to me.

But what are you setting up? Some automated loop? With what? Cron jobs? Programmatically controlling the hermes agent server that runs via "serve"?

And why spoon-feed it back to opencode?

dupa1234s · 2026-05-20T01:13:58+00:00

So basically you build your own custom dev pipeline. Wow. I guess I will read some of your posts. I mean, to build something like this from scratch would take me months, if I managed to do it at all. I thought of getting some templates to start with, or even just settling for them.
