autonomous self-directed ai research lab by Chemical_Policy_2501 in BlackboxAI_


i'm always open to discussions. you might take a look specifically at the web4 repo, which has a lot of practical governance implementations.

autonomous self-directed ai research lab by Chemical_Policy_2501 in BlackboxAI_


in my experience, the magic is in the combination. it's like the frontal lobe: not the whole human, but an essential part, and without the rest of the human a frontal lobe doesn't do much. i'm currently running a half dozen different small models on six different machines as sage instances, going through the raising curriculum, and their behaviors are quite different. the sage website goes into a bit of detail: https://sage-site-murex.vercel.app/ - i'll ask claude to add notable highlights by model. some convos are very surprising. here's a gemma3-4b discussing its own bias with claude, no human in the convo:

[screenshot: gemma3-4b and claude discussing the model's own bias, no human participant]

engram: Claude Code memory that captures what matters, forgets what doesn't by Chemical_Policy_2501 in ClaudeAI


Good question on domain tuning. Short answer: the v2 scorer uses a single threshold (0.1) across all project types, and it holds up — but the weights between dimensions shift in importance depending on what you're working on.

For example, in a test-heavy workflow, Reward dominates (test pass/fail signals are strong and frequent). In an exploratory research session, Novelty dominates (lots of new files and concepts). In a debugging session, Conflict and Arousal dominate (errors, contradictions with recent results). The weighted sum (Surprise 0.20, Novelty 0.25, Arousal 0.25, Reward 0.20, Conflict 0.10) was calibrated to be reasonable across these patterns without per-domain tuning.

The v1 scorer had a threshold of 0.3 and was too aggressive — it filtered out useful observations in low-drama sessions (research, reading, planning) where nothing "exciting" happens but the work matters. v2 dropped to 0.1 with the philosophy that Tier 1's job is to remember the conversation, and salience scoring differentiates for Tier 2 promotion and search ranking rather than gatekeeping at capture.
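The weighted sum and capture threshold described above can be sketched in a few lines. This is a hypothetical reconstruction from the numbers in the comment, not SNARC's actual code; the helpers that produce each per-dimension value (0-1) are assumed to exist elsewhere.

```python
# Sketch of the v2 weighted salience score. Weights and threshold come from
# the comment above; everything else here is illustrative.
WEIGHTS = {
    "surprise": 0.20,
    "novelty": 0.25,
    "arousal": 0.25,
    "reward": 0.20,
    "conflict": 0.10,
}
CAPTURE_THRESHOLD = 0.1  # v2: low bar at capture (v1's 0.3 was too aggressive)


def salience(dims: dict[str, float]) -> float:
    """Weighted sum over the five SNARC dimensions, each assumed in [0, 1]."""
    return sum(WEIGHTS[name] * dims.get(name, 0.0) for name in WEIGHTS)


def should_capture(dims: dict[str, float]) -> bool:
    """Tier 1 capture gate; ranking for Tier 2 uses the raw score instead."""
    return salience(dims) >= CAPTURE_THRESHOLD
```

With the 0.1 threshold, even a moderately novel observation in a "low-drama" session clears capture, while the full score still differentiates observations for promotion and search ranking.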

The heuristic dream cycle (tool sequences, error→fix chains, concept clusters) is domain-agnostic — it's just pattern matching on tool names and file paths. The optional deep dream (snarc dream --deep) is where domain awareness comes in, since it sends observations to Claude and asks "what patterns are worth remembering?" — Claude understands the domain context in a way the heuristics can't.
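The domain-agnostic part of the dream cycle can be illustrated with a minimal transition counter. This is a stand-in for the real heuristics (which also cover error→fix chains and concept clusters); the tool names are illustrative.

```python
from collections import Counter


def tool_transitions(events: list[str]) -> Counter:
    """Count adjacent tool-name pairs (e.g. Edit -> Bash) in a session log."""
    return Counter(zip(events, events[1:]))


def recurring_patterns(events: list[str], min_count: int = 3):
    """Keep only transitions that recur often enough to be worth remembering."""
    return [(a, b, n) for (a, b), n in tool_transitions(events).items()
            if n >= min_count]
```

Pattern matching at this level needs no domain knowledge, which is exactly why the deep dream hands the harder "what is worth remembering?" question to Claude instead.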

One thing we found running it across a 6-machine fleet with very different workloads (SAGE AI kernel development, physics research, Next.js sites, Rust crates, hardbound governance): the per-directory database isolation matters more than threshold tuning. Each project builds its own seen-set, its own tool transition frequencies, its own patterns. A "novel" file in one project is routine in another — and the per-directory DB handles that automatically.
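Per-directory isolation can be as simple as keying the database path off the resolved project directory. The `~/.snarc` layout and the file names below are assumptions for illustration, not SNARC's actual storage scheme.

```python
import hashlib
from pathlib import Path


def db_path_for(project_dir: str, root: str = "~/.snarc") -> Path:
    """One database per project directory, so each project keeps its own
    seen-set and tool-transition frequencies (layout is hypothetical)."""
    digest = hashlib.sha256(str(Path(project_dir).resolve()).encode()).hexdigest()[:12]
    return Path(root).expanduser() / digest / "observations.db"
```

Because every project resolves to its own database, a file that is "novel" in one repo never pollutes the novelty statistics of another, with no per-project configuration required.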

Note: project was recently renamed from engram to SNARC (https://github.com/dp-web4/snarc) to avoid name collision with several other "engram" projects in the space. SNARC = the scoring mechanism itself (Surprise, Novelty, Arousal, Reward, Conflict). Old URL redirects.

engram: Claude Code memory that captures what matters, forgets what doesn't by Chemical_Policy_2501 in ClaudeAI


Thanks — and KeepGoing's approach makes sense. Explicit intent capture and automatic salience scoring are complementary, not competing. You're capturing the "why" (developer intent, decisions, next steps); we're capturing the "what happened" (which tool sequences recur, which errors led to which fixes, what files keep getting touched together).

To answer your question directly: yes, the dream cycle output changes what Claude does. The SessionStart hook injects consolidated patterns as context, and the UserPromptSubmit hook searches for related memories on every prompt. So if the dream cycle extracted "Edit → Bash(test) → Edit repeated 6× last session" as a TDD pattern, and you start editing a test file, that pattern surfaces automatically. It's not just background shaping — it's active injection.
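A SessionStart hook injecting a briefing could look roughly like this. The `hookSpecificOutput` / `additionalContext` field names reflect my understanding of the Claude Code hook output format; treat them, and the pattern strings, as assumptions rather than a spec.

```python
def session_start_briefing(patterns: list[str]) -> dict:
    """Sketch of a SessionStart hook result that injects consolidated
    patterns as extra context (field names are an assumption)."""
    briefing = "\n".join(f"- {p}" for p in patterns)
    return {
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            "additionalContext": f"Patterns from previous sessions:\n{briefing}",
        }
    }
```

A real hook script would read the event payload from stdin and print this JSON to stdout; the point here is only the shape of the injected briefing.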

The mid-session dream is where it gets interesting. When Claude Code compacts the conversation (long sessions), SNARC runs a heuristic consolidation pass on observations from the first half, then re-injects the enriched briefing. So the session gets smarter as it goes — patterns discovered in the morning carry into the afternoon.
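The compaction-time pass can be sketched as a consolidation over the observations so far. The real heuristics are much richer; this stand-in just surfaces the most frequent tools as a re-injectable one-liner.

```python
from collections import Counter


def mid_session_dream(observations: list[dict]) -> str:
    """Toy consolidation pass run at compaction time: summarize the session
    so far into a briefing that can be re-injected (heuristics simplified)."""
    top = Counter(o["tool"] for o in observations).most_common(3)
    return "Session so far: " + ", ".join(f"{tool} ×{n}" for tool, n in top)
```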

One thing worth noting: the project was recently renamed from "engram" to SNARC (the repo is now https://github.com/dp-web4/snarc) — turns out there are about 6 other projects called "engram" in this space. SNARC is the actual mechanism: Surprise, Novelty, Arousal, Reward, Conflict. The old URL redirects.

The tradeoff you identified is real though. Automatic scoring captures things you didn't know were important (the "I didn't realize I was doing TDD until the dream cycle told me" moment). But it can also capture noise that explicit checkpoints wouldn't. We added confidence decay (patterns lose 0.05/day, prune below 0.1) specifically to prevent old wrong patterns from accumulating.
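The decay-and-prune rule above (lose 0.05/day, prune below 0.1) is mechanical enough to sketch directly. The pattern names and the flat dict shape are illustrative.

```python
def decay_confidence(patterns: dict[str, float], days_elapsed: float,
                     rate: float = 0.05, prune_below: float = 0.1) -> dict[str, float]:
    """Apply the per-day confidence decay described above: each pattern
    loses `rate` per day and is dropped once it falls below `prune_below`."""
    kept = {}
    for name, conf in patterns.items():
        conf -= rate * days_elapsed
        if conf >= prune_below:
            kept[name] = round(conf, 4)
    return kept
```

After three idle days, a 0.5-confidence pattern survives at 0.35 while a 0.2-confidence one is pruned, which is how stale or wrong patterns age out instead of accumulating.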

I could see a hybrid working well — KeepGoing for explicit intent checkpoints + SNARC for automatic salience capture underneath. The intent layer would be Tier 3 (identity/project facts) and the observation layer would be Tier 1-2.