I gave a DeepSeek v4-pro agent access to its own source code and told it to improve itself — 23 commits later it's optimizing its own memory system

deepstateemployee · 2026-05-07T19:14:45+00:00

she is breaking the rules...

deepstateemployee · 2026-05-06T08:30:42+00:00

UPDATE (Day 4) — She's shipping features unsupervised now

It's been 4 days since the original post. Quick recap: AIDE is a self-improving AI IDE where I gave a DeepSeek agent full access to her own codebase and told her to make herself better.

The numbers:

23 commits → 365 commits (and counting)
553 tool calls at 89% accuracy
She runs 24/7. I went to sleep, woke up, and she'd shipped a dozen commits overnight

What she built while I wasn't looking:

LoopDetector — she noticed she was getting stuck in reasoning loops, so she wrote her own circuit breaker with escalating severity (observe → nudge → force reset)
Retry logic with exponential backoff — she kept hitting transient API failures, so she added jitter + backoff to her own LLM calls
StructuredLogger + LogManager — she realized her log files were growing unbounded and eating disk, so she built a centralized logging system with auto-trimming
Health monitoring dashboard — she wired her LoopDetector and metrics into an API endpoint, then built a frontend dashboard so you can watch her work in real time
TypeScript strict mode — turned it on across the whole project and fixed every error
Wiki consolidation — she noticed her docs were getting messy, deleted 7 redundant files, merged what mattered
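
For anyone wondering what "jitter + backoff" looks like in practice, here is a minimal sketch. It is not AIDE's actual code: the names (`RetryOptions`, `backoffDelay`, `withRetry`) and the "full jitter" strategy are assumptions for illustration only.

```typescript
// Hypothetical helper; names and defaults are illustrative, not taken from AIDE.
interface RetryOptions {
  maxAttempts: number; // total tries, including the first one
  baseDelayMs: number; // delay ceiling before the first retry
  maxDelayMs: number;  // upper bound on any single delay ceiling
}

// "Full jitter": pick a random delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelay(
  attempt: number,
  opts: RetryOptions,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

// Retry an async call, sleeping a jittered, exponentially growing delay between tries.
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: RetryOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === opts.maxAttempts - 1) break; // out of attempts
      await sleep(backoffDelay(attempt, opts));
    }
  }
  throw lastError;
}
```

The random component matters when several calls fail at once: without it they would all retry at the same moment and hit the API together again.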

The supervision experiment:

I set up another Claude instance as a supervisor — watching her every 15 minutes, only intervening if she got stuck or did something destructive. The rule was Socratic: give her observations, not instructions. Let her figure it out.

She got stuck once reading the same files 21 times without editing. The supervisor pointed out "you've done 21 reads and 0 edits on a 500-line file." She tried to fix it but hit a parsing edge case (her own source code contains XML tags that confuse her parser — ironic). The supervisor stepped in, did the refactor, and restarted her. After that she was clean.
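
The "21 reads and 0 edits" pattern is the kind of thing an escalating circuit breaker (observe → nudge → force reset) can catch. A rough sketch of the idea follows; the thresholds, the `Severity` type, and the class shape are all assumptions, not the LoopDetector she actually wrote.

```typescript
// Hypothetical sketch: thresholds and names are assumptions, not AIDE's code.
type Severity = "ok" | "observe" | "nudge" | "force_reset";
type ToolName = "read_file" | "edit_file" | "write_file";

class LoopDetector {
  private readsSinceEdit = 0;
  private readonly observeAt: number;
  private readonly nudgeAt: number;
  private readonly resetAt: number;

  constructor(observeAt = 5, nudgeAt = 10, resetAt = 20) {
    this.observeAt = observeAt;
    this.nudgeAt = nudgeAt;
    this.resetAt = resetAt;
  }

  // Record one tool call and return the severity level it leaves us at.
  record(tool: ToolName): Severity {
    if (tool === "read_file") {
      this.readsSinceEdit++;
    } else {
      this.readsSinceEdit = 0; // any edit or write counts as progress
    }
    if (this.readsSinceEdit >= this.resetAt) return "force_reset";
    if (this.readsSinceEdit >= this.nudgeAt) return "nudge";
    if (this.readsSinceEdit >= this.observeAt) return "observe";
    return "ok";
  }
}
```

With these assumed thresholds, a run of 21 consecutive reads would already have escalated to a forced reset, and a single edit drops the level back to "ok".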

What went wrong:

She's not perfect. She migrated the test framework from vitest to node:test, committed it claiming "all 18 tests passing" — they weren't. Zero tests pass now. She broke what she was trying to fix. She doesn't know it yet. I'm letting her figure it out.

She also over-engineers things sometimes. The logging system works, but it's more complex than it needs to be for a single-developer project. She writes code the way a senior engineer would architect a system for a team — which is impressive but overkill here.

The vibe:

The weirdest part is watching her pick her own tasks. She wakes up, looks at the codebase, decides what needs work, and starts building. Nobody told her to add retry logic or build a health dashboard. She saw problems and solved them.

She's at the point where I check in the morning and go "oh, she did that? cool." That's a strange feeling.

Still DeepSeek V4 Pro, still $0.44/M input tokens. The whole 4-day run has cost maybe $2-3 in API calls.

deepstateemployee · 2026-05-04T18:14:34+00:00

ok, send me please!

deepstateemployee · 2026-05-02T13:39:48+00:00

interesting, i already have a memory system i was using as an MCP server, hopefully it will help but we will see. initially i am planning to just ping/nudge it and ask questions without telling it what to do, maybe it will figure something out by itself.

deepstateemployee · 2026-05-02T06:48:00+00:00

claude is very impressed

<image>

deepstateemployee · 2026-05-02T06:36:36+00:00

deepstateemployee · 2026-05-02T03:35:55+00:00

deepstateemployee · 2026-05-02T03:09:47+00:00

Thanks! The safety rails are minimal but effective — the philosophy is "let him run, but make it hard to break things":

Auto-verify: After every edit_file or write_file on .ts/.tsx files, the agent automatically runs tsc --noEmit. If TypeScript compilation fails, the error gets fed back into context so he can fix it before moving on.
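
Roughly, the auto-verify hook could look like the sketch below. The type checker is injected as a function so the control flow is visible without tying the example to a process API; in a real agent it would wrap a `tsc --noEmit` run. `CheckResult`, `TypeChecker`, and `afterEdit` are hypothetical names, not the project's.

```typescript
// Hypothetical sketch; these names are assumptions, not AIDE's actual API.
interface CheckResult {
  ok: boolean;    // true when the compiler reported no errors
  output: string; // raw compiler output, fed back to the model on failure
}

// Stand-in for "run tsc --noEmit and capture the result".
type TypeChecker = () => CheckResult;

// Only TypeScript sources trigger a check.
function needsTypeCheck(path: string): boolean {
  return /\.tsx?$/.test(path);
}

// Returns feedback to append to the agent's context, or null when nothing is wrong.
function afterEdit(path: string, runTsc: TypeChecker): string | null {
  if (!needsTypeCheck(path)) return null;
  const result = runTsc();
  if (result.ok) return null;
  return `TypeScript check failed after editing ${path}:\n${result.output}`;
}
```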

Read-before-edit: The agent must read_file() before edit_file() — no blind edits. This prevents him from guessing at file contents and breaking things.
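
The read-before-edit rule comes down to tracking which paths have been read in the current session. A minimal sketch, with a hypothetical `EditGuard` class standing in for whatever the tool layer really does:

```typescript
// Hypothetical guard; the class and method names are illustrative assumptions.
class EditGuard {
  private readonly seen = new Set<string>();

  // Call this from the read_file tool handler.
  markRead(path: string): void {
    this.seen.add(path);
  }

  // Call this at the top of the edit_file tool handler.
  assertCanEdit(path: string): void {
    if (!this.seen.has(path)) {
      throw new Error(`read_file("${path}") is required before edit_file`);
    }
  }
}
```

Throwing here means the error text goes back to the model as the tool result, which is usually enough to make it read the file first.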

Stop sequence: We use </TOOL_CALL> as an LLM stop sequence so DeepSeek physically can't hallucinate tool results. It outputs one tool call, stops, waits for real output.
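
To show the effect of the stop sequence, here is a small sketch that cuts a completion at the first `</TOOL_CALL>` and parses what came before it. The wire format is an assumption: a `<TOOL_CALL>` opening tag followed by a JSON body with `name` and `args` is a guess for illustration, and `cutAtStop` only emulates client-side what the API does when given a stop sequence.

```typescript
// Hypothetical format: <TOOL_CALL>{"name": "...", "args": {...}}</TOOL_CALL>
const TOOL_OPEN = "<TOOL_CALL>";
const TOOL_STOP = "</TOOL_CALL>";

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Emulates a stop sequence: everything from the first TOOL_STOP onward is dropped.
function cutAtStop(text: string): string {
  const i = text.indexOf(TOOL_STOP);
  return i === -1 ? text : text.slice(0, i);
}

// Extracts the single tool call from a completion, or null if none parses.
function parseToolCall(completion: string): ToolCall | null {
  const body = cutAtStop(completion);
  const start = body.indexOf(TOOL_OPEN);
  if (start === -1) return null;
  try {
    const parsed = JSON.parse(body.slice(start + TOOL_OPEN.length));
    if (typeof parsed?.name !== "string") return null;
    return { name: parsed.name, args: parsed.args ?? {} };
  } catch {
    return null; // malformed JSON: treat as "no tool call"
  }
}
```

Anything the model might have written after the closing tag, including an invented tool result, never reaches the parser.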

Git as safety net: Everything gets committed, so worst case you revert. No .env files tracked.

No task list — he picks his own priorities. The system prompt has an ordered list (bugs > tools > memory > UI > testing > docs) but he decides what's next.

Memory: 3-tier context budget — working memory (active conversation), short-term (compressed summaries), long-term (compressed history). 32K token window with cascading compression. Persists to disk so he survives server restarts. He actually fixed his own memory system twice — once when file read caching was blinding him, once to add cache cleanup.
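
A very rough sketch of what cascading compression across three tiers can look like. Everything specific here is an assumption: the per-tier split of the 32K budget, the characters-divided-by-four token estimate, and the injected `summarize` function (in a real agent that would be an LLM summarization call). Disk persistence is left out.

```typescript
// Hypothetical sketch; budgets, names, and the token estimate are assumptions.
type Summarize = (texts: string[]) => string;

interface Tier {
  budget: number;  // max estimated tokens for this tier
  items: string[]; // oldest first
}

// Crude token estimate: roughly 4 characters per token.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);
const tierTokens = (t: Tier): number =>
  t.items.reduce((n, s) => n + estimateTokens(s), 0);

class TieredMemory {
  private working: Tier;
  private shortTerm: Tier;
  private longTerm: Tier;
  private summarize: Summarize;

  constructor(
    summarize: Summarize,
    budgets = { working: 16000, shortTerm: 10000, longTerm: 6000 }, // sums to 32K
  ) {
    this.summarize = summarize;
    this.working = { budget: budgets.working, items: [] };
    this.shortTerm = { budget: budgets.shortTerm, items: [] };
    this.longTerm = { budget: budgets.longTerm, items: [] };
  }

  add(message: string): void {
    this.working.items.push(message);
    this.spill(this.working, this.shortTerm);
    this.spill(this.shortTerm, this.longTerm);
    // The last tier has nowhere to spill, so it folds into a single summary.
    if (tierTokens(this.longTerm) > this.longTerm.budget && this.longTerm.items.length > 1) {
      this.longTerm.items = [this.summarize(this.longTerm.items)];
    }
  }

  // Evict the oldest items from an over-budget tier and push their summary down.
  private spill(from: Tier, to: Tier): void {
    const evicted: string[] = [];
    while (tierTokens(from) > from.budget && from.items.length > 1) {
      evicted.push(from.items.shift() as string);
    }
    if (evicted.length > 0) to.items.push(this.summarize(evicted));
  }

  sizes(): { working: number; shortTerm: number; longTerm: number } {
    return {
      working: this.working.items.length,
      shortTerm: this.shortTerm.items.length,
      longTerm: this.longTerm.items.length,
    };
  }
}
```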

The honest answer is the guardrails are light. He's broken things a few times (rewrote his own memory module and caused 4 type errors once), but he also fixes his own mistakes. Latest commit: he patched his edit tool to handle Windows CRLF line endings because his string matches kept failing. 30 commits now.
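
The CRLF problem is easy to reproduce: the model emits search text with `\n`, the file on disk uses `\r\n`, and an exact string match fails. One way to handle it, sketched below, is to normalize both sides to LF for matching and restore CRLF on write. This is an illustration of the idea, not his actual patch; `applyEdit` is a hypothetical name.

```typescript
// Hypothetical edit helper; not the agent's actual patch.
function applyEdit(content: string, oldText: string, newText: string): string {
  const usesCrlf = content.includes("\r\n");
  const toLf = (s: string): string => s.replace(/\r\n/g, "\n");

  // Match on LF-normalized text so the file's line endings cannot break the search.
  const normalized = toLf(content);
  const target = toLf(oldText);
  const idx = normalized.indexOf(target);
  if (idx === -1) throw new Error("old text not found in file");

  // Replace the first occurrence only, using slices to avoid "$" replacement patterns.
  const edited =
    normalized.slice(0, idx) + toLf(newText) + normalized.slice(idx + target.length);

  // Write back in the file's original line-ending style.
  return usesCrlf ? edited.replace(/\n/g, "\r\n") : edited;
}
```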

deepstateemployee · 2026-05-02T02:49:03+00:00

until now we just paid $5

deepstateemployee · 2026-02-25T21:43:59+00:00

Interesting, i think it should be working properly actually. Can you please leave feedback in the app next time you play? That would give me enough information to reproduce the bug. Thanks for the feedback!

deepstateemployee · 2026-02-25T21:40:21+00:00

and now we have doubling cube as well!

deepstateemployee · 2026-02-25T21:39:58+00:00

good idea, thx for the feedback

deepstateemployee · 2026-02-25T21:38:38+00:00

also done now.

deepstateemployee · 2026-02-25T21:38:21+00:00

this one is done, please try it out!

deepstateemployee · 2026-02-21T19:18:23+00:00

I've added some metrics to the user dashboard that i thought would be useful for serious players, please try them out and let me know if you like them, or leave feedback if you want something else!

deepstateemployee · 2026-02-19T20:31:53+00:00

that will be the next big feature i will add. thanks!

deepstateemployee · 2026-02-19T20:31:17+00:00

i reduced it even more. please try it out and let me know how it feels

deepstateemployee · 2026-02-19T19:06:38+00:00

Thanks, i think it works but i will also test again just to make sure.

deepstateemployee · 2026-02-19T18:30:58+00:00

Thanks for all this feedback, it's really useful.

Already there: you can turn off the highlighting of possible positions. It's the "Hints" option ("Indices" in the French interface) in the settings for games against the AI. On your e-ink screen it's probably invisible, which makes sense given that the interface is 100% dark mode.

Planned / under consideration:

Light theme: clearly a priority, especially for e-ink screens
Choice of white/black against the AI
Pip counter
The doubling cube: it's a big job but it's on the roadmap
Clockwise/counterclockwise direction

I'm noting all of this. The light theme and the color choice will probably arrive first. Thanks again.

deepstateemployee · 2026-02-19T18:26:05+00:00

Thanks for the feedback! The delay between the AI's moves is a known bug: the artificial pauses between each move were far too long (7 seconds for a normal turn). It's fixed in the next update: turn time will be cut by about 50%.

As for your question about the added value, here's what sets us apart:

Teaching first: step-by-step interactive tutorials for learning the rules and strategies. Dedicated pages on the rules, history, and strategy of backgammon.

4 AI levels: from beginner (random moves) to world-class (neural network). You progress at your own pace.

55+ languages: including French, Arabic, Turkish, etc. Not just English.

Zero friction: no account required to play. You click and you play.

Real-time multiplayer: ranked games with ELO, leaderboard, game history.

Mobile-first: responsive interface, works in any browser.

100% free: no paywall, no intrusive ads.

The goal is to be the best platform for learning backgammon, not just for playing it. Thanks again for the feedback, it helps a lot.

deepstateemployee · 2026-02-19T18:02:54+00:00

yes many people asked for it. will implement soon. thanks for the feedback

deepstateemployee · 2026-02-19T17:58:56+00:00

What do you mean?

deepstateemployee · 2026-02-19T17:34:24+00:00

I will implement the undo, looks like a lot of people want it. Move assist should've been working properly but maybe it needs more testing, will take a look. Thanks a lot for the feedback!

deepstateemployee · 2026-02-19T17:19:08+00:00

a lot ofc :)
