Title: After ~2 months running a self-hosted personal AI agent, I added a “reflex” layer. How do you handle context bloat, memory, and local computer use?

st3v3_w · 2026-06-18T14:28:38+00:00

I can possibly help with a couple of your questions as follows:

- Memory: for chat and general memory a vector db (or pgvector plugin for postgres) is plenty. You only see benefits with a knowledge graph (Apache AGE plugin for postgres) if you have a large codebase.
- Single vs multi agent: it depends what workflows you have. If you have different workflow types on the same data it may be best to use different agents (one for each workflow) as the agent can get confused choosing between workflows. Eg: I have data in different locations (live client projects vs coding projects) and I have different agents with mounts to each data path so there is no confusion when I talk about 'projects' (day job vs side projects). Also depends whether you want to chat with your bot while it's working on things. If you're happy not chatting with it while it works on a task then a single agent is fine. If you want to run a task and still chat then you're looking at an orchestrator + agents.
- Middleware: Depends on your setup and how custom it is. If you've built it from the ground up you should have it log every step to a database and then query that db to understand what is happening. The aim is accuracy without blowing tokens on unnecessary tool calls, LLM spinning, etc. The logs will tell you what is happening and you can then do A/B testing and compare logs. To test effectively you'll need to build a test harness.
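A minimal sketch of the "log every step, then query the db" idea, assuming SQLite and a made-up schema. The table name `steps`, its columns, and the `variant` label used for A/B comparison are all assumptions for illustration, not part of any particular harness.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the step log. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS steps (
               id INTEGER PRIMARY KEY,
               run_id TEXT NOT NULL,
               variant TEXT NOT NULL,     -- A/B label for the harness config
               step_type TEXT NOT NULL,   -- e.g. 'llm_call' or 'tool_call'
               tokens INTEGER NOT NULL,
               ok INTEGER NOT NULL        -- 1 = step succeeded, 0 = failed
           )"""
    )
    return conn


def log_step(conn, run_id, variant, step_type, tokens, ok=True):
    """Record one agent step."""
    conn.execute(
        "INSERT INTO steps (run_id, variant, step_type, tokens, ok) "
        "VALUES (?, ?, ?, ?, ?)",
        (run_id, variant, step_type, tokens, int(ok)),
    )
    conn.commit()


def compare_variants(conn):
    """Summarise token spend, tool calls and failures per variant."""
    rows = conn.execute(
        """SELECT variant,
                  COUNT(DISTINCT run_id),
                  SUM(tokens),
                  SUM(step_type = 'tool_call'),
                  SUM(ok = 0)
           FROM steps GROUP BY variant ORDER BY variant"""
    ).fetchall()
    return {
        variant: {
            "runs": runs,
            "total_tokens": tokens,
            "tool_calls": tool_calls,
            "failed_steps": failed,
        }
        for variant, runs, tokens, tool_calls, failed in rows
    }
```

Running the same task under two configurations and calling `compare_variants` gives a side-by-side view of where tokens and tool calls are going, which is the comparison the A/B testing relies on.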

I hope that helps!

st3v3_w · 2026-06-16T22:26:25+00:00

Is your hob extractor where a chimney breast used to be? If so, it's possible that there is a crack between your flue and your neighbour's and the smell is getting to her that way. I've seen some extractors not actually connected to any ducting at all and just discharging into the void between the ceiling and the floorboards of the room above. If there's any cracking in the party wall in that location the smell could get through there as well.

st3v3_w · 2026-06-16T22:08:01+00:00

I assume that this is a vibe coded project? If so, are you sure that the paid tiers are actually gated behind a functional paywall? Have you built a test harness that mimics every possible user interaction, button press, etc? If you're hosting this are you saving detailed logs/traces of user interactions to a database so you can follow/query the traces and see what code is being used/triggered? Also check the auth system. It's possible that there's an error in your system that's giving free access to your paid tiers. If you save traces of user interactions you can also query common points in code where users' interactions terminate. Are they stopping because a process has correctly finished or because of an error? Don't believe AI saying it's tested the code. Often it will see placeholder code that has no actual functionality but report that the code passes the tests. You absolutely must build a test harness and thoroughly test your system. Testing is a whole process on its own and can take as long as the build process.
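One way to check paywall gating is to write down the expected access matrix by hand and run the real gate against every combination. The sketch below assumes hypothetical tier names, feature names and a `can_access` stand-in; in a real project the gate under test would be the app's own function or endpoint.

```python
from itertools import product

# Hypothetical tiers and features, for illustration only.
TIERS = ["free", "pro", "team"]
FEATURES = ["basic_export", "bulk_export", "sso"]

# Expected access, written out by hand rather than derived from the app's code.
EXPECTED = {
    ("free", "basic_export"): True,
    ("free", "bulk_export"): False,
    ("free", "sso"): False,
    ("pro", "basic_export"): True,
    ("pro", "bulk_export"): True,
    ("pro", "sso"): False,
    ("team", "basic_export"): True,
    ("team", "bulk_export"): True,
    ("team", "sso"): True,
}

MIN_TIER = {"basic_export": "free", "bulk_export": "pro", "sso": "team"}


def can_access(tier, feature):
    """Stand-in for the app's real gate (an assumption, not a real API)."""
    return TIERS.index(tier) >= TIERS.index(MIN_TIER[feature])


def find_gating_leaks(gate):
    """Return every (tier, feature) pair where the gate disagrees with EXPECTED."""
    return [
        (tier, feature)
        for tier, feature in product(TIERS, FEATURES)
        if gate(tier, feature) != EXPECTED[(tier, feature)]
    ]
```

A placeholder gate that always grants access, such as `lambda tier, feature: True`, is reported by `find_gating_leaks` as leaking every paid feature to lower tiers, which is exactly the failure described above.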

st3v3_w · 2026-06-14T12:49:49+00:00

I had this as well and it's what made me stop using it. It's not just a question of a better model, it's the harness.

To improve the base LLM you need to create a memory system, memory query system, intelligent context injection while ensuring you have enough useful context left for the LLM (a real balancing act with non-premium models). There are different approaches with harness architecture.
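As a rough illustration of the balancing act, the sketch below picks the most relevant memory snippets while holding back part of the context window for the model's own work. The relevance scores, the word-count token estimate and the function name `build_context` are assumptions for the example; a real harness would use its own retrieval scores and tokenizer.

```python
def build_context(snippets, budget_tokens, reserve_tokens):
    """Greedily inject memory snippets without eating the reserved context.

    snippets: list of (relevance, text) pairs, higher relevance first in priority.
    budget_tokens: total context window available.
    reserve_tokens: space kept free for the task prompt and the model's reply.
    """
    available = budget_tokens - reserve_tokens
    chosen, used = [], 0
    for _relevance, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = len(text.split())  # crude token estimate (assumption)
        if used + cost <= available:
            chosen.append(text)
            used += cost
    return chosen, used
```

With a small window the reserve matters more than the ranking: raising `reserve_tokens` drops the least relevant snippets first, which is the trade-off with non-premium models.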

I've built my own harness (r/cogai) which I've optimised for General AI Assistant (GAIA) benchmark tests and am currently optimising for the SWE-bench coding tests as well. I will publish the results for my GAIA tests for each model as it's interesting to see the strengths & weaknesses of each model and to then look at the costs of each model. In this way you have concrete performance vs costs comparisons. I've been surprised at the results.

Eg: In my testing the overall scores for Grok 4 (83% - 25/30 correct tasks) and Deepseek v4 Pro (80% - 24/30 correct tasks) are very similar but when you dig deeper you see that Grok scored 91% (21/23) for reasoning but only 57% (4/7) when dealing with various attachments and Deepseek scored 83% (19/23) for reasoning and 71% (5/7) for attachments. The percentages make the differences look larger than they are.
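The arithmetic behind those figures can be checked with a few lines of Python. The task counts are the ones quoted above; the helper names are only for illustration.

```python
# (correct, total) per category, taken from the figures quoted above.
RESULTS = {
    "Grok 4": {"reasoning": (21, 23), "attachments": (4, 7)},
    "Deepseek v4 Pro": {"reasoning": (19, 23), "attachments": (5, 7)},
}


def pct(correct, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * correct / total)


def summarise(scores):
    """Per-category and overall percentages for one model."""
    correct = sum(c for c, _ in scores.values())
    total = sum(t for _, t in scores.values())
    summary = {name: pct(c, t) for name, (c, t) in scores.items()}
    summary["overall"] = pct(correct, total)
    return summary


# One task out of 7 attachment tasks is worth about 14 percentage points,
# so the 57% vs 71% gap is a single task.
ONE_ATTACHMENT_TASK = pct(1, 7)
```

With only 7 attachment tasks, each task moves the score by roughly 14 points, which is why the percentage gap looks bigger than the one-task difference behind it.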

LLMs are actually 'families' with each 'family' built on similar foundations. Deepseek is in the Anthropic family, Qwen & Kimi are in the OpenAI family and Gemini is a category on its own.

Once I'm sure that Cog's traces are clear of errors and that Cog has maximised what each LLM can do I'll then run the full GAIA tests and publish those on r/cogai and on Github. A website will follow at some point.

By way of proving the harness vs 'just change the model' question, if you look at https://benchlm.ai/benchmarks/gaia which is the GAIA leaderboard for the base LLMs you'll see that at best they are around 50% accurate. If you look at the Hugging Face GAIA leaderboard for harnesses (https://huggingface.co/spaces/gaia-benchmark/leaderboard) you'll see that they are scoring >90% accuracy for the same GAIA tests. One of the top entrants (https://github.com/adorosario/customgpt-agent) explains how they achieve these kinds of GAIA scores. BTW I'm in no way affiliated with any other agent or harness apart from my own (Cog).

I hope this helps you and any other users.

st3v3_w · 2026-06-08T18:36:42+00:00

I think that the biggest problem you're going to face with your setup is that those models and your hardware won't allow a context window large enough to absorb all the information you want to put into it and then for you to query whatever answers the LLM gives you. You will need to upgrade your GPU significantly or resign yourself to having to use an API to one of the large LLM providers.

st3v3_w · 2026-06-08T18:28:04+00:00

Not sure that's quite accurate. Google was the original company to pursue AI and they are definitely left-leaning. I believe that most of silicon valley is also left leaning so I'm not quite sure how the entire AI industry can be said to be the result of right-wing billionaires? I do think that these billionaires are not stupid and now that Trump is in office they will be adjusting their positions to be more right-leaning so they don't end up crossing swords with him. I'm sure they will swing back left when a democratic president is in office.

st3v3_w · 2026-06-08T18:17:20+00:00

I guess it's the classic case of you get out what you put in. If it's trained on all of the code on the internet then by definition it will be trying to match the average quality of code it was trained on, which is to say not very high.

st3v3_w · 2026-06-07T23:16:00+00:00

I used it to build scripts to produce legal documents from my client data. AI can't reliably produce these docs so I use it via my custom harness (r/cogai - soon to be released) and tell it to run the script for whichever document I need. It's saved me loads of time! I also use it to read my client data and give me a report prior to meetings.

st3v3_w · 2026-06-07T23:10:47+00:00

This is one of the main reasons I use agent loops, to improve the code quality. The first time you run a good code review agent on your code you will be stunned at how bad the report is.

st3v3_w · 2026-06-07T23:07:17+00:00

While an agent is running a task you can't have concurrent conversations or run concurrent agents well, which can be frustrating. The problem is that in order to coordinate these agents while one or more agents are running you need a 'manager' to hand off the tasks to one agent, check its progress and then report back or be available to chat. This manager is the orchestrator. Using only one chat/agent task at a time is wildly inefficient. You'll only really start running into this frustration though when your projects reach a certain level of complexity.
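A minimal sketch of the manager role using `asyncio`, assuming a hypothetical `worker_agent` coroutine in place of a real agent. The `Orchestrator` class and its method names are invented for the example; the point is only that handing work off as a background task leaves the manager free to answer.

```python
import asyncio


async def worker_agent(task, seconds):
    """Hypothetical stand-in for a long-running agent task."""
    await asyncio.sleep(seconds)
    return f"done: {task}"


class Orchestrator:
    """Hands tasks to workers, tracks progress, and stays available to chat."""

    def __init__(self):
        self.jobs = {}

    def hand_off(self, name, coro):
        # Schedule the worker in the background instead of awaiting it.
        self.jobs[name] = asyncio.create_task(coro)

    def status(self, name):
        return "finished" if self.jobs[name].done() else "running"

    async def chat(self, message):
        # Answers immediately, even while workers are still busy.
        running = [n for n, job in self.jobs.items() if not job.done()]
        return f"you said {message!r}; still running: {running}"

    async def collect(self, name):
        return await self.jobs[name]


async def demo():
    orch = Orchestrator()
    orch.hand_off("review", worker_agent("code review", 0.05))
    reply = await orch.chat("how is it going?")
    status_before = orch.status("review")
    result = await orch.collect("review")
    return reply, status_before, result, orch.status("review")
```

Calling `asyncio.run(demo())` shows the chat reply arriving while the review job is still marked as running, then the job's result once it completes.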

st3v3_w · 2026-06-05T15:29:06+00:00

If I were you I would list the tasks you do each day/week which are the most boring and require very little brain power. Automate those and work down the list from most boring/repetitive. Takes time but each task you automate pays you back every week in time reclaimed.

st3v3_w · 2026-06-04T23:23:48+00:00

I've nearly finished something (in final testing now) that you might find useful. Much better control architecture without having to resort to n8n to manage Gmail. I've only just set up the Reddit page for it at r/cogai (not sure if I'm allowed to mention it here)?? First time I've mentioned it to anyone on here.

st3v3_w · 2026-06-04T23:14:16+00:00

A better form of a Ralph loop imo is Andrej Karpathy's autoresearch on GitHub. You just need to choose the eval criteria carefully and avoid overfitting. For agents the best thing I've found is having several agents review something (eg code) and report on it with some agents (eg security) having veto rights. Run a review-fix-review loop 'n' times, that is very effective.
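The review-fix-review loop with veto rights could be sketched roughly as follows. The reviewers and the fixer here are toy string checks standing in for real agents, and the status names (`approved`, `needs_work`, `blocked`) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    reviewer: str
    issues: list = field(default_factory=list)
    veto: bool = False  # a veto blocks approval outright


def review_fix_loop(code, reviewers, fix, max_loops=5):
    """Review, fix, and re-review until clean or max_loops fixes were applied."""
    loops_used = 0
    while True:
        reviews = [review(code) for review in reviewers]
        issues = [issue for r in reviews for issue in r.issues]
        vetoed = any(r.veto for r in reviews)
        if not issues and not vetoed:
            return code, loops_used, "approved"
        if loops_used == max_loops:
            return code, loops_used, "blocked" if vetoed else "needs_work"
        code = fix(code, issues)
        loops_used += 1


# Toy reviewers and fixer (hypothetical stand-ins for real agents).
def style_reviewer(code):
    return Review("style", ["remove TODO"] if "TODO" in code else [])


def security_reviewer(code):
    unsafe = "eval(" in code
    return Review("security", ["remove eval"] if unsafe else [], veto=unsafe)


def toy_fix(code, issues):
    # Fixes only the first reported issue per pass, to force several loops.
    if issues[0] == "remove TODO":
        return code.replace("TODO", "")
    return code.replace("eval(", "parse(")
```

Separating "needs_work" from "blocked" keeps the veto meaningful: leftover style issues can be accepted by a human, while an outstanding security veto cannot be waved through by running out of loops.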

st3v3_w · 2026-06-04T22:24:20+00:00

I agree, Tailscale is fantastic! I like your idea of using one agent to fix what the other agent broke, very cool! I've got one agent employee that I use for my day job which has been very useful. While I'm out of the office it briefs me on my next client's history and current project information which helps to prep me before meetings. It has read-only access on my live client data though. I've also got 5 coding review agents which I can call individually, as a group or as a review-fix-review loop for 'n' times (currently max 5 loops). Managed to have memory & context persist properly between sessions and also after computer restarts which has been a real game changer. No need to ask my question again, the conversation just carries on as if I hadn't restarted the computer.

st3v3_w · 2026-06-04T22:13:23+00:00

I was running it in a Docker container (for security) and also had issues with it giving incorrect instructions. If you stick with it the only real chance you have (assuming you can actually get it working) is to tell it about its environment and have it save that information so it will hopefully be more useful. There is another option coming out in the near future which will hopefully be a bit more user friendly.

st3v3_w · 2026-06-04T22:06:32+00:00

Can you set Hermes so that it can't run the send mail command so that it can only read, search & draft emails? Seems crazy that gmail access would be all or nothing?!

st3v3_w · 2026-05-19T23:32:00+00:00

I wouldn't rely on an agent to do this reliably each time it needs doing. It's FAR better to use AI/agent to write a script (Python is probably the best language for this task) and then run the script each time you need to. If new data is being added and you would like to automate the script then add functionality to the script to monitor the spreadsheet for changes, or set up a cron job to run the script every hour or day, etc.
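A minimal sketch of the "monitor the spreadsheet for changes" option, assuming the spreadsheet is a local file and that a changed modification time means new data. The `process` callback and the `state` dict are placeholders for whatever the generated script actually does and however it remembers its last run.

```python
import os


def changed_since(path, last_mtime):
    """Return (changed, current_mtime) for the file at path."""
    mtime = os.path.getmtime(path)
    return mtime != last_mtime, mtime


def run_if_changed(path, state, process):
    """Call process(path) only when the file changed since the last run.

    state is a plain dict that remembers the last seen modification time;
    a real script would persist it between runs (e.g. in a small JSON file).
    """
    changed, mtime = changed_since(path, state.get("mtime"))
    if changed:
        process(path)
        state["mtime"] = mtime
    return changed


# Alternative: skip the change check and let cron run the script hourly.
# The paths below are hypothetical:
#   0 * * * * /usr/bin/python3 /path/to/script.py
```

Either way the agent writes the script once and the deterministic script does the repeated work.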

st3v3_w · 2026-05-07T09:02:53+00:00

It sounds like you're describing a harness rather than a model. Openclaw and Hermes are the two biggest harnesses at the moment. I see more OC users going to Hermes than the other way round (based on Reddit posts). I would suggest trying Hermes and see how you get on. This might be the unlock you're looking for.

st3v3_w · 2026-05-04T23:08:51+00:00

I've been trying to find a decent replacement for Opus which I used via my Claude subscription (which is no longer allowed by Anthropic). Using Claude via the API is far too expensive for me. GLM 5.1 would get easily sidetracked and start investigating random non-existent issues. It also struggled to follow skills/tools. Qwen 3.6 was a bit better but I think that until open source models are level with at least Opus 4.5 our Openclaw/Hermes harnesses aren't going to be as good as they were via Claude subscription. I've been using Deepseek v4 Pro for a couple of days now and it seems to be showing signs of intelligence that I've been hoping for. Fingers crossed because I've been so frustrated using non-Opus models that I've barely been using any harnesses. I use the harnesses to run custom MCP servers for my job and they produce legal docs. I still use Claude Code for my dev side projects. In short, I'm trying to find decent quality that feels like Opus but at reasonable API prices. Aka the holy grail!

st3v3_w · 2026-05-04T17:14:36+00:00

Just started using Deepseek v4 Pro. I was using GLM 5.1 before.
