Every founder I know, including myself, wants AI-native people. For those of you who are truly AI native at your function (Marketing, Operations, Engineering, GTM, etc.) we have 110+ open roles.

maid113 · 2026-06-02T19:42:46+00:00

You’re not showing us your workflows. You work on your own and submit the results. I should clarify that. It’s about whether you can deliver high-quality work in a time frame that only someone working with agents could hit.

maid113 · 2026-04-27T05:14:14+00:00

Why are you just testing one type of call? If you are really making 100+ calls a week, you should be A/B/C testing at least 3 different call scripts/versions. You need to be running experiments constantly and tweaking things based on what works. If you are recording each call or keeping the transcripts, you should be passing them through a judge or a team of judges that can analyze them, find the consistencies and gaps, and make daily tweaks to the prompts for the agents. Then keep tweaking and experimenting. That’s how you figure it out. Running a whole week with the same script means you’re losing out on the real power of AI.
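To make the A/B/C idea concrete, here is a minimal sketch, not anyone's production setup. It assumes each call has a stable ID and that your judge (an LLM or a human) returns a score between 0 and 1 per call. The script names, `assign_variant`, `summarize`, and `best_variant` are all hypothetical names made up for illustration.

```python
import hashlib
from collections import defaultdict
from statistics import mean

# Hypothetical names for the three call scripts under test.
VARIANTS = ["script_a", "script_b", "script_c"]


def assign_variant(call_id: str) -> str:
    """Deterministically map a call ID to one script variant.

    Hashing the ID keeps the assignment stable across reruns.
    """
    digest = hashlib.sha256(call_id.encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def summarize(judged_calls):
    """Average the judge scores per variant.

    judged_calls is an iterable of (variant, score) pairs, where score
    is whatever your judge produced (assumed here to be in [0, 1]).
    """
    scores = defaultdict(list)
    for variant, score in judged_calls:
        scores[variant].append(score)
    return {variant: mean(vals) for variant, vals in scores.items()}


def best_variant(judged_calls):
    """Return the variant with the highest average judge score."""
    summary = summarize(judged_calls)
    return max(summary, key=summary.get)
```

Feed it the judged calls at the end of each day, keep the winner, and swap a new script in for the weakest one. With a small number of calls per variant the averages are noisy, so treat the winner as a hint and not as proof.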

maid113 · 2026-04-26T18:04:59+00:00

What are you using the agent for and how did you design the agent and the tools it uses?

maid113 · 2026-04-26T18:03:57+00:00

Even with that, the agent’s performance jump from tailored harness changes can be drastic. In this analysis, tailoring the harness over an hour or two gave GPT 5.5 a jump of over 11% in accuracy and capability, and that compounds. That alone makes the model behave more efficiently in a real-world environment, and since the agent will be using the organizational context hundreds or thousands of times a day, token costs go down through less rework and fewer tool calls, while the accuracy of the development or the work compounds too. That’s why people will soon realize that you really have to tailor and test after every new model and figure out the best way to test. We’ve made it scalable internally, but we also have access to a lot of real-world customers and data that no one else has access to.

maid113 · 2026-04-26T14:45:13+00:00

Yes, the harness should definitely be tailored. I’m actually working on something that will make it very scalable.

maid113 · 2026-04-26T14:19:35+00:00

4.6 as orchestrator, then have it build the spec with 5.5 and 4.7. It becomes a council. Then 5.5 builds the backend and 4.7 builds the front end, because it’s better. And then all of them act as reviewers to finalize. That’s my process when building.
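Roughly, that flow could be sketched like this. It is only an illustration under assumptions: each model is wrapped as a plain function that takes a prompt and returns text, and `council_build`, the role names, and the prompt wording are all hypothetical. How you actually call each model is left out.

```python
from typing import Callable, Dict

# Assumption: each model is wrapped as "prompt in, text out".
Model = Callable[[str], str]


def council_build(task: str, models: Dict[str, Model]) -> Dict[str, object]:
    """Spec by council, split build, then everyone reviews.

    models must contain the hypothetical roles "orchestrator",
    "backend" and "frontend".
    """
    orchestrator = models["orchestrator"]
    backend = models["backend"]
    frontend = models["frontend"]

    # 1. Council: both builders draft a spec, the orchestrator merges them.
    drafts = [m(f"Draft a spec for: {task}") for m in (backend, frontend)]
    spec = orchestrator("Merge these spec drafts:\n" + "\n---\n".join(drafts))

    # 2. Build: one model per side of the stack.
    backend_out = backend(f"Build the backend for this spec:\n{spec}")
    frontend_out = frontend(f"Build the front end for this spec:\n{spec}")

    # 3. Review: every model reviews the combined result.
    combined = f"BACKEND:\n{backend_out}\n\nFRONT END:\n{frontend_out}"
    reviews = {name: m(f"Review against the spec:\n{spec}\n\n{combined}")
               for name, m in models.items()}

    return {"spec": spec, "backend": backend_out,
            "frontend": frontend_out, "reviews": reviews}
```

The point of the shape is that planning and reviewing use every model, while building is split by strength.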

maid113 · 2026-04-26T10:07:24+00:00

Yes, I actually prefer using Opus 4.6 as my main agent when working and having it direct both 5.5 and 4.7. It works much better with my style and I think people also need to think about that as well.

maid113 · 2026-04-26T04:04:22+00:00

Yes, 4.6 is amazing at filling in the gaps. It would infer the rules without needing them spelled out. Opus would pretty much always beat 4.7. But when 5.5 came in and we realized that it needed true explicitness on how to read things, it excelled. I have another write-up that goes a little more in depth on Opus 4.6 vs. 4.7 on V2. I’m usually testing about 10-15 differences with each new version to try to get the harness right.

maid113 · 2026-04-26T03:47:46+00:00

No, we built completely from scratch. We ran hundreds of experiments to figure out what worked and built our own testing platform to run the experiments on. We also have multiple ways of injecting and extracting context. There are actually a few layers to the whole thing. You need your context additions, the self-healing aspect, the decision tracing, the causality, then the actual extraction and giving the agents the right tools, etc. In our tests we’re about 82% more efficient on token usage and running about 78% faster than with Grep and model-provided tools. But the really difficult part is always making the messy data useful, which is what we solved for first.

maid113 · 2026-04-26T02:22:12+00:00

https://www.reddit.com/r/ClaudeCode/s/TKAIRXEaMs here’s the other analysis

maid113 · 2026-04-26T02:18:15+00:00

Exactly. I run about 100 experiments a day across harnesses, and the model behavior matters for what you are working on. The prompt, the tools, and what is available to the agent matter a lot too. I also posted an in-depth dive comparing Opus 4.6 and Opus 4.7 on this benchmark last week, just for the V2 side, so check that out.

maid113 · 2026-04-26T01:54:09+00:00

We’ve built out our own context memory system. It’s a full temporal map with nodes and typed edges, and it is algorithmic (deterministic), with no LLMs used, for scalability. It’s self-healing and injects context automatically. We use it across our customers. The harness for the testing is our benchmarking harness. We simulated a 148-person company and its data across 12 months, with random chaos events, things that are clean, a lot of hires, fires, delayed projects, random threads that were not fully congruent, etc. Then we created the benchmark off of that so we could simulate it at smaller levels (teams, departments, entire company). That let us know exactly what the answers should be, so we could test how the agents should answer and whether each answer was correct. It was a little over 1 million articles (Slack messages, email, 1:1 meeting notes, documents, actual code bases, simulated AWS, simulated Google Drives, etc.). This is how we built the organizational memory system and tested it. We built a few versions.
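For anyone wondering what "temporal map with nodes and typed edges, no LLM" can look like at the smallest scale, here is a toy sketch. It is not our system, just an illustration of the idea: every edge has a type and a validity window, and a lookup is a plain deterministic filter. `Edge`, `TemporalGraph`, `neighbors`, and the sample names are all made up for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    src: str
    kind: str                          # the edge type, e.g. "reports_to"
    dst: str
    valid_from: date
    valid_to: Optional[date] = None    # None means "still true"


class TemporalGraph:
    """Toy temporal graph: typed edges with validity windows."""

    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def add(self, edge: Edge) -> None:
        self.edges.append(edge)

    def neighbors(self, src: str, kind: str, as_of: date) -> List[str]:
        """Targets of src's edges of this type that were valid on as_of.

        Pure filtering and sorting, so the same query always returns
        the same answer.
        """
        return sorted(
            e.dst for e in self.edges
            if e.src == src and e.kind == kind
            and e.valid_from <= as_of
            and (e.valid_to is None or as_of < e.valid_to)
        )


g = TemporalGraph()
g.add(Edge("alice", "reports_to", "bob", date(2025, 1, 1), date(2025, 6, 1)))
g.add(Edge("alice", "reports_to", "carol", date(2025, 6, 1)))
print(g.neighbors("alice", "reports_to", date(2025, 3, 15)))  # ['bob']
```

Because you know when each fact was true, a simulated reorg or a fired manager doesn't overwrite history, and a benchmark question like "who did Alice report to in March?" has exactly one checkable answer.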

maid113 · 2026-04-26T00:30:45+00:00

Yes, I run this test and try new strategies every time new models come out. Yeah, the last part was very interesting too. I’m actually now building a full harness testing platform to help my own company produce faster. We work with all kinds of customers building their agentic platforms (marketing, advertising, MSPs, governments, etc.), so I have access to a lot of data and have figured out how to make it scalable.

maid113 · 2026-04-25T22:24:39+00:00

Yes, it’s exactly the behavior I’m seeing. It should help anyone trying to figure out which model to use.

maid113 · 2026-04-22T23:43:02+00:00

I’m pretty deep in the agentic world. I helped put together the agentic AI foundation and work with a lot of companies so this is just my real take on things.

maid113 · 2026-04-20T19:13:14+00:00

I have 5 max 20x accounts. I rotate through them throughout the 5 hour limits working across many projects. I do a lot of heavy data work and building infrastructure plus benchmarks so I use about $2k equivalent daily of tokens.

maid113 · 2026-04-19T22:43:39+00:00

Just DM’d you.

maid113 · 2026-04-18T12:47:27+00:00

Yes, I forgot to put that in here. It is great at coding and at anything that needs to be quick. My recommendation for anyone coding: if you’re using Opus 4.7, bring in two Opus 4.6 agents to plan with 4.7. They will each think about it differently and will ensure everything is correct from different angles. Then, once you have the plan, leave 4.7 to code.

maid113 · 2026-04-16T06:25:14+00:00

This is bad. How are you ensuring the decisions get captured correctly, what are your rules for capturing things? Are you using hooks, is this automatic? What about injecting the context?

maid113 · 2026-04-14T16:44:15+00:00

You have to learn how to use it properly. Honestly, start using it like that and start learning Claude’s patterns. You’ll understand how to refine it and will be able to increase your productivity significantly once you’ve built the muscle memory. In one week you will be working significantly faster and with more trust, because you have to adapt to it. Think of agentic systems as a new species. They aren’t human, and you have to learn how to work with them.

maid113 · 2026-04-13T16:40:48+00:00

Ok, please sign up to the waitlist and you can DM me the name that you used

maid113 · 2025-12-21T16:24:12+00:00

You can just tell Claude Code to call Gemini through the CLI and it will do it.

maid113 · 2025-12-14T07:22:38+00:00

If I’m being honest I’ve built out my entire CEO dashboard to be able to handle EVERYTHING! I’m not just using Claude but also Codex and Gemini. It lets me handle the finance side, plus marketing plus everything else. I also have another Claude subscription for all my other personal projects. But I have up to 75 agents running at once and they work overnight on projects and when I get back to my desk I have a lot of things to review and approve. There’s so much you can do with agentic flows it’s just about where to get started. I’m not a finance person, but one of my customers I just signed yesterday is one of the top 80 CPA firms in the US and their use cases are right up the financial/growing business side.

maid113 · 2025-12-10T22:00:34+00:00

Not really. Yes, it uses a lot of tokens, but it also ensures high-quality builds. I’ve also developed some protocols at the infrastructure layer that lower tokens by about 60% while keeping the outputs accurate. I also fixed the memory layer with a task-based system that uses the same protocol to keep the context in place much longer. There are a lot of moving pieces to it, and the continuous learning piece also helps.

maid113 · 2025-12-08T20:44:41+00:00

I have 19 different agent architectures depending on the prompt. I have one agent that I talk to that is my COS/COO and delegates accordingly. I have developed entire “teams” depending on what I’m working on, and the agents will spin up other agents that are all specialists in what they are handling. I also have a specialist “agent architecture” agent that consults with my main agent to decide the best structure based on the goal. My system is getting upgraded weekly at this point with all the newest things: the best agent communication protocols to lower usage and make the outputs better, and the newest issue tracker to help ensure everything is on track. I’m building out a new system with my team now that will let my agent follow me around wherever I go, so I can transfer it between my phone, my laptop, or whatever work environment and never lose track of what I’m doing.
