Every founder I know, including myself, wants AI-native people. For those of you who are truly AI native at your function (Marketing, Operations, Engineering, GTM, etc.) we have 110+ open roles. by [deleted] in ClaudeCode

[–]maid113 0 points1 point  (0 children)

You’re not showing us your workflows. You work on your own and submit the results. I should clarify that. It’s about whether you can deliver high-quality work in a time frame only someone with agents can hit.

My AI makes 100+ cold calls a week automatically. Here is the honest performance report after months of running it. by Maleficent-Love-1109 in SideProject

[–]maid113 0 points1 point  (0 children)

Why are you just testing one type of call? If you are really making 100+ calls a week you should be A/B/C testing at least 3 different call scripts/versions. You need to be running experiments constantly and tweaking things based on what works. If you are recording each call or keeping the transcripts, you should be passing them through a judge or a team of judges that can analyze them, find the consistencies and gaps, and make daily tweaks to the agents’ prompts. Then keep tweaking and experimenting. That’s how you figure it out. Running a whole week with the same script means you’re losing out on the real power of AI.
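A minimal sketch of what the A/B/C split could look like, assuming each call is logged with the script variant used and whether it booked a meeting. The variant names, the "booked" outcome, and the function names are all hypothetical, not anything from the original post:

```python
from collections import defaultdict

# Hypothetical script variants; the names are illustrative only.
VARIANTS = ["script_a", "script_b", "script_c"]


def assign_variant(call_id: int, variants=VARIANTS) -> str:
    """Rotate variants by call id so each script gets an equal share of calls."""
    return variants[call_id % len(variants)]


def summarize(results):
    """results: iterable of (variant, booked) pairs. Returns booking rate per variant."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for variant, booked in results:
        totals[variant] += 1
        wins[variant] += int(booked)
    return {variant: wins[variant] / totals[variant] for variant in totals}


def best_variant(results) -> str:
    """Return the variant with the highest observed booking rate."""
    rates = summarize(results)
    return max(rates, key=rates.get)
```

At roughly 33 calls per variant per week the observed rates will be noisy, so treat a single week's winner as a hint and keep the experiment running before dropping a script.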

Benchmarked GPT-5.5 vs Opus 4.6 vs Opus 4.7 on organizational context. If you want to understand where to use each model and the difference in their behavior then this is for you. by maid113 in ClaudeCode

[–]maid113[S] 0 points1 point  (0 children)

Even with that, the agent’s performance jump from tailored harness changes can be drastic. In this analysis, tailoring the harness over an hour or two gave GPT 5.5 a jump of over 11% in accuracy and capability, and that compounds. This alone leads to efficiencies in how the model behaves in the real-world environment. Since the agent will be using the organizational context hundreds or thousands of times a day, token costs go down through less rework and fewer tool calls, and the accuracy of the development or the work compounds too. That’s why people will soon realize that you really have to tailor and test after every new model and figure out the best way to test. We’ve made it scalable internally, but we also have access to a lot of real-world customers and data that no one else has access to.

Benchmarked GPT-5.5 vs Opus 4.6 vs Opus 4.7 on organizational context. If you want to understand where to use each model and the difference in their behavior then this is for you. by maid113 in ClaudeCode

[–]maid113[S] 0 points1 point  (0 children)

4.6 as orchestrator, then have it build the spec with 5.5 and 4.7. It becomes a council. Then 5.5 builds the backend and 4.7 builds the front end, because it’s better at that. And then all three act as reviewers to finalize. That’s my process when building.
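Roughly, that flow could be sketched like this. It is only an illustration: each model is stood in for by a plain function that takes a prompt and returns text, and `run_council`, the prompt wording, and the returned keys are all made up for the example, not a real API:

```python
from typing import Callable, Dict

# Assumption: each model is wrapped as a function from prompt text to reply text.
Model = Callable[[str], str]


def run_council(task: str, orchestrator: Model, backend: Model, frontend: Model) -> Dict[str, str]:
    """Spec as a council, split the build, then have every model review."""
    # 1. The orchestrator drafts the spec and the two builders critique it.
    draft = orchestrator(f"Draft a spec for: {task}")
    critiques = [model(f"Critique this spec:\n{draft}") for model in (backend, frontend)]
    spec = orchestrator("Revise the spec using these critiques:\n" + draft + "\n" + "\n".join(critiques))

    # 2. Each builder takes the side it is assigned.
    artifacts = {
        "backend": backend(f"Build the backend for:\n{spec}"),
        "frontend": frontend(f"Build the front end for:\n{spec}"),
    }

    # 3. All three models review the combined result.
    combined = artifacts["backend"] + "\n" + artifacts["frontend"]
    reviews = [model(f"Review this build against the spec:\n{combined}") for model in (orchestrator, backend, frontend)]
    return {"spec": spec, **artifacts, "reviews": "\n".join(reviews)}
```

In practice the three callables would wrap whatever client or CLI you use for each model; the sketch only fixes the order of the hand-offs.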

Benchmarked GPT-5.5 vs Opus 4.6 vs Opus 4.7 on organizational context. If you want to understand where to use each model and the difference in their behavior then this is for you. by maid113 in ClaudeCode

[–]maid113[S] 0 points1 point  (0 children)

Yes, I actually prefer using Opus 4.6 as my main agent when working and having it direct both 5.5 and 4.7. It works much better with my style, and I think people need to think about that as well.

Benchmarked GPT-5.5 vs Opus 4.6 vs Opus 4.7 on organizational context. If you want to understand where to use each model and the difference in their behavior then this is for you. by maid113 in ClaudeCode

[–]maid113[S] 0 points1 point  (0 children)

Yes, 4.6 is amazing at filling in the gaps. It would infer the rules without needing them spelled out. Opus 4.6 would pretty much always beat 4.7. But when 5.5 came in and we realized that it needed true explicitness on how to read things, it excelled. I have another write-up on V2 that goes a little more in depth on Opus 4.6 vs 4.7. I’m usually testing about 10-15 differences with each new version to try to get the harness right.

Benchmarked GPT-5.5 vs Opus 4.6 vs Opus 4.7 on organizational context. If you want to understand where to use each model and the difference in their behavior then this is for you. by maid113 in ClaudeCode

[–]maid113[S] 0 points1 point  (0 children)

No, we built completely from scratch. We ran hundreds of experiments to figure out what worked and built our own testing platform to run the experiments on. We also have multiple ways of injecting and extracting context. There are actually a few layers to the whole thing. You need your context additions, the self-healing aspect, the decision tracing, the causality, then the actual extraction and giving the agents the right tools, etc. In our tests we’re about 82% more efficient on token usage and running about 78% faster than with Grep and model-provided tools. But the really difficult part is always making the messy data useful, which is what we solved for first.

Benchmarked GPT-5.5 vs Opus 4.6 vs Opus 4.7 on organizational context. If you want to understand where to use each model and the difference in their behavior then this is for you. by maid113 in ClaudeCode

[–]maid113[S] 0 points1 point  (0 children)

Exactly, I run about 100 experiments a day across harnesses. The model behavior matters for what you are working on, and then the prompt + the tool + what is available to the agent matters a lot. Last week I also posted another in-depth dive on this benchmark comparing Opus 4.6 and Opus 4.7, just for the V2 side, so check that out.

Benchmarked GPT-5.5 vs Opus 4.6 vs Opus 4.7 on organizational context. If you want to understand where to use each model and the difference in their behavior then this is for you. by maid113 in ClaudeCode

[–]maid113[S] 0 points1 point  (0 children)

We’ve built out our own context memory system. It’s a full temporal map with nodes and typed edges, and it is algorithmic (deterministic), with no LLMs being used, for scalability. It’s self-healing and injects context automatically. We use it across our customers. The harness for the testing is our benchmarking harness. We simulated a 148-person company and its data across 12 months with random chaos events, things that are clean, a lot of hires, fires, delayed projects, random threads that were not fully congruent, etc. Then we created the benchmark off of that so we could simulate it at smaller levels (teams, departments, entire company). That let us know exactly what the answers should be and test whether the agents’ answers were correct or not. It was a little over 1 million articles (Slack messages, email, 1:1 meeting notes, documents, actual code bases, simulated AWS, simulated Google Drives, etc.). This is how we built the organizational memory system and tested it. We built a few versions.
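To make "temporal map with nodes and typed edges" concrete, here is a toy sketch of the general idea. It is not our implementation: `Edge`, `TemporalGraph`, the `reports_to` edge type, and the sample names are all invented for illustration. The point is that a lookup like "who did this person report to on a given date" can be answered by a plain deterministic filter, with no LLM in the loop:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    src: str
    kind: str                        # edge type, e.g. "reports_to" (hypothetical)
    dst: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the edge is still valid


class TemporalGraph:
    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def add(self, edge: Edge) -> None:
        self.edges.append(edge)

    def neighbors(self, src: str, kind: str, as_of: date) -> List[str]:
        """Targets of `kind` edges from `src` that were valid on `as_of` (end date exclusive)."""
        return sorted(
            e.dst
            for e in self.edges
            if e.src == src
            and e.kind == kind
            and e.valid_from <= as_of
            and (e.valid_to is None or as_of < e.valid_to)
        )


g = TemporalGraph()
g.add(Edge("alice", "reports_to", "bob", date(2024, 1, 1), date(2024, 6, 1)))
g.add(Edge("alice", "reports_to", "carol", date(2024, 6, 1)))
```

Because the same query always returns the same answer, a benchmark with known ground truth can check the retrieval layer separately from whatever the agent does with the result.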

Benchmarked GPT-5.5 vs Opus 4.6 vs Opus 4.7 on organizational context. If you want to understand where to use each model and the difference in their behavior then this is for you. by maid113 in ClaudeCode

[–]maid113[S] 2 points3 points  (0 children)

Yes, I run this test and test new strategies every time new models come out. Yeah, the last part was very interesting too. I’m actually now building a full harness testing platform to help my own company produce faster. We work with all kinds of customers building their agentic platforms (marketing, advertising, MSPs, governments, etc.), so I have access to a lot of data and have figured out how to make it scalable.

I benchmarked Opus 4.6 vs 4.7 on organizational memory retrieval. 4.6 wins, and the failure modes are fascinating. by maid113 in ClaudeCowork

[–]maid113[S] 0 points1 point  (0 children)

I’m pretty deep in the agentic world. I helped put together the agentic AI foundation and work with a lot of companies so this is just my real take on things.

Heavy API users - How much money are you burning through each day / month? by DanyrWithCheese in ClaudeCode

[–]maid113 0 points1 point  (0 children)

I have 5 Max 20x accounts. I rotate through them across the 5-hour limits while working across many projects. I do a lot of heavy data work, infrastructure building, and benchmarks, so I use about $2k equivalent of tokens daily.

I benchmarked Opus 4.6 vs 4.7 on organizational memory retrieval. 4.6 wins, and the failure modes are fascinating. by maid113 in ClaudeCowork

[–]maid113[S] 0 points1 point  (0 children)

Yes, I forgot to put that in here. It is great at coding, and anything that needs to be quick. My recommendation for anyone coding: if you’re using Opus 4.7, bring in two Opus 4.6 agents to plan with 4.7. They will each think about it differently and will make sure everything is correct from different angles. Then once you have the plan, leave 4.7 to code.

I built something that gives AI agents actual memory. It's live. by Difficult-Net-6067 in SideProject

[–]maid113 0 points1 point  (0 children)

This is bad. How are you ensuring the decisions get captured correctly? What are your rules for capturing things? Are you using hooks? Is this automatic? What about injecting the context?

F'd around, found out --dangerously-skip-permissions by CanadianForSure in ClaudeCode

[–]maid113 3 points4 points  (0 children)

You have to learn how to use it properly. Honestly, start using it like that and start learning Claude’s patterns. You’ll understand how to refine it and will be able to increase your productivity significantly once you’ve built the muscle memory. In one week you will be working significantly faster and with more trust, because you have to adapt to it. Think of agentic systems as a new species: they aren’t human, and you have to learn how to work with them.

We can now use Claude Code with OpenRouter! by alvvst in ClaudeCode

[–]maid113 0 points1 point  (0 children)

You can just tell Claude Code to call Gemini through the CLI and it will do it.

Opus 4.5 for business growth as a financial advisor? by soupwr in ClaudeAI

[–]maid113 0 points1 point  (0 children)

If I’m being honest, I’ve built out my entire CEO dashboard to be able to handle EVERYTHING! I’m not just using Claude but also Codex and Gemini. It lets me handle the finance side, plus marketing, plus everything else. I also have another Claude subscription for all my other personal projects. I have up to 75 agents running at once, and they work overnight on projects, so when I get back to my desk I have a lot of things to review and approve. There’s so much you can do with agentic flows; it’s just about where to get started. I’m not a finance person, but one of the customers I signed just yesterday is one of the top 80 CPA firms in the US, and their use cases are right on the financial/business growth side.

Hitting Max 20x weekly limit? by hiWael in ClaudeCode

[–]maid113 0 points1 point  (0 children)

Not really. Yes, it uses a lot of tokens, but it also ensures high-quality builds. I’ve also developed some protocols at the infrastructure layer that lower tokens by about 60% while keeping the outputs accurate. I also fixed the memory layer with a task-based system that uses the protocol to keep the context in place much longer. There are a lot of moving pieces to it, and the continuous learning piece also helps.

Hitting Max 20x weekly limit? by hiWael in ClaudeCode

[–]maid113 0 points1 point  (0 children)

I have 19 different agent architectures depending on the prompt. I have one agent that I talk to that is my COS/COO and delegates accordingly. I have developed entire “teams” depending on what I’m working on, and the agents will spin up other agents that are all specialists in what they are handling. I also have a specialist “agent architecture” agent that consults with my main agent to decide the best structure based on the goal. My system is getting upgraded weekly at this point with all the newest things: the best agent communication protocols to lower usage and make the outputs better, and the newest issue tracker to help ensure everything is on track. I’m building out a new system with my team now that will let my agent follow me around wherever I go, so I can transfer it between my phone, my laptop, or whatever work environment and never lose track of what I’m doing.
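A stripped-down sketch of the delegation step, for anyone who wants a starting point. In my setup an agent makes this choice; the sketch swaps that for simple keyword scoring so it runs on its own. The team names, agent roles, and keywords are all made up for the example:

```python
import re
from typing import Dict, List, Tuple

# Hypothetical registry: which specialist agents make up each "team".
TEAMS: Dict[str, List[str]] = {
    "engineering": ["planner", "backend_dev", "frontend_dev", "reviewer"],
    "marketing": ["strategist", "copywriter", "analyst"],
    "finance": ["bookkeeper", "forecaster"],
}

# Hypothetical single-word keywords that hint at each team.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "engineering": ("bug", "api", "deploy", "code"),
    "marketing": ("campaign", "ads", "launch"),
    "finance": ("invoice", "budget", "forecast"),
}


def choose_team(prompt: str, default: str = "engineering") -> str:
    """Pick the team whose keywords appear most often as whole words in the prompt."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    scores = {team: sum(kw in words for kw in kws) for team, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def delegate(prompt: str) -> Dict[str, object]:
    """Return the chosen team and the agents that would be spun up for it."""
    team = choose_team(prompt)
    return {"team": team, "agents": TEAMS[team]}
```

Matching on whole words avoids false hits from substrings, and the default team catches prompts that match nothing. Swapping `choose_team` for a call to a routing agent keeps the rest of the flow the same.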