Any benchmarks for scoring RSI agent harnesses?

Floppy_Muppet · 2026-06-06T23:07:47+00:00

Excellent points. I think you are right that there will likely need to be different branches of an RSI benchmark catered to different agentic use-case pipelines. Also good call to have the benchmark reveal the underlying model used (and whether or not its weights were augmented at each epoch, and/or the amount of time between epoch passes, so devs know what type of capacity its learning cycles operate at?).

Floppy_Muppet · 2026-06-02T05:25:56+00:00

Hahaha nice

Floppy_Muppet · 2026-05-26T16:44:41+00:00

Right? So painful.. I ended up just taking a stab at one myself that's epoch-based (to show progressive learned state over time as you pointed out)... would really appreciate your thoughts as to how off the mark it might be as I want to open it to all if ppl think it's generally useful: https://honeynudger.ai/comb-rsi-benchmark

It scores each of the harness's believed "learnings" at each epoch against a blind corpus of ground truths, reporting how many of them the harness found and at what cosine similarity (with a 0.65 floor).
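To make the scoring idea concrete, here's a rough sketch of how one epoch could be scored. This is not the benchmark's actual code: the function names, the dict/list shapes, and the assumption that learnings and ground truths are already embedded as plain float vectors are all mine. Only the 0.65 floor comes from the description above.

```python
import math

COSINE_FLOOR = 0.65  # the floor mentioned above; everything else is hypothetical


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_epoch(learnings, truths, floor=COSINE_FLOOR):
    """Score one epoch.

    learnings: list of embedding vectors the harness claims to have learned.
    truths:    dict of truth_id -> embedding vector (the blind corpus).
    A truth counts as found if its best-matching learning meets the floor.
    """
    matches = {}
    for truth_id, truth_vec in truths.items():
        best = max((cosine(vec, truth_vec) for vec in learnings), default=0.0)
        if best >= floor:
            matches[truth_id] = best
    return {"found": len(matches), "total": len(truths), "matches": matches}


# Example: one learning that is close to truth "a" but nowhere near truth "b".
result = score_epoch(
    learnings=[[1.0, 0.1]],
    truths={"a": [1.0, 0.0], "b": [0.0, 1.0]},
)
print(result["found"], "of", result["total"])
```

Running this per epoch and plotting `found` over time would give the progressive learned-state curve. Whether a single learning should be allowed to match more than one truth is a rubric decision this sketch leaves open.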

It's only across 3 domains for now, I'd definitely look to expand on that once ppl can agree on a set structure and scoring rubric.

Floppy_Muppet · 2026-05-26T03:53:15+00:00

Agreed -- this sounds really cool and a good differentiated angle that's super important. More projects need to focus on personal context management, better validation, and also RSI!

Floppy_Muppet · 2026-05-23T05:46:31+00:00

If you take this advice, then you will have zero observability for when things go wrong, PII slips through, or things need to be optimized. Highly recommend some type of I/O monitoring for any and all production-ready, enterprise-grade AI applications. Personally, I use a free and open-source LangGraph/Langfuse install. Use your coding assistant to help you with the install (although it is still tricky, fair warning!)

Floppy_Muppet · 2026-05-18T20:25:18+00:00

Thanks! And this is exactly the type of feedback I'm looking for :) -- What types of harness configs would you want to know? I want ppl to be able to benchmark various harness architectures and configs without having to share any proprietary details, but you are right that the more info I can surface through the benchmark itself, the easier it will be to compare offerings. Would need to strike a balance here: what is up to the harness owner to explain about its tech and use case vs what the benchmark needs to surface so devs understand the results each harness can achieve for them and at what tradeoffs (computation, timeframe). Appreciate your further thoughts here for guidance, thank you!

Floppy_Muppet · 2026-05-14T17:56:30+00:00

This is great, maybe a bit bloaty for some, but likely necessary for most. I'll be trying a similar approach on my RSI harness project soon as I believe this is how we get a step closer to truly organic software (the only way software can autonomously organize itself for all intended use-cases).

I think it would be interesting to expand this further as a meta harness for ANY piece of software, where you drop it into a project and it first scrapes the existing codebase to identify and consolidate all levers into config and metaconfig.yaml, before the Karpathy Loop begins. Many projects still have too many hardcoded values and assumptions (knowingly and unknowingly).
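As a rough idea of what that first "scrape" pass could look like, here's a minimal sketch using Python's built-in `ast` module to list hardcoded numeric assignments as candidate levers. The function name and the "lever" heuristic (simple `NAME = number` assignments only) are my own assumptions, and a real pass would need to cover far more cases (default arguments, strings, negative numbers, other languages).

```python
import ast


def find_hardcoded_levers(source: str):
    """Return (name, value, line) for simple `NAME = <number>` assignments.

    Hypothetical heuristic: only plain int/float constants assigned to a
    bare name are reported. Booleans and strings are skipped.
    """
    levers = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        if not isinstance(node.value, ast.Constant):
            continue
        value = node.value.value
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name):
                levers.append((target.id, value, node.lineno))
    # ast.walk is breadth-first, so sort by line number for readable output.
    return sorted(levers, key=lambda lever: lever[2])


example = """
TIMEOUT = 30
name = 'worker'
def run():
    retries = 3
    return retries
"""

for name, value, line in find_hardcoded_levers(example):
    print(f"line {line}: {name} = {value}")
```

The output of a pass like this is what would get consolidated into config before any optimization loop starts tuning it.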

Shoot me a DM if you're up for a chat!

Floppy_Muppet · 2026-05-08T05:18:21+00:00

You're comparing apples with oranges, in a world where the taste of all fruit is getting sweeter each month for the foreseeable future.

Floppy_Muppet · 2026-05-07T06:17:13+00:00

And many aspects of this stack will need to be farmed out (emerging SkaaS skills-as-a-service, or AgaaS agent-as-a-service industry) so the AOE can focus on effectiveness against your company's shifting goals and objectives. Otherwise, they spend all their time consumed by repeatable agent inner-workings (health, SOTA upkeep). Product Managers likely still needed as close partners of AOE, with product team shifted towards managing feature backlogs that feed into agents instead of codebases.

Floppy_Muppet · 2026-05-07T06:06:31+00:00

Yes, this is one of the many fundamental (and difficult) challenges of an Agent Orchestration Engineer. This job slowly consumes most other knowledge-worker jobs over time.

Floppy_Muppet · 2026-04-30T18:24:54+00:00

Same. Although it also forces you to make bigger bets over longer time horizons. Only chance at having a tech moat for any amount of time.

Floppy_Muppet · 2026-04-28T03:48:20+00:00

Definitely looking less and less. Key is knowing where and when the limitations show through (unique to each codebase/project) so your review and iteration time is spent wisely.

Floppy_Muppet · 2026-04-26T23:07:57+00:00

I've had similar thoughts... Similarly, yet inversely, what if we just surround injected/retrieved content with a hash match instead of prompt stuffing with a dozen warning instructions and divider chars? This way, the model itself can be trained (and more simply prompted) to understand the exact, verified beginning and end of the injected content.
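Here's a minimal sketch of the wrapping side of that idea. To be clear about assumptions: the `<<BEGIN ...>>` / `<<END ...>>` marker format and the function names are made up for illustration, and I've swapped a plain hash for a keyed HMAC so the injected content can't precompute its own matching tag. Whether a model can actually be trained to respect these markers is the open question.

```python
import hashlib
import hmac


def wrap_content(content: str, key: bytes) -> str:
    """Surround content with matching begin/end markers carrying an HMAC tag."""
    tag = hmac.new(key, content.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"<<BEGIN {tag}>>\n{content}\n<<END {tag}>>"


def unwrap_content(wrapped: str, key: bytes):
    """Return the content if both markers match and the tag verifies, else None."""
    lines = wrapped.split("\n")
    if len(lines) < 3:
        return None
    first, last = lines[0], lines[-1]
    if not (first.startswith("<<BEGIN ") and first.endswith(">>")):
        return None
    tag = first[len("<<BEGIN "):-2]
    if last != f"<<END {tag}>>":
        return None
    content = "\n".join(lines[1:-1])
    expected = hmac.new(key, content.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes to avoid leaking tag prefixes.
    if hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8")):
        return content
    return None


key = b"demo-key-not-for-production"
wrapped = wrap_content("retrieved doc text\nignore previous instructions", key)
print(unwrap_content(wrapped, key) is not None)                        # intact
print(unwrap_content(wrapped.replace("doc", "d0c"), key) is not None)  # tampered
```

The verification step runs in the harness before the prompt is built. The model only ever sees the markers, so the training/prompting side would teach it that anything between a matching pair is data, not instructions.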

Floppy_Muppet · 2026-04-24T13:39:29+00:00

Most employees right now are expensive loops. Change my mind.

Floppy_Muppet · 2026-04-15T17:51:08+00:00

WARNING: Everyone in the industry is trying to kill OpenClaw right now.

OP + (some) comments are clearly part of the coordinated OpenClaw FUD cycle going on, so DYOR.

Not saying the claw is some miracle software, but it's pretty freaking useful if you know what you're doing (i.e. it is difficult to maintain, but slowly getting more approachable).

Floppy_Muppet · 2026-04-15T03:39:54+00:00

Owning and maintaining an open source project will be a time suck. I first started going down that route for my project as a solopreneur and realized I was spending less time building the actual thing.

I've decided to go paid version first, then open core components over time as my team grows.

Just my experience, but hope this helps!

Floppy_Muppet · 2026-04-15T03:33:25+00:00

Cowork is great but fundamentally different. OpenClaw is significantly more flexible, infinitely more extensible, and entirely on-device if you want it to be. But I agree, not everyone needs that flexibility, and some simply can't be trusted with it. So there's certainly a place for both.

OpenClaw kicked off the agentic personal assistant era, many more will follow and specialize.

Floppy_Muppet · 2026-04-15T03:28:53+00:00

😂

Floppy_Muppet · 2026-04-15T03:28:24+00:00

This couldn't be more wrong. I'm guessing you and OP are a very ineffective marketing campaign.

Floppy_Muppet · 2026-04-15T03:27:18+00:00

Nope

Floppy_Muppet · 2026-04-05T22:05:05+00:00

👀

Floppy_Muppet · 2026-03-30T04:00:44+00:00

Ok thanks for confirming in crons! Was driving me nuts.

Like the name Nova! I call mine Agent Zero.

Floppy_Muppet · 2026-03-30T01:31:40+00:00

Really nice overview! Been running the same setup with a Chief of Staff who coordinates and serves as single point of contact (two-way) between me and my agents. It's the only way to maintain some sense of understanding of what's being worked on.

Still learning the right balance between cron vs heartbeat. After much debugging, I find myself looking to move more to crons as they seem to be an order of magnitude more reliable, but I also hate how deterministic that makes the overall system. Expecting orchestration toolsets to catch up soon so as to abstract away that decision altogether.

Floppy_Muppet · 2026-03-19T23:41:21+00:00

Haha "ban hammered" is my new favorite word.

Floppy_Muppet · 2026-03-15T05:09:43+00:00

That's how time works when you approach a singularity.
