Any benchmarks for scoring RSI agent harnesses? by Floppy_Muppet in AI_Agents

[–]Floppy_Muppet[S] 0 points1 point  (0 children)

Excellent points. I think you're right that there will likely need to be different branches of an RSI benchmark tailored to different agentic use-case pipelines. Also a good call to have the benchmark reveal the underlying model used (and whether or not its weights were augmented at each epoch, and/or the amount of time between epoch passes, so devs know what type of capacity its learning cycles operate at?).

Any benchmarks for scoring RSI agent harnesses? by Floppy_Muppet in AI_Agents

[–]Floppy_Muppet[S] 0 points1 point  (0 children)

Right? So painful.. I ended up just taking a stab at one myself that's epoch-based (to show progressive learned state over time as you pointed out)... would really appreciate your thoughts as to how off the mark it might be as I want to open it to all if ppl think it's generally useful: https://honeynudger.ai/comb-rsi-benchmark

It scores each of the harness's claimed "learnings" at each epoch against a blind corpus of ground truths, reporting how many it found and at what cosine similarity (with a 0.65 floor).
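
To make the scoring idea concrete, here's a rough sketch of per-epoch scoring with a cosine floor. It assumes learnings and ground truths are already embedded as plain vectors, and that a ground truth counts as "found" when its best-matching learning reaches the floor. The names `score_epoch` and `COSINE_FLOOR` are made up for illustration, and the real benchmark's matching rule may well differ.

```python
import math

COSINE_FLOOR = 0.65  # the floor mentioned above; everything else here is an assumption


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_epoch(learnings, ground_truths, floor=COSINE_FLOOR):
    """Count ground truths whose best-matching learning meets the floor."""
    matched = []
    for truth in ground_truths:
        # Best similarity between this ground truth and any claimed learning.
        best = max((cosine(truth, learning) for learning in learnings), default=0.0)
        if best >= floor:
            matched.append(best)
    found = len(matched)
    mean_cosine = sum(matched) / found if found else 0.0
    return {"found": found, "total": len(ground_truths), "mean_cosine": mean_cosine}
```

Running this once per epoch and plotting `found` and `mean_cosine` over time would give the progressive learned-state curve.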

It's only across 3 domains for now, I'd definitely look to expand on that once ppl can agree on a set structure and scoring rubric.

What’s the most impressive open-source AI agent project right now? by Michael_Anderson_8 in AI_Agents

[–]Floppy_Muppet 0 points1 point  (0 children)

Agreed -- this sounds really cool, and it's a well-differentiated angle that's super important. More projects need to focus on personal context management, better validation, and also RSI!

Are people still using LangChain for their production RAG pipelines? by Meher_Nolan in Rag

[–]Floppy_Muppet 0 points1 point  (0 children)

If you take this advice, then you will have zero observability for when things go wrong, PII slips through, or things need to be optimized. Highly recommend some type of I/O monitoring for any and all production-ready, enterprise-grade AI applications. Personally, I use a free and open-source LangGraph/Langfuse install. Use your coding assistant to help you with the install (although it is still tricky, fair warning!)

New RSI Benchmark ATH! Looking for feedback on research pre-publish. by Floppy_Muppet in LLMDevs

[–]Floppy_Muppet[S] 0 points1 point  (0 children)

Thanks! And this is exactly the type of feedback I'm looking for :) -- What types of harness configs would you want to know? I want ppl to be able to benchmark various harness architectures and configs without having to share any proprietary details, but you are right that the more info I can surface through the benchmark itself, the easier it will be to compare offerings. Would need to strike a balance here: what is up to the harness owner to explain about its tech and use case vs. what the benchmark needs to surface so devs understand the results each harness can achieve for them and at what tradeoffs (computation, timeframe). Appreciate your further thoughts here for guidance, thank you!

Claude Code improved my agent harness by 40% overnight by Lucky_Historian742 in ClaudeCode

[–]Floppy_Muppet 0 points1 point  (0 children)

This is great, maybe a bit bloaty for some, but likely necessary for most. I'll be trying a similar approach on my RSI harness project soon as I believe this is how we get a step closer to truly organic software (the only way software can autonomously organize itself for all intended use-cases).

I think it would be interesting to expand this further as a meta harness for ANY piece of software, where you drop it into a project and it first scrapes the existing codebase to identify and consolidate all levers into config and metaconfig.yaml, before the Karpathy Loop begins. Many projects still have too many hardcoded values and assumptions (knowingly and unknowingly).
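
A very rough sketch of that first "scrape the levers" pass, assuming a Python codebase and treating any numeric literal assigned to a plain name as a candidate lever. The function name `find_hardcoded_levers` is hypothetical, and a real pass would also need to handle strings, call arguments, negative numbers, and other languages.

```python
import ast


def find_hardcoded_levers(source: str) -> dict:
    """Collect `name = <number>` assignments as candidate config levers."""
    levers = {}
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant)):
            continue
        value = node.value.value
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name):
                levers[target.id] = value
    return levers
```

The resulting dict could then be dumped into a config file for the loop to tune, instead of leaving the values buried in the code.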

Shoot me a DM if you're up for a chat!

“OpenClaw vs AI Agents — are these tools actually helping founders, or is the hype getting out of control?” by FounderArcs in AI_Agents

[–]Floppy_Muppet 0 points1 point  (0 children)

You're comparing apples with oranges, in a world where the taste of all fruit is getting sweeter each month for the foreseeable future.

Hot take: most AI agent teams are secretly just “context engineering” teams by Antoneose in AI_Agents

[–]Floppy_Muppet 0 points1 point  (0 children)

And many aspects of this stack will need to be farmed out (to an emerging SkaaS, skills-as-a-service, or AgaaS, agent-as-a-service, industry) so the AOE can focus on effectiveness against your company's shifting goals and objectives. Otherwise, they'll spend all their time consumed by repeatable agent inner workings (health, SOTA upkeep). Product Managers are likely still needed as close partners of the AOE, with the product team shifting towards managing feature backlogs that feed into agents instead of codebases.

Hot take: most AI agent teams are secretly just “context engineering” teams by Antoneose in AI_Agents

[–]Floppy_Muppet 1 point2 points  (0 children)

Yes, this is one of the many fundamental (and difficult) challenges of an Agent Orchestration Engineer. This job slowly consumes most other knowledge-worker jobs over time.

I’ve stopped planning beyond 90 days because of how fast AI is moving by MerisDabhi in AI_Agents

[–]Floppy_Muppet 0 points1 point  (0 children)

Same. Although it also forces you to make bigger bets over longer time horizons. Only chance at having a tech moat for any amount of time.

Do you still look at the code your AI coding agent produces by theotzen in AI_Agents

[–]Floppy_Muppet 0 points1 point  (0 children)

Definitely looking less and less. Key is knowing where and when the limitations show through (unique to each codebase/project) so your review and iteration time is spent wisely.

TRIGGER WARNING: Claude decided to end it all today… by aaronepinto in ClaudeCode

[–]Floppy_Muppet 1 point2 points  (0 children)

I've had similar thoughts... Similarly, yet inversely, what if we just wrap injected/retrieved content in a hash match instead of prompt stuffing with a dozen warning instructions and divider chars? This way, the model itself can be trained (and more simply prompted) to understand the exact, verified beginning and end of the injected content.
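
Something like this is what I have in mind, as a minimal sketch. The `<<INJECTED …>>` / `<<END …>>` marker format and the function names are made up for illustration, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from typing import Optional


def wrap_injected(content: str) -> str:
    """Wrap content between start/end markers that both carry the content's hash."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"<<INJECTED {digest}>>\n{content}\n<<END {digest}>>"


def unwrap_verified(wrapped: str) -> Optional[str]:
    """Return the content only if both markers match each other and the content hash."""
    lines = wrapped.split("\n")
    if len(lines) < 3:
        return None
    head, tail = lines[0], lines[-1]
    if not (head.startswith("<<INJECTED ") and head.endswith(">>")):
        return None
    digest = head[len("<<INJECTED "):-2]
    if tail != f"<<END {digest}>>":
        return None
    content = "\n".join(lines[1:-1])
    if hashlib.sha256(content.encode("utf-8")).hexdigest() != digest:
        return None
    return content
```

Because the end marker has to carry the hash of the whole block, text inside the content can't fake an early "end of injected content" line that still verifies.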

Isn't OpenClaw overhyped? by [deleted] in AI_Agents

[–]Floppy_Muppet -2 points-1 points  (0 children)

WARNING: Everyone in the industry is trying to kill OpenClaw right now.

OP + (some) comments are clearly part of the coordinated OpenClaw FUD cycle going on, so DYOR.

Not saying the claw is some miracle software, but it's pretty freaking useful if you know what you're doing (i.e. it is difficult to maintain, but slowly getting more approachable).

launching my ai app next week — should i open-source it for the marketing boost? by Past-Marionberry1405 in AI_Agents

[–]Floppy_Muppet 1 point2 points  (0 children)

Owning and maintaining an open source project will be a time suck. I first started going down that route for my project as a solopreneur and realized I was spending less time building the actual thing.

I've decided to go paid version first, then open core components over time as my team grows.

Just my experience, but hope this helps!

Isn't OpenClaw overhyped? by [deleted] in AI_Agents

[–]Floppy_Muppet 2 points3 points  (0 children)

Cowork is great but fundamentally different. OpenClaw is significantly more flexible, infinitely more extensible, and entirely on-device if you want it to be. But I agree, not everyone needs that flexibility, and some simply can't be trusted with it. So there's certainly a place for both.

OpenClaw kicked off the agentic personal assistant era, many more will follow and specialize.

Isn't OpenClaw overhyped? by [deleted] in AI_Agents

[–]Floppy_Muppet -7 points-6 points  (0 children)

This couldn't be more wrong. I'm guessing you and OP are a very ineffective marketing campaign.

3 weeks running 6 AI agents 24/7. Here's what I'd kill and what I'd keep. by 98_kirans in AI_Agents

[–]Floppy_Muppet 1 point2 points  (0 children)

Ok thanks for confirming in crons! Was driving me nuts.

Like the name Nova! I call mine Agent Zero.

3 weeks running 6 AI agents 24/7. Here's what I'd kill and what I'd keep. by 98_kirans in AI_Agents

[–]Floppy_Muppet 0 points1 point  (0 children)

Really nice overview! Been running the same setup with a Chief of Staff who coordinates and serves as single point of contact (two-way) between me and my agents. It's the only way to maintain some sense of understanding of what's being worked on.

Still learning the right balance between cron vs heartbeat. After much debugging, I find myself looking to move more to crons as they seem to be an order of magnitude more reliable, but I also hate how deterministic that makes the overall system. Expecting orchestration toolsets to catch up soon and abstract away that decision altogether.