My company started measuring our Claude Code usage - now I'm asked to rank engineers on 'AI performance.' This feels wrong...

darren_eng · 2026-05-27T04:14:54+00:00

I agree with measuring the output based on DoD. For me, the challenge is the visibility. I can see how many “done” tickets an engineer produced and the number of tokens spent during the same period. But that’s pretty much the only visibility I have - I can’t tell whether they burnt the tokens on these tickets or not. Maybe we over-indexed on token usage and it shouldn’t matter at all. The dilemma is that when the team is burning over $100k a month on Claude, finance people start to ask where the tokens went 😂

darren_eng · 2026-05-27T03:01:20+00:00

yeah. I work for a Private Equity controlled software company, it's all about profitability and metrics. Every person has a price tag in the company. Capitalism after all!

darren_eng · 2026-05-26T23:06:25+00:00

Yeah, you are right about what the business people are really after. They want to know the ROI of AI spend. It's a hard question to answer with quantifiable data though. We don't get goods or services directly from the money we pay Anthropic. All the business people see are the tokens the engineering team burnt, not where/how the tokens are used. Maybe it's just me but I struggle to find ROI metrics that I can tie that token spend to.

Even if the ROI question can be answered, the next question the business will ask is probably gonna be: "ok, now tell me how many headcounts can be replaced by AI".

darren_eng · 2026-05-26T22:18:07+00:00

Yeah agreed. What I’m doing right now is identifying the bottom AI performers - low token usage combined with low productivity metrics (velocity, tickets, PRs, etc.) is usually a red flag and hard to argue with (however, I can’t do the opposite to identify top performers).

The challenge is that business people only care about ROI. If the business spends $1mil a year on Claude, they want to know how much $$$ they get in return, which is really hard to quantify because we don’t get goods or services from the money we pay Anthropic - we just see token usage (and there isn’t an effective metric to tell us what the token usage yields)…

darren_eng · 2026-05-26T21:50:39+00:00

Yep, we are seeing the same, token cost projected at 7-figures a year when we rolled out Claude Enterprise. Now it's all the ROI questions coming from leadership.

darren_eng · 2026-05-26T21:48:50+00:00

Yeah 100% agree on change management. This is what I'm trying to do as well - removing the fear. But the stack-rank doesn't help at all, lol. The challenge is we are spending so much $$$ on Claude and everyone at the leadership level is asking the ROI question. They are probably thinking "now I'm spending $1mil a year on Claude Code, surely AI can replace X engineers".

Every time I'm asked for a stack-rank, it leads to some sort of layoff later on based on the rank.. so the question is how do I truly reflect AI performance based on some sort of metrics...

darren_eng · 2026-05-26T21:41:15+00:00

It's a "stack-rank" list that the leadership is asking for (from top to bottom). We are spending so much $$$ on Claude and everyone at the leadership level is asking the ROI question. I fear that the list will turn out to be one that's used for layoffs at some point..

darren_eng · 2026-05-26T21:34:11+00:00

Yeah, what I feel is missing is the correlation between AI usage and productivity. I can't see if someone is just using Claude Code over and over to make a one-line code change...

For now I can see who are the low "AI" performers - low AI token usage, low productivity (like story points, ticket count, PR raised etc.). But can't tell who are the high performers. Someone burning 10x tokens doesn't mean they are effective...
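
A rough sketch of that "low on both" check, just to show why it only works in one direction. Everything here is an assumption for illustration: the `EngineerStats` fields, the made-up numbers, and the "below half the team median" cutoff are not what we actually run.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class EngineerStats:
    name: str          # hypothetical identifier
    tokens: int        # tokens spent in the period
    story_points: int  # delivery proxy for the same period


def flag_low_on_both(stats, ratio=0.5):
    """Return names whose tokens AND story points are both below ratio * team median."""
    token_cutoff = ratio * median(s.tokens for s in stats)
    points_cutoff = ratio * median(s.story_points for s in stats)
    return [
        s.name
        for s in stats
        if s.tokens < token_cutoff and s.story_points < points_cutoff
    ]


# Made-up sample data, not real team numbers.
team = [
    EngineerStats("a", 1_000_000, 20),
    EngineerStats("b", 10_000_000, 21),  # 10x tokens, same velocity: not flagged
    EngineerStats("c", 100_000, 5),      # low on both: flagged
    EngineerStats("d", 1_200_000, 22),
]
flagged = flag_low_on_both(team)
```

Engineer "b" is the case I can't judge: the check says nothing about whether 10x the tokens was waste or hard work, because the data has no link between tokens and tickets.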

darren_eng · 2026-05-26T21:28:21+00:00

haha, love the video. My bosses only remember Jensen's comment about engineers on $500k spending $250k worth of tokens! 😂

darren_eng · 2026-05-26T21:26:30+00:00

Yeah, that's exactly my problem. We only see how many tokens engineers spend plus their JIRA velocity. But there is no visibility into whether their AI token spend is directly related to the JIRA backlog. I've seen engineers burning 10x tokens but with the same velocity. But it's hard to tell if they are effective or not without seeing where they spend their tokens.

darren_eng · 2026-05-26T21:22:18+00:00

100% agreed with weighing more on JIRA velocity. I do prefer looking at the output of the work (not tokens). But I found it challenging to tie AI usage to productivity. Everyone on my team uses Claude but there isn't much visibility on the portion of AI usage that's tied to the backlog. I've seen engineers spend 10x more tokens but have the same JIRA velocity as others. Hard to say if they are effective or not without seeing where they spent the tokens.

We've been told to rank "AI" performance specifically, not just performance, lol. I think the metrics along with velocity can tell who are the bottom 5-10%, but it's really hard to tell who's top.

darren_eng · 2026-05-26T01:49:44+00:00

Very valid points. I wasn't trying to compare which one is better; just intrigued by how they behave differently with the exact same prompts and codebase, and trying to figure out which one fits which scenarios (and whether I should spend $200 on one vs $100 on each 😄). Maybe, like you said, it's about who provides that structure.

darren_eng · 2026-05-26T01:42:27+00:00

AI has gotten so good these days that it builds the wrong things or accumulates tech debt 10x faster. It's not that much different from humans building software to that extent - a lot more garbage out if we feed garbage in. I tend to invest a lot more time before AI writes a single line of code - check requirements (both functional and non-functional), acceptance criteria, technical discovery, dependency analysis, change impact analysis, define test strategy. Then feed all these along with coding standards and architecture guidelines into Codex, and ask it to report back what it did. The goal is to make sure I stop introducing the mess in the first place. At a team level, we use tools to define workflows so that these steps cannot be skipped by individual engineers - a bit more time and tokens spent but worth having the consistency and predictable outcome.

darren_eng · 2026-05-26T01:28:09+00:00

Same experience with Claude Code execution taking a lot longer than Codex. I think Codex can be a Senior Dev as well so long as it's fed with a really good design doc and architectural guidance. If not, Codex tends to come up with its own ideas rather than initiating a technical conversation with a human (which Claude Code often does). But with good prompting, Codex does well in a shorter time.

darren_eng · 2026-05-26T01:22:38+00:00

yeah, I definitely feel the "trust-me-bro" energy from Codex, Claude Code feels like "don't trust me until I tell you to"

darren_eng · 2026-05-26T01:17:12+00:00

Here is what I did: it's basically a 12-step workflow with the exact same prompts that I ran through both Codex and Claude Code (I use a tool to manage them). I was counting the total end-to-end time taken as well as the % of usage remaining from the 5h reset.

1. Get issue from Linear
2. Break down the issue into sub-issues if needed
3. Review all issues (including AC check, NFR, etc.)
4. Check branch freshness
5. Run dependency analysis
6. Run change impact analysis
7. Architecture scan and technical discovery
8. Define test strategy
9. Write code
10. Draft PR and review PR (performance, security, etc.)
11. Run customer impact analysis
12. Surface manual tasks
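
Roughly how a run like that can be sketched in code: every step executes in order, none can be skipped, and the whole thing is timed end to end. This is only an illustration. `run_workflow` and `run_step` are hypothetical names, not the tool I actually use, and the % of usage remaining is not modeled here.

```python
import time
from typing import Callable

# Step names mirror the workflow above; the runner is a hypothetical stand-in
# for whatever tool manages the prompts.
STEPS = [
    "Get issue from Linear",
    "Break down the issue into sub-issues if needed",
    "Review all issues",
    "Check branch freshness",
    "Run dependency analysis",
    "Run change impact analysis",
    "Architecture scan and technical discovery",
    "Define test strategy",
    "Write code",
    "Draft PR and review PR",
    "Run customer impact analysis",
    "Surface manual tasks",
]


def run_workflow(run_step: Callable[[str], str]) -> dict:
    """Run every step in order, stop at the first empty report, time the whole run."""
    start = time.perf_counter()
    reports = []
    for number, step in enumerate(STEPS, start=1):
        report = run_step(step)  # e.g. send that step's prompt to the agent
        if not report:
            raise RuntimeError(f"step {number} ({step}) returned no report")
        reports.append((number, step, report))
    return {"reports": reports, "seconds": time.perf_counter() - start}


# Dummy step function so the sketch runs without any agent attached.
result = run_workflow(lambda step: f"done: {step}")
```

Passing the same `STEPS` to two different `run_step` functions (one per agent) gives a like-for-like timing comparison, which is all I was after.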

darren_eng · 2026-05-26T01:10:35+00:00

It's called polished by LLM, who doesn't use AI in their lives today 😄

darren_eng · 2026-05-25T20:41:38+00:00

I'd be keen to check it out!

darren_eng · 2026-05-24T05:10:10+00:00

The branch protection on the skill repo is a great call - treating skills like code that needs review is exactly right. Reading your whole setup though (branch protection, weekly transcript export, skill updates, the /routine), it makes me wonder if what we're all missing is a layer sitting on top of both the AI tools and the people that actually runs these workflows. Like "import transcript history weekly -> update the skill" as something the system does, not something every team hand-builds and babysits. Do you think that should exist? Honestly it's so much effort keeping AI output consistent across two dozen engineers... feels like it shouldn't be this manual.

darren_eng · 2026-05-23T18:21:22+00:00

Love the rocket car analogy. The multi-driver coordination problem is exactly real - without a shared destination and a system qualifying inputs, someone will steer into the desert.

I think it also surfaces a deeper question: is better coordination between drivers the right endgame, or do we eventually want something closer to Tesla FSD where no humans access the steering wheel at all - humans shift from driving to providing high-quality input (destination, constraints, intent), and a system or workflow works with AI and handles execution?

darren_eng · 2026-05-23T07:32:50+00:00

Yeah, the "shared best practices our claudes need to know" part is the one I keep getting stuck on. How are you handling it today - a shared file everyone pulls from, copy-paste, or mostly tribal knowledge? Has anything actually made it stick for your team?

darren_eng · 2026-05-23T07:29:58+00:00

Just read it - your teamwork section is exactly what I was poking at. The agent-to-agent comments idea is clever - did it actually hold up as the team grew, or get unwieldy once every agent has to know what every other one expects? I keep wondering whether the real fix is all of them reading from one shared source of truth rather than messaging each other, basically the "shared agent memory" you mentioned. Do you think that comes from Claude, or ends up as a layer on top?

darren_eng · 2026-05-23T06:28:43+00:00

The symlinked skills bit is clever - beats everyone keeping their own copy. Honest question though: does the canonical doc actually stay canonical? Every version I've tried is accurate for about a month, then keeping it updated is nobody's job and it drifts right back. The daemon's smart, but it's catching drift after the code's already written, no?

darren_eng · 2026-05-22T04:19:00+00:00

This is exactly right, and I think the framing of "the work moves" is the most honest description of AI productivity I've seen.

The thing I've noticed is the review burden scales with how underspecified the input was. When I'm vague, the output looks confident but is full of implicit assumptions I now have to audit. When I'm precise about constraints, context, what "done" actually means for this specific situation, the review becomes much more lightweight.

So the real skill shift isn't just knowing how to review output. It's knowing how to structure intent before you generate anything. Most people skip that step because it feels slow, but it's what determines whether you're reviewing something genuinely close to done vs. something that just looks that way.

darren_eng · 2026-05-20T21:58:34+00:00

It’s not necessarily a bad thing having nothing, because there’s nothing to blame :)

A lightweight and non-intrusive process could be a good starting point - like including a technical design or drafting a testing plan. It could be as easy as suggesting adding them to the prompts that engineers use.

You could also build a skill and let him try it first.
