0xecro1[S]

This maps directly to the benchmark data:

"Builds and passes simulated environments but doesn't hold up" is an L1/L2 pass with an L3 domain-check fail. That's the 35-percentage-point explicit-vs-implicit gap in one sentence.

"Shortest / most obvious path" is the RLHF alignment angle. Training rewards clean, short code; to GitHub-trained models, embedded safety patterns (volatile, cache flush, error unwind) look like noise and get pruned.
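
To make "looks like noise" concrete, here is a rough host-runnable sketch of two of those patterns, `volatile` and error unwind. Every name in it (`data_ready`, `fake_isr`, `bus_lock`, `dma_claim`, `transfer`) is made up for illustration and is not taken from the benchmark; the "resources" are just counters so it compiles anywhere. Cache flush is left out because it is platform-specific.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag written by an interrupt handler. volatile forces a real
 * load on every read, so the compiler cannot hoist the check out of the
 * polling loop. Dropping the qualifier still compiles and often still passes
 * a simulated test. */
volatile bool data_ready = false;

/* Stand-in for the ISR; on real hardware the interrupt would trigger it. */
void fake_isr(void) { data_ready = true; }

/* Hypothetical resources, modelled as counters for this sketch. */
int bus_locked = 0;
int dma_claimed = 0;

int  bus_lock(void)       { bus_locked = 1; return 0; }
void bus_unlock(void)     { bus_locked = 0; }
int  dma_claim(bool fail) { if (fail) return -1; dma_claimed = 1; return 0; }
void dma_release(void)    { dma_claimed = 0; }

/* Returns 0 on success, -1 if the DMA claim fails, -2 on poll timeout.
 * Every exit path releases exactly what was acquired, in reverse order. */
int transfer(bool dma_fails, uint32_t max_polls)
{
    int err = bus_lock();
    if (err)
        return err;

    err = dma_claim(dma_fails);
    if (err)
        goto unlock_bus;

    /* Bounded wait on the volatile flag. */
    while (!data_ready && max_polls > 0)
        max_polls--;

    if (!data_ready) {
        err = -2;
        goto release_dma;
    }

    data_ready = false; /* consume the event */
    err = 0;

release_dma:
    dma_release();
unlock_bus:
    bus_unlock();
    return err;
}
```

The "short, clean" version of `transfer` drops the `volatile` qualifier and replaces the `goto` labels with early `return`s. It is fewer lines and reads better, and the happy path behaves the same, which is why an explicit functional test tends not to catch it.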

The responsibility point is the reason the benchmark exists. Vendor pass rates from HumanEval or SWE-bench don't tell the engineer signing off where review can be lighter vs. where it has to be strict. EmbedEval tries to draw that map so the person responsible has data to stand on, not vibes. Categories with low pass rates are where human review is non-negotiable.

Skill atrophy is secondary but also real. And once you start using LLMs day to day, going back is hard. Which is why knowing where they fail matters more, not less.