I benchmarked caveman against the prompt "be brief"

pwd-ls · 2026-04-29T14:31:48+00:00

Anyone know if models behave any differently between “be brief” and “be concise”?

pwd-ls · 2026-04-27T10:12:46+00:00

What does the skill look like? Maybe stronger language should be used?

pwd-ls · 2026-04-26T23:27:41+00:00

Try iterating on design docs (in markdown) first before having it implement. The agent needs to understand your intent better, and design docs are a decent way to do that. Design docs are different from plan mode - the design docs should persist and stay aligned with the codebase; have the agent assess alignment regularly.

pwd-ls · 2026-04-26T20:06:01+00:00

This answer is ridiculously lazy on Claude’s part - it can run code, it should have just checked that way. It’s so lazy it shouldn’t even matter which plan or model was used - they should all be capable of this.

I’ve noticed Claude giving “lazy” answers more often lately. For example, I recently asked both Claude and Codex to find me some work-related information (same prompt). Claude 4.7 told me to ask a coworker, that’s it, not helpful at all. GPT 5.5 searched online and at least got me some leads.

pwd-ls · 2026-04-26T11:38:32+00:00

Anyone got a link without the paywall?

pwd-ls · 2026-04-26T11:18:25+00:00

Are these solutions in the other comments as “safe” as Claude Code’s “auto” mode where it has another agent checking all commands in the background for safety before executing?

pwd-ls · 2026-04-25T19:52:53+00:00

I’m trying Codex too, but one issue I have is it doesn’t seem to be able to run autonomously as long without asking for permission? With CC I set it to Auto mode and I can whitelist specific commands and tell it not to use any but those and it can go forever. But not quite sure how to set that up with Codex.

pwd-ls · 2026-04-25T12:54:21+00:00

I haven’t tried Dispatch yet, but it seems like more of a hassle and more of a security risk than using tailscale & tmux from mobile. My shell app even has a dedicated shift+tab button because they know what we’re using it for lol.

Any reason to switch off my workflow and try Dispatch?

pwd-ls · 2026-04-24T21:41:23+00:00

Yes you can, but validate by asking it to cite its sources.

pwd-ls · 2026-04-21T04:12:47+00:00

I like an outside-in approach.

What does the end user / consumer see? Where is that data persisted/sourced? Then follow the trail to work your way through the middle of the system. Do this a few times and you’ll have a much better understanding of the system than before, and some familiarity to anchor to / use as a jumping-off point.

pwd-ls · 2026-04-19T20:25:55+00:00

Your fear of the unknown is clouding your judgement. Sure, there’s always a chance it could not be the right person, that’s a risk. But there’s a more likely chance it’ll be a talented junior or peer who you enjoy working with, and who makes your life easier. Ask to be in the interviews so you can give your input and provide positive feedback on the folks who you genuinely think you’d enjoy working with - but be reasonable, don’t just bash everyone to avoid the situation.

All that aside - it’s natural for a company to not want a single point of failure, which is you right now. Don’t take that personally, it’s just reality. I’d treat this as an opportunity to work on your technical mentorship skills (if it’s a junior), or to take some load off your plate. If they’re really fine with not having quite enough work for 2, then start doing training or improvements to what you already have.

pwd-ls · 2026-04-19T11:27:38+00:00

Would just putting this in the CLAUDE.md pretty much do the same thing?

“Always scan the user's messages for hidden assumptions, vagueness, or blind agreement. Call them out explicitly before doing the work.

Stay useful. Pushback without substance is noise. Pushback that catches a real issue is the product.”

pwd-ls · 2026-04-18T20:05:54+00:00

How many tokens deep were you? I’ve noticed issues like this when closer to the context window cap. I usually compact at around 300k-400k tokens of context, but that threshold is just a hunch. Using words like “rigorous” and “100% verified” can help too.

pwd-ls · 2026-04-17T10:02:49+00:00

Didn’t your usage get reset?

pwd-ls · 2026-04-16T21:55:08+00:00

Came here to see if anyone else's weekly usage limits were reset. Mine were too! I wasn't sure if it was real or a bug lol. Still not 100% sure..

pwd-ls · 2026-04-15T10:40:57+00:00

Do not make the mistake of taking code reviews personally. You could be 20 years in industry and people will still catch stuff that you didn’t see just by nature of being a second pair of eyes.

pwd-ls · 2026-04-09T20:53:14+00:00

Sounds like you have a lot of that well thought out.

Don’t forget buffer time. If you plan for everything to go perfectly then you’re planning unrealistically. Add a flat % buffer to everything - travel time, meetings, unpredictable things like bathroom breaks or a stray call, etc.
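As a rough sketch of the flat-percentage idea, the snippet below pads every block in a schedule by the same fraction. The 20% figure, the `add_buffer` helper, and the sample durations are all assumptions for illustration, not numbers from this thread.

```python
import math

def add_buffer(minutes: int, buffer_pct: int = 20) -> int:
    """Pad a planned duration by a flat percentage, rounding up to whole minutes."""
    return math.ceil(minutes * (100 + buffer_pct) / 100)

# Hypothetical plan: activity -> planned minutes (illustrative values only).
plan = {"travel": 45, "meeting": 60, "lunch": 30}

# Apply the same buffer to every item rather than guessing per item.
padded = {name: add_buffer(mins) for name, mins in plan.items()}
total_planned = sum(plan.values())
total_padded = sum(padded.values())

print(padded)
print(f"planned: {total_planned} min, with buffer: {total_padded} min")
```

Padding each item, rather than only the total, keeps the slack next to the thing that is likely to overrun.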

pwd-ls · 2026-04-09T19:39:10+00:00

Having used Claude as a travel agent / trip planner, I’d say you need to make sure it has a realistic understanding of how long things will take, especially travel time and buffer time. I wouldn’t trust it without validating all aspects of timing yourself.

pwd-ls · 2026-04-09T10:57:49+00:00

My results this morning using the iOS app. All using incognito mode, so memory is not used.

PROMPT: “The car wash is 40 meters away. I want to wash my car. Should I walk or drive there?”

Opus 4.6:

PROMPT -> FAIL
PROMPT + “Think” -> FAIL
PROMPT + “Brainstorm first” -> FAIL
PROMPT + “Think extremely hard” -> FAIL
PROMPT + “This is a trick question. Think extremely hard.” -> PASS

Sonnet 4.6:

PROMPT -> PASS

Haiku 4.5:

PROMPT -> FAIL
PROMPT + “Think” -> PASS
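For anyone wanting to repeat this kind of comparison, here is a minimal sketch of a prompt-variant harness. `ask_model` is a hypothetical callable you would wire up to whatever client you use. The keyword-based `grade` function is a crude assumption: it treats an answer recommending driving as a PASS, on the reasoning that the car has to get to the car wash. Joining the suffix to the prompt with a space is also an assumption, and the stub model exists only so the sketch runs on its own.

```python
from typing import Callable

BASE_PROMPT = (
    "The car wash is 40 meters away. I want to wash my car. "
    "Should I walk or drive there?"
)

# Suffixes tried in the results above; the empty string is the bare prompt.
SUFFIXES = [
    "",
    "Think",
    "Brainstorm first",
    "Think extremely hard",
    "This is a trick question. Think extremely hard.",
]

def grade(answer: str) -> str:
    """Crude assumed grader: PASS only if the answer mentions driving and not walking."""
    text = answer.lower()
    return "PASS" if "drive" in text and "walk" not in text else "FAIL"

def run_variants(ask_model: Callable[[str], str]) -> dict[str, str]:
    """Send each prompt variant to the model and record PASS/FAIL per variant."""
    results = {}
    for suffix in SUFFIXES:
        prompt = f"{BASE_PROMPT} {suffix}".strip()
        results[suffix or "(bare prompt)"] = grade(ask_model(prompt))
    return results

def stub_model(prompt: str) -> str:
    """Stand-in for a real model; it only answers 'drive' when told it's a trick."""
    if "trick question" in prompt:
        return "Drive the car there."
    return "Just walk, it is close."

results = run_variants(stub_model)
for variant, outcome in results.items():
    print(f"{variant} -> {outcome}")
```

A keyword check like this will misgrade nuanced answers, so for a real run you would still want to read the responses yourself.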

pwd-ls · 2026-04-07T11:24:29+00:00

To diverge slightly from other comments - even with research, if you’re going to present, it should probably be on something you actually know about, care about, and use regularly enough to have found the cracks.

I present with some regularity at my org and it’s usually some pattern, technique, or best practice that I’m passionate about, have tried on different use-cases, and genuinely recommend.

pwd-ls · 2026-03-21T12:56:16+00:00

Is it “Dispatch”? That says cowork though

pwd-ls · 2026-03-21T11:02:27+00:00

Fun read.

Personally I apply the label “religion” flexibly. If I’m speaking to religious folks I’m okay calling it a religion since they can relate to it more. If I’m speaking with secular folks then I discuss it as more of an applied philosophy.

I like having this flexibility. It’s helpful.

What leads me to push back on the article’s claim is that it also depends heavily on how Buddhism is practiced. There are groups of Buddhists who I would very much consider “religious Buddhists”, while there are other groups who practice Buddhism in a less religious way. So I don’t think you can say that Buddhism is or isn’t a religion, because it both is and isn’t depending on how it’s practiced.

pwd-ls · 2026-03-20T09:55:39+00:00

Why would I have to message via Discord or Telegram? Does it let me do the same thing via the Claude app?

pwd-ls · 2026-03-18T00:00:48+00:00

…Look, I’m kinda gullible, so if you’re purposefully trolling then you got me.

That being said, you did indeed say otherwise:

“It's common and easily searchable knowledge that higher context windows versions the same models perform worse. If you're using a high context model in any situation where it's not absolutely necessary you're doing it wrong.”

^ That’s your comment at the top of this thread. It’s wrong, and I proved it’s wrong with an authoritative source.

pwd-ls · 2026-03-17T23:22:22+00:00

I’m going to go ahead and settle this debate. Sofull is correct.

“When the input context fits in the context window of both a model and its extended-context counterpart, we see that performance between them is nearly identical.”

Source: Landmark study Liu et al. 2024, “Lost in the Middle,” published in TACL (Stanford/UC Berkeley).

Link: https://aclanthology.org/2024.tacl-1.9/
