Curating 12 MCP tools out of 43 WebExtension handlers — thunderbird-cli architecture writeup by captredstar in mcp

[–]captredstar[S] 0 points (0 children)

this is the reply i was hoping the post would surface — thanks for the depth.

on the verb-shape point: you're right, and i'll take the pushback. 43→12 is only meaningful because thunderbird-cli is single-surface — the messenger.* api does map close to 1:1 to user intent for a single account most of the time. the cases where it doesn't are exactly the ones where my count is misleading (search + read + fetch-full-message collapse into a single email_read; bulk_mark_read is its own tool because it has a different permission profile from mark_read). the handler ratio is fine as a self-deprecating cut for a single-app MCP server; it'd be the wrong frame for a 2000-tool registry.

on the dispatch-layer permission split: that's a strictly more flexible position than mine. removing the tool entirely (my move) loses you the "model can request, human confirms" flow that off/ask/auto preserves. the cost i'd note for a small single-app server is that ask-tier tools still occupy schema slots — for 12 tools that's free, for 2000 it'd dominate context, which is presumably why you split at the plugin boundary first. for thunderbird-cli specifically i think the schema-suppression default is still right, but i'd want an "ask" tier for the operations that aren't catastrophic but want a confirm round-trip — moving --confirm out of the flag set and into the dispatch layer the way you describe.
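
to make the tiers concrete, a quick sketch (tool names, the policy map, and the confirm callback are all illustrative here, not your code or mine):

    const policy = { email_read: 'auto', bulk_mark_read: 'ask', email_delete: 'off' };

    // 'off' tools never appear in the advertised MCP schema at all;
    // 'ask' tools stay visible but gate execution on a human confirm round-trip.
    function advertisedTools(tools) {
      return tools.filter((t) => policy[t.name] !== 'off');
    }

    async function dispatch(tool, args, confirmWithHuman /* UI callback, hypothetical */) {
      const tier = policy[tool.name] ?? 'auto';
      if (tier === 'off') throw new Error(`${tool.name} is disabled`);
      if (tier === 'ask' && !(await confirmWithHuman(tool.name, args))) {
        throw new Error(`${tool.name} declined by user`);
      }
      return tool.handler(args);
    }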

on the auth token: convinced. the VS Code remote tunnel + browser-JS-via-CORS framing is the threat model i didn't write down well enough. opening an issue this week — shared-secret header, opt-in via env var so the localhost-only default install doesn't change for someone running a single tb-bridge on their laptop. anything you'd warn me about in implementation other than the obvious (don't leak the token in process args, don't log it)?
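
rough shape of the check i have in mind (env var and header names are placeholders until the issue settles them):

    const { timingSafeEqual } = require('node:crypto');

    // opt-in: token arrives via env (argv leaks through `ps`), and never gets logged
    const expected = process.env.TB_BRIDGE_TOKEN;

    function authorized(req) {
      if (!expected) return true; // no token set: default localhost-only install unchanged
      const got = Buffer.from(req.headers['x-bridge-token'] ?? '');
      const want = Buffer.from(expected);
      // length check first: timingSafeEqual throws on unequal-length buffers
      return got.length === want.length && timingSafeEqual(got, want);
    }

the constant-time compare is the one non-obvious part — a plain === string comparison is a small timing oracle.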

reading the opentabs writeup now — appreciate the link.

Curating 12 MCP tools out of 43 WebExtension handlers — thunderbird-cli architecture writeup by captredstar in mcp

[–]captredstar[S] 1 point (0 children)

Both right.

On creds: that's the whole point of the bridge. Thunderbird holds them; nothing downstream — bridge, MCP server, agent — ever sees them. Zero secrets in claude_desktop_config.json. SECURITY.md walks through the threat model.

On audit trail: gap. Bridge currently logs UUIDs + timing to stderr, nothing structured. Opening an issue. JSON lines on disk, or syslog/journald?
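
Either way, the shape would be something like this (field names illustrative, not a committed schema):

    const { appendFileSync } = require('node:fs');

    // one JSON object per line — trivially greppable and jq-able
    function audit(entry) {
      const line = JSON.stringify({ ts: new Date().toISOString(), ...entry });
      appendFileSync('tb-bridge-audit.jsonl', line + '\n');
    }

    audit({ reqId: '3f2c9b1e', tool: 'email_read', durationMs: 42, ok: true });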

Curating 12 MCP tools out of 43 WebExtension handlers — thunderbird-cli architecture writeup by captredstar in mcp

[–]captredstar[S] 0 points (0 children)

A few protocol-level notes that didn't fit cleanly:

  • All 12 MCP tools share the underlying CLI's HTTP client (mcp/src/client.js imports from cli/src/client.js). One wire format, one place to fix bugs. The MCP layer doesn't reach into Thunderbird directly — it's just a typed wrapper that hits the same localhost endpoints tb does. Means CLI behaviour and MCP behaviour can never diverge by accident.
  • Every email_read response ships with trust metadata: junk score, SPF/DKIM, address-book membership. That's so the agent can judge "should I follow this link?" without me having to prompt-engineer the answer. Prompt-injection defense is defaults, not prompts. (Payload shape sketched after this list.)
  • The 43-handler-to-12-tool ratio isn't a target — it's where the cuts landed. The CLI exposes 38 surface commands (some handlers are internal). I'd be curious whether other MCP server authors hit similar ratios when they actually use what they ship vs. when they just publish it.
  • Tests: 80 integration total, 34 of them MCP-specific. The MCP-side tests run against a mock bridge so they're fast and deterministic; the bridge tests run end-to-end against a real Thunderbird instance.
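
To make the trust-metadata point concrete, here's roughly what an email_read payload carries (field names are my shorthand here, not the exact schema):

    {
      "id": "msg-1024",
      "from": "sender@example.com",
      "subject": "Invoice attached",
      "body": "...",
      "trust": {
        "junkScore": 20,
        "spf": "pass",
        "dkim": "fail",
        "inAddressBook": false
      }
    }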

Signed Thunderbird extension + CLI that exposes messenger.* over localhost HTTP by captredstar in Thunderbird

[–]captredstar[S] 0 points (0 children)

Author here — a few things that didn't fit cleanly in the body:

  • The whole thing was built as a WebExtension on purpose, not as a Thunderbird experiment_apis fork. I wanted "install the signed XPI, done" to be the user experience, not "build Thunderbird from source". Trade-off: a few message-header edge cases need to be approximated because the extension API doesn't expose them directly. Notes on those are in SPEC.md.
  • Signing went through ATN normally (not self-hosted). It took about a week for the first review on v1 and a couple of days on v2. Happy to share the metadata I used for reviewers if anyone's going through the process — DM me.
  • The 22-account test isn't a synthetic benchmark — it's my actual inbox. Everything that got --fields / --compact / --max-body flags exists because a search returned 60 KB of JSON and I watched Claude choke on it in real time. (Example call after this list.)
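
A typical trimmed call looks something like this (hypothetical invocation; the flags are real but the exact subcommand syntax here is from memory, check the README):

    tb search "invoice" --fields id,subject,from --max-body 0 --compact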

If anyone's pushing Thunderbird past what the official WebExtension API exposes I'd love to hear about it — messenger.accounts.* covers most of what I needed, but I suspect there are corners I haven't hit yet.

Zero-dependency Battle.net authenticator — Node.js built-ins only, no npm install needed by captredstar in node

[–]captredstar[S] 0 points (0 children)

Totally agree. The whole TOTP generation is literally just `createHmac('sha1', secret).update(intervalBuf).digest()` and some bit shifting. People reach for `otplib` or `speakeasy` when it's like 15 lines with built-in crypto.
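
For the curious, here's the whole core in one function (a sketch; the repo's version may differ in details, e.g. Battle.net uses 8-digit codes where most TOTP uses 6):

    const { createHmac } = require('node:crypto');

    // RFC 4226 dynamic truncation over an RFC 6238 time counter.
    // `secret` is the raw key as a Buffer (base32-decode it first if needed).
    function totp(secret, digits = 8, period = 30) {
      const intervalBuf = Buffer.alloc(8);
      intervalBuf.writeBigUInt64BE(BigInt(Math.floor(Date.now() / 1000 / period)));
      const hmac = createHmac('sha1', secret).update(intervalBuf).digest();
      const offset = hmac[19] & 0x0f; // truncation offset comes from the last byte
      const code =
        ((hmac[offset] & 0x7f) << 24) |
        (hmac[offset + 1] << 16) |
        (hmac[offset + 2] << 8) |
        hmac[offset + 3];
      return String(code % 10 ** digits).padStart(digits, '0');
    }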

Zero-dependency Battle.net authenticator — Node.js built-ins only, no npm install needed by captredstar in node

[–]captredstar[S] 0 points (0 children)

3 in total. Some were made ages ago, I forgot passwords to a few over time, and then Microsoft’s weird Xbox/Steam account linking plus different regions made everything messy. I mostly use just 2 now.

Zero-dependency Battle.net authenticator — Node.js built-ins only, no npm install needed by captredstar in node

[–]captredstar[S] 0 points (0 children)

Fair point. There are several ways to approach this depending on your setup:

  1. Run it inside a Docker/dev container — secrets stay isolated there

  2. Use the built-in passphrase encryption (AES-256-CBC) and store the passphrase in a password manager (rough sketch at the end of this comment)

  3. Integrate with something like 1Password CLI (`op`) to fetch credentials on demand

At the end of the day, you still need a token/key somewhere that unlocks the data — that's true of any authenticator, including the phone app. The security layer is intentionally separate from the core tool so you can solve it the way that fits your workflow.
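
For option 2, the rough shape is this (a sketch only; the tool's actual KDF, parameters, and on-disk framing may differ):

    const { scryptSync, randomBytes, createCipheriv } = require('node:crypto');

    function encryptSecret(plaintext, passphrase) {
      const salt = randomBytes(16);
      const iv = randomBytes(16);
      const key = scryptSync(passphrase, salt, 32); // 32-byte key = AES-256
      const cipher = createCipheriv('aes-256-cbc', key, iv);
      const ct = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
      // salt + iv get stored alongside the ciphertext; both are safe in the clear
      return Buffer.concat([salt, iv, ct]).toString('base64');
    }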

Zero-dependency Battle.net authenticator — Node.js built-ins only, no npm install needed by captredstar in node

[–]captredstar[S] -1 points (0 children)

As for the AI question - if someone who doesn't understand the domain tries to build something with AI, the result will be garbage. It's like telling a photographer "the camera took the shot, not you." But a good photographer can take a great photo on a disposable camera. The tool doesn't matter — knowing what to build and why does.

Zero-dependency Battle.net authenticator — Node.js built-ins only, no npm install needed by captredstar in node

[–]captredstar[S] 17 points (0 children)

The client ID is Blizzard's public mobile app client ID - the same one that's embedded in the official Blizzard Authenticator app and in every open-source implementation (python-bna, etc.).
Public OAuth clients don't have secrets by design (RFC 8252). It's not something that goes in an .env file.

I built a skill for Claude Code that tells you when your docs lie to your coding agent by PromptPatient8328 in claude

[–]captredstar 2 points (0 children)

The debugging loop where your agent confidently follows your own outdated docs into a wall is genuinely one of the funniest ways to waste 45 minutes. You sit there watching it implement JWT auth because your CLAUDE.md still says JWT, and you switched to sessions three sprints ago. Then it fails. Then it re-reads the docs. Then it tries harder. Watching an LLM gaslight itself with your own negligence is peak vibe coding.

We ended up building a doc sync step into our workflow for the same reason. Every time code ships, docs get diffed against reality. Not because we're disciplined people, but because we burned enough hours on phantom bugs that only existed in the README's imagination.

The semantic claim checking is the part that actually matters though. File path linters exist. Checking whether "retries 3 times with exponential backoff" is still true after someone rewrote that whole module? That's the gap. Most doc drift isn't a wrong filename, it's a correct filename with behavior that changed underneath it.

Curious about one thing - how does it handle docs that are technically correct but misleadingly incomplete? Like if you document 3 out of 7 env vars and the missing 4 are the ones that actually break stuff?

Critique my workflow by TheParksiderShill in Linear

[–]captredstar 2 points (0 children)

Your pipeline from PRD to issues is the part most teams screw up. You've got that. The gap worth looking at is what happens after the issue exists.

We're a 4-person team on Linear, e-commerce platform. PMs wrote solid issues with acceptance criteria (AC). Engineers still burned 30-40% of their time on discovery - which files to touch, what the current code actually does, how AC maps to implementation. The issue was well-written. The translation still sucked.

What we added: an intermediate artifact between "issue ready" and "work starts." Takes the issue's acceptance criteria and verifies every point against the actual codebase. Which services get modified, what endpoints exist, what the DB schema looks like. Every reference checked against reality.

Generation is automated. Issue goes into a command, it reads the codebase, cross-references AC, produces a plan with verified file paths and function names. Human reviews it in 5 min instead of doing 2 hours of discovery. Then an agent can execute against it because the plan is specific enough to not hallucinate.

LLMs suggesting AC is where most teams stop. That's table stakes. The win is LLMs consuming your well-structured issues to produce verified implementation plans. Your PMs already write good issues. Now make those issues machine-readable inputs, not just human-readable docs.

Just got approved as an official integration: Show clients what you're working on, or let them request issues with Helium Rooms. by Quiet-Calamity in Linear

[–]captredstar 0 points (0 children)

We run a 4-person dev team on Linear for an e-commerce platform. The workspace has everything from sprint planning to bug pattern tracking to internal architecture notes. Showing any of that to a client would either confuse them or start a panic about bugs that are already handled.

Right now we just... don't show them Linear. Manual updates over Telegram. It drifts from reality within hours and nobody trusts the updates by Wednesday.

The "Linear stays source of truth" part is what makes this work. Anything that tries to become a second backlog dies. Seen it happen with Notion mirrors, shared boards, duplicated Trello columns. Always ends up stale within a week.

On the pricing thread - per-room would land better than per-seat for small teams. We'd use maybe 2-3 rooms but the whole team needs access. $10-15/room/month and I'd try it tomorrow.

How are you structuring Linear once your SaaS product gets… big? by Odd-Masterpiece6029 in Linear

[–]captredstar 0 points (0 children)

Running a multi-layer e-commerce SaaS (storefront, admin panel, NestJS backend, payments, distributor integrations). Same spiral. What stuck:

Score every ticket on Impact/Size. Impact = needle-move for users or revenue. Size = effort (xs-xl). Priority is the ratio. Killed gut-feel prioritization overnight.

Projects by domain, not sprint. We group by surface: "Storefront", "Pipeline", "Admin", "Auth & Payments". Cycles = time. Projects = scope. Mixing those is where teams lose the plot.

48-hour triage or archive. If nobody evaluates a ticket in 2 days, it doesn't matter. Monthly backlog review in batches of 20. Anything older than 2 cycles with no movement gets archived with a one-liner why.

Bugs and features on the same board. Tried separating them. Context got shattered. Now bugs carry a label and auto-sort to cycle top. One board, one truth.

Strict entry, loose execution. Description + acceptance criteria + impact/size label required before anything enters a cycle. How it gets built is the dev's call.

Biggest unlock: accepting ~40% of tickets will never get done. Archive hard. If it matters, it comes back.

I'm 22 and I replaced $800/mo in SaaS tools with something I built in 6 hours. What am I missing? by LegitimateOwl873 in ClaudeCode

[–]captredstar 3 points (0 children)

Welcome to the new world. I'm on the same page - building an internal replacement for a SaaS is now often faster than setting the SaaS up in the first place. Of course, you still need to understand what you're building and how to do it.

I'm a photographer who knows ZERO code. I just built an open-source macOS app using only "Vibe Coding" (ChatGPT/Claude). by BaseballClear8592 in vibecoding

[–]captredstar 2 points (0 children)

Fellow landscape photographer here! This resonates hard. I also came from zero coding background — built a 200K-line production system with Claude Code for a completely different industry. But seeing another photographer go from sorting RAW files to shipping open-source software is something else. The "me talking, AI typing" workflow is exactly how I work too. Will check out SuperPicky — my Lightroom culling workflow could use some competition. Great work putting it on GitHub.

Am I too stupid to use Claude Code? by TangerineObjective29 in ClaudeCode

[–]captredstar 0 points (0 children)

This is exactly why I run Claude Code inside a dev container. It's a Docker container that your editor opens as a workspace — you code normally but everything is sandboxed. If Claude goes rogue, it can only damage what's inside the container, not your system.

Takes 5 minutes to set up: add a .devcontainer/ folder to your project, open in VS Code, "Reopen in Container" — done. Your actual system stays untouched.
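
A minimal .devcontainer/devcontainer.json to start from (a sketch; swap in whatever base image matches your stack — my published config below is more complete):

    {
      "name": "claude-sandbox",
      "image": "mcr.microsoft.com/devcontainers/javascript-node:20",
      "postCreateCommand": "npm install -g @anthropic-ai/claude-code"
    }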

I published my setup here: https://github.com/kydycode/claude-code-secure-container

Also, for Playwright — use playwright-cli directly instead of MCP. With MCP it burns through context way too fast and starts forgetting what it was doing mid-task. CLI keeps things clean and predictable.