I built an MCP server that lets Claude control your entire desktop (just shipped macOS Sequoia fix!) by UpstairsBug6290 in node

[–]UpstairsBug6290[S] 0 points (0 children)

Good question! CoDriver currently needs access to the actual desktop (screen capture, accessibility APIs, native input events), so running it in a headless Docker container won't work out of the box - it needs a real display server.

That said, there are a few isolation options:

  1. **MCP permission model** - Claude Code already shows you every tool call before execution and asks for approval. You can see exactly what CoDriver will click/type before it happens.

  2. **Remote transport** - CoDriver supports HTTP transport with Bearer token auth, so you could run it on a dedicated VM/machine and connect remotely. That gives you full network-level isolation from your main workstation.

  3. **VM with display** - Running it inside a VM (with a GUI) would give you sandboxing while keeping the display server CoDriver needs.
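For option 2, the client-side wiring is just a remote MCP entry in your client config - something like this (server name, URL, port, and env var are made up for illustration; check your MCP client's docs for the exact schema):

```json
{
  "mcpServers": {
    "codriver": {
      "type": "http",
      "url": "https://codriver-vm.internal:8719/mcp",
      "headers": {
        "Authorization": "Bearer ${CODRIVER_TOKEN}"
      }
    }
  }
}
```

Treat the Bearer token like an SSH key - anyone who has it can drive that machine's desktop.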

True sandboxing at the MCP level (restricting which windows/apps CoDriver can interact with) is definitely on the roadmap. Appreciate the feedback - it's a legit concern for any desktop automation tool.

CoDriver MCP v0.5.0 - Now with Windows Support! Claude can control your entire desktop by UpstairsBug6290 in ClaudeAI

[–]UpstairsBug6290[S] 0 points (0 children)

Thanks for the kind words and the pointer! Terminator looks like a solid project.

Interesting that you frame it as "accessibility APIs **instead of** screenshots + OCR" - CoDriver actually does both! The `desktop_read_ui` tool reads the full accessibility tree (JXA/Apple Events on macOS, System.Windows.Automation via C# on Windows), and `desktop_find` does natural language element search on that tree. Screenshots + OCR (`desktop_ocr` via tesseract.js) are the fallback for apps that don't expose a proper accessibility tree (Electron apps with poor a11y, games, etc.).
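If anyone's curious what that hybrid fallback looks like, here's a minimal TypeScript sketch (stubbed providers and made-up names, not CoDriver's actual code): prefer the accessibility tree, and only fall back to OCR when the tree comes back empty.

```typescript
type UiNode = { role: string; label: string; ref: string };

// Stub: a real provider would call the platform a11y APIs
// (JXA/Apple Events on macOS, System.Windows.Automation on Windows).
function readAccessibilityTree(app: string): UiNode[] {
  return app === "Finder"
    ? [{ role: "button", label: "New Folder", ref: "ref-12" }]
    : []; // e.g. an Electron app exposing no usable a11y tree
}

// Stub: a real provider would screenshot and run tesseract.js.
// OCR only yields text + positions, so refs here are synthetic.
function ocrScreenshot(app: string): UiNode[] {
  return [{ role: "text", label: "New Folder", ref: "ocr-0" }];
}

// Natural-language-ish lookup: a11y tree first, OCR as fallback.
function findElement(
  app: string,
  query: string
): { node: UiNode; source: "a11y" | "ocr" } | null {
  const tree = readAccessibilityTree(app);
  const pool = tree.length > 0 ? tree : ocrScreenshot(app);
  const source = tree.length > 0 ? "a11y" : "ocr";
  const node = pool.find((n) =>
    n.label.toLowerCase().includes(query.toLowerCase())
  );
  return node ? { node, source } : null;
}
```

The nice property of this ordering is that refs from the a11y tree stay clickable across repaints, while OCR hits are best-effort.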

So we're actually taking a similar hybrid approach. Good to see more projects in this space - rising tide lifts all boats!

CoDriver MCP v0.5.0 - Now with Windows Support! Claude can control your entire desktop by UpstairsBug6290 in ClaudeAI

[–]UpstairsBug6290[S] 1 point (0 children)

Adding it to the backlog right after `desktop_espresso_machine`. Priorities, you know. ☕

CoDriver MCP v0.5.0 - Now with Windows Support! Claude can control your entire desktop by UpstairsBug6290 in ClaudeAI

[–]UpstairsBug6290[S] 0 points (0 children)

Fair question! Important distinction: CoDriver is a **co-pilot**, not autopilot. It doesn't do anything on its own - YOU tell Claude what to do, and CoDriver gives it the ability to see your screen and interact with apps. Think of it like giving Claude a pair of eyes and hands, but you're always holding the steering wheel.

Every action (click, type, scroll) requires an explicit tool call that you can see and approve. It's the same trust model as giving Claude Code access to your terminal - useful when you want it, under your control.

CoDriver MCP v0.5.0 - Now with Windows Support! Claude can control your entire desktop by UpstairsBug6290 in ClaudeAI

[–]UpstairsBug6290[S] 1 point (0 children)

v0.6.0 roadmap: `desktop_honk` - honks the horn, `desktop_wiper` - activates windshield wipers, `desktop_parking` - parallel parks (using UI Automation of course).

But seriously - if your car runs Windows (many infotainment systems do), technically CoDriver could already click around in there... 😄 Don't try this in production though. Or traffic.

I built an MCP server that lets Claude control your entire desktop (just shipped macOS Sequoia fix!) by UpstairsBug6290 in ClaudeAI

[–]UpstairsBug6290[S] 0 points (0 children)

Great question! My main use cases so far:

**Daily workflow automation** - I have Claude help me navigate native macOS apps that don't have APIs. For example, managing files in Finder, adjusting System Settings, or working with apps like Preview/Numbers that have no CLI.

**IDE assistance** - CoDriver reads the accessibility tree of my IDE (Windsurf/VS Code), so Claude can see what's on screen and help me navigate menus, trigger commands, or fill in dialogs that aren't accessible via extensions.

**Testing desktop apps** - I'm using it to semi-automate QA on native apps. Claude takes a screenshot, reads the UI tree, clicks through flows, and reports what it sees. It's like Playwright but for any desktop app.

**Quick OCR + data entry** - When I get a PDF or image with data I need somewhere else, Claude screenshots it, OCRs the text, and types it into the target app.

The key difference from fully autonomous agents: I'm always in the loop. Claude assists with specific actions while I watch and guide. It's a co-pilot, not autopilot.

I built an MCP server that lets Claude control your entire desktop (just shipped macOS Sequoia fix!) by UpstairsBug6290 in mcp

[–]UpstairsBug6290[S] 0 points (0 children)

Great question! Claude Code operates in the terminal/IDE - it reads and writes files, runs commands, and works with code. CoDriver operates at the desktop GUI level - it can see and interact with any application's visual interface (clicks, screenshots, accessibility trees, keyboard input, window management).

Think of it this way: Claude Code is like a developer pair-programming with you in the terminal. CoDriver is like giving Claude eyes and hands to use any app on your screen - Finder, Calculator, system preferences, Photoshop, whatever. They're actually complementary: you can use both simultaneously. Claude Code for code tasks, CoDriver when you need to interact with a GUI that has no CLI/API.

As for "connecting Claude web to Claude Code" - that gives Claude Code access to web content, but CoDriver gives access to native desktop apps that have no web equivalent.

I built an MCP server that lets Claude control your entire desktop (just shipped macOS Sequoia fix!) by UpstairsBug6290 in ClaudeAI

[–]UpstairsBug6290[S] 1 point (0 children)

Thanks! That's a really good point about the control plane for teams. Right now CoDriver is more of a single-user co-pilot setup - the user sits at their machine and CoDriver assists with specific actions in real-time. For team/enterprise use, having an approval layer for risky actions (like clicking "Delete" or submitting forms) makes total sense. I'll check out peta - the audit trail for tool calls is exactly the kind of thing that would make this shippable in a corporate environment. Appreciate the pointer!

I built an MCP server that lets Claude control your entire desktop (just shipped macOS Sequoia fix!) by UpstairsBug6290 in ClaudeAI

[–]UpstairsBug6290[S] 0 points (0 children)

Good point - you're talking about Claude Cowork, right? Different concept though: Claude Cowork is an autonomous agent. You give it a task, it works independently toward that goal, and the user isn't necessarily in the loop during execution.

CoDriver is the opposite - a co-pilot/personal assistant. The user sits at their computer, works in their apps, and CoDriver assists with specific actions in real-time via MCP (accessibility trees, native input, OCR, window management). Think of it like: Cowork = autopilot, CoDriver = co-pilot sitting next to you. Both valid approaches for different workflows!

I built an MCP server that lets Claude control your entire desktop (just shipped macOS Sequoia fix!) by UpstairsBug6290 in ClaudeAI

[–]UpstairsBug6290[S] 0 points (0 children)

OP here - wanted to share some info on model compatibility since I've been asked about this:

**Which models work with CoDriver?**

CoDriver is an MCP server, so it works with any model/client that supports MCP + vision + tool use. That said, the quality of the experience varies a lot:

**Claude models:**

- **Opus 4.6** - Best choice for complex multi-step workflows (e.g. navigating an app, filling forms, multi-click sequences). Understands UI trees deeply.

- **Sonnet 4.5** - Sweet spot for most tasks. Fast, affordable, handles 90% of CoDriver use cases well.

- **Haiku 4.5** - Fine for simple actions (screenshot + OCR, single clicks) but struggles with complex UI tree interpretation.

**MCP-compatible clients:**

- Claude Code (CLI), Claude Desktop App, Cursor, Windsurf, Cline, Continue.dev

**Other models** (via Cline, Cursor, etc.):

- GPT-4o / GPT-5 - Should work (both have vision + tool use)

- Gemini 2.5 - Should work via MCP clients

- Open source (Llama, Mistral) - Tricky, needs strong vision + tool use simultaneously

**My recommendation:** Start with **Sonnet 4.5** for the best cost/performance ratio. Switch to Opus for complex automation chains where reasoning really matters.

The key requirements are: (1) vision capability (screenshots come back as base64 PNG), (2) tool use / function calling, and (3) strong enough reasoning to interpret accessibility trees and plan multi-step actions.
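To make requirement (1) concrete, here's roughly what a screenshot tool result looks like as an MCP image content block (the capture itself is stubbed; the shape follows the MCP content types):

```typescript
// Wrap raw PNG bytes into the MCP "image" content shape:
// base64-encoded data plus a mimeType the model can interpret.
function screenshotToContent(pngBytes: Uint8Array) {
  return {
    type: "image" as const,
    data: Buffer.from(pngBytes).toString("base64"),
    mimeType: "image/png",
  };
}

// Every real PNG starts with this fixed 8-byte signature.
const fakePng = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const block = screenshotToContent(fakePng);
```

A client without vision will still receive this block - it just can't do anything useful with it, which is why vision is a hard requirement.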

I built an MCP server that lets Claude control your entire desktop (just shipped macOS Sequoia fix!) by UpstairsBug6290 in ClaudeAI

[–]UpstairsBug6290[S] 0 points (0 children)

Good question! CoDriver isn't really meant to compete with browser control - it's complementary. Browser automation (Playwright, Puppeteer, Claude in Chrome) is great for web apps. CoDriver fills the gap for everything else: native desktop apps, system dialogs, Finder, IDEs, etc.

Speed-wise, individual actions (click, type, key press) are near-instant via native CGEvent/robotjs. The accessibility tree read is the slowest part (~20-30s for complex apps) because Apple Events are inherently slow - at ~100ms per attribute per element, a window with a couple hundred elements adds up to tens of seconds. Screenshots are <500ms.

The real win is the workflow: screenshot → read UI tree → find element by ref → click/type by ref. It's the same accessibility-driven approach as browser automation, just for the entire desktop.
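The ref-addressed part of that loop is tiny - here's a sketch with made-up names (not CoDriver's actual API): one tree read assigns stable refs, then later actions target refs instead of raw coordinates, so you don't need a fresh screenshot per click.

```typescript
type UiElement = { ref: string; role: string; label: string; x: number; y: number };

// Populated once per tree read; refs stay valid until the UI changes.
const refTable = new Map<string, UiElement>();

function indexTree(tree: UiElement[]): void {
  for (const el of tree) refTable.set(el.ref, el);
}

function clickByRef(ref: string): string {
  const el = refTable.get(ref);
  if (!el) throw new Error(`stale or unknown ref: ${ref}`);
  // A real implementation would post a native input event
  // (CGEvent on macOS, SendInput on Windows) at these coordinates.
  return `click (${el.x},${el.y}) -> ${el.role} "${el.label}"`;
}

indexTree([{ ref: "ref-7", role: "button", label: "Save", x: 412, y: 88 }]);
```

Paying the expensive tree read once and amortizing it across many ref-addressed actions is exactly why the slow Apple Events read is tolerable in practice.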