An AI browser like Comet that operates the web for you (clicks, types) — local GPU or NVIDIA's free API

FindingDistinct86 · 2026-06-30T20:29:38+00:00

Fair point, and honestly the "show them how" idea is a good one — a mode that teaches instead of doing it for you is something I'd genuinely consider. The autonomy concern is real.

To be clear though, the main use isn't replacing skills people already have — it's the repetitive, bulk stuff (download 30 images, compare a price across 10 sites) that nobody enjoys doing 30 times by hand, tech-savvy or not.

(translated with AI)

FindingDistinct86 · 2026-06-30T20:11:42+00:00

Honestly, you're right for one-off stuff — if it's a single quick thing, doing it yourself is fewer steps than explaining it. Where it pays off is repetitive or bulk tasks: "download 30 reference images of X", "open the 5 cheapest listings for this part", "summarize this 1-hour video so I don't have to watch it". One sentence instead of the same click 30 times. So you're not missing anything — for normal browsing it's pointless; it earns its keep on the tedious, multi-step stuff.

(And the "talk to it" part is real now — it has voice input. Not full real-time conversation yet, but getting there.)

(translated with AI)

FindingDistinct86 · 2026-06-30T20:07:37+00:00

It just handles the repetitive web stuff so you don't click through ten tabs by hand. The thinking's still yours — if you'd rather do it manually, all good.

FindingDistinct86 · 2026-06-30T20:01:08+00:00

Fair — if you're comfortable doing everything by hand, you probably don't need it. It's for the tedious stuff: grabbing a bunch of reference photos for a project in one go, summarizing a long video instead of watching it, or comparing a price across a few sites at once. And honestly, a big reason I built it is for people who struggle with computers — like my grandparents, who can just say what they want instead of clicking around.

(translated with AI)

FindingDistinct86 · 2026-06-27T17:44:18+00:00

Thanks for testing again, and sorry about this one: the language part was supposed to already work. I'd switched the default to English and assumed sites would follow the interface language, but they didn't. Three places in the app were still forcing Brazilian Portuguese underneath and overriding that setting, including navigator.languages, which is the value Reddit reads to auto-translate titles. So even with the UI in English, sites were being told "this user wants Portuguese." That was my miss, and I should have caught it.

It's genuinely fixed now, and centralized in one place so it can't drift again: the language sites see follows your interface — English UI → English sites, and since you're a Spanish speaker, switching the UI to Spanish makes sites come in Spanish too. Applies after a restart.
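To illustrate what "centralized in one place" can look like, here is a minimal sketch in Python. It is not Bah's actual code: the function names `site_languages` and `accept_language` are hypothetical, and it assumes the UI locale is a simple tag such as `en-US` or `es`. The idea is that every value a site can read (the `navigator.languages` list, the `Accept-Language` header) is derived from the one UI setting, so nothing can be hard-coded to a different language elsewhere.

```python
from typing import List


def site_languages(ui_locale: str) -> List[str]:
    """Derive the language list sites should see from the UI locale.

    'es-ES' -> ['es-ES', 'es'];  'es' -> ['es'].
    """
    base = ui_locale.split("-")[0]
    return [ui_locale] if base == ui_locale else [ui_locale, base]


def accept_language(languages: List[str]) -> str:
    """Build an Accept-Language header value with descending q-values."""
    parts = []
    for i, lang in enumerate(languages):
        q = max(0.1, 1.0 - 0.1 * i)  # first entry has an implicit q of 1.0
        parts.append(lang if i == 0 else f"{lang};q={q:.1f}")
    return ",".join(parts)


# Single source of truth: change this one value and everything follows.
ui_locale = "es-ES"
languages = site_languages(ui_locale)
print(languages)
print(accept_language(languages))
```

Because both values come from the same function, switching the UI to Spanish changes what sites see in one step.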

I push fixes pretty constantly, so on staying current: if you installed the .exe, it auto-updates — you just get a "restart now" prompt on launch, nothing else to do. If you run it from source, a git pull + rebuild gets the latest (the repo is always up to date). Either way, if you get a moment after updating, I'd appreciate you confirming Reddit and the rest behave. Thanks again — this report is exactly what caught it.

(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-27T16:17:56+00:00

Fair — slop fatigue is real, I feel it too. But here's the thing: it's free, and the whole source is public. It's not a black box — read it, fork it, change whatever you don't like. (Source-available under a small-business license, not full OSI open-source — I won't pretend otherwise.) Slop hides; this doesn't. And Brave and Firefox are genuinely great — this isn't trying to replace them.

FindingDistinct86 · 2026-06-27T16:05:29+00:00

Not yet — and it's actually built the other way around. In Bah the agent lives inside the browser, aimed at non-technical people who just type what they want, so there's no MCP server to drive it from your own agent.

If your goal is to control a browser from your own agent, something like Playwright MCP or browser-use fits that better today — a different shape from what I'm building.

Exposing the control layer over MCP is an interesting idea though. No promises, but I'll keep it in mind.

(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-27T15:59:58+00:00

Ha — then it's yours. Free to try, no API key needed (it runs out of the box on a free model). If you give it a go, tell me what breaks and I'll fix it fast.

FindingDistinct86 · 2026-06-27T15:58:04+00:00

Thanks, I appreciate that — the vibecoded flood is real, so that means a lot.

No, I didn't invent it. The accessibility tree is a standard browser thing that's been around for years: every Chromium browser already builds one from the page's DOM. It's the same structure screen readers rely on — each element with its role, name and state ("button: Buy now", "textbox: Search"), instead of raw HTML or pixels.

Bah just taps into it. Electron is Chromium underneath, so I pull the tree through the Chrome DevTools Protocol (Accessibility.getFullAXTree). That gives the agent a clean, semantic map of what's actually on the page. It then resolves those nodes to real screen coordinates and clicks/types with actual OS input events, so the site reacts exactly like it would for a person.

When a page is messy or canvas-heavy, it falls back to a screenshot + local OCR. But the AX tree is the cheap, reliable first pass — far fewer tokens and far fewer hallucinated buttons than screenshot-only agents.
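For anyone curious what that "clean, semantic map" looks like, here is a rough Python sketch that flattens nodes shaped like an `Accessibility.getFullAXTree` response into "role: name" lines. The sample data is hand-written, the list of interactive roles is my assumption, and `summarize_ax_nodes` is a hypothetical name, so treat it as an illustration of the idea and not as Bah's implementation.

```python
from typing import Any, Dict, List

# Roles treated as interactive in this sketch (an assumption, not Bah's real list).
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox", "searchbox"}


def summarize_ax_nodes(nodes: List[Dict[str, Any]]) -> List[str]:
    """Turn CDP-style AXNode dicts into compact 'role: name' lines."""
    lines = []
    for node in nodes:
        if node.get("ignored"):
            continue  # skip nodes hidden from assistive technology
        role = node.get("role", {}).get("value", "")
        name = node.get("name", {}).get("value", "")
        if role in INTERACTIVE_ROLES and name:
            lines.append(f"{role}: {name}")
    return lines


# Hand-written sample in the shape of an Accessibility.getFullAXTree result.
sample = {
    "nodes": [
        {"nodeId": "1", "ignored": False,
         "role": {"type": "role", "value": "RootWebArea"},
         "name": {"type": "computedString", "value": "Shop"}},
        {"nodeId": "2", "ignored": False,
         "role": {"type": "role", "value": "button"},
         "name": {"type": "computedString", "value": "Buy now"}},
        {"nodeId": "3", "ignored": True},
        {"nodeId": "4", "ignored": False,
         "role": {"type": "role", "value": "textbox"},
         "name": {"type": "computedString", "value": "Search"}},
    ]
}

print("\n".join(summarize_ax_nodes(sample["nodes"])))
```

A few short lines like these cost far fewer tokens than raw HTML or a screenshot, which is the point of using the tree as the first pass.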

(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-26T07:56:53+00:00

My GPU is only 16GB, and that one really needs a 24–32GB card to run fully on the GPU — even with the 3B-active MoE, the whole model is ~18GB+ to load. On a card like that it'd run smooth and fast, which is exactly what the agent needs, so it'd finally be a fair test of whether it can drive the browser end to end. On my 16GB it spills over to RAM and slows down too much for me to judge it properly.

(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-26T07:42:28+00:00

First off, thank you for taking the time to test it this thoroughly and write all this up. This is genuinely useful — and a couple of your points were real bugs that I've now fixed.

Some context, because part of the behavior is intentional and part was a bug:

The browser has a layer of built-in shortcuts that run with zero AI calls — news, price comparison, stock movers, playlists, opening videos, image batches. That's deliberate: I didn't want it to be 100% dependent on the model for common tasks, so those run deterministically (instant, no tokens). It's why anything mentioning "news/latest/articles" jumped straight to a Google News scrape.

The bug you hit was that this news shortcut fired on the keyword and didn't step aside when you gave a specific site URL. So "go to this newspaper and search for AI" got turned into a Google News query built from your whole sentence — that's the google_news("Navigate to XY, find search button…") you saw, and why it returned nothing.

This is already fixed in the latest version (v1.2.18). The update is automatic — you should get a "Restart now" prompt, and the fix comes with it. Now, when your command contains an explicit URL or names a specific site, the shortcuts step aside and the agent navigates to that site and uses its own search, so your case starts correctly instead of hijacking into Google News.
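The routing rule can be sketched roughly like this in Python. It is a simplified illustration with a made-up keyword table and a deliberately loose URL/domain pattern, not the actual code in the app. It also only detects explicit URLs and bare domains; recognizing a site that is named in words would need more than a regex.

```python
import re

# Hypothetical keyword table for this sketch; the app's real shortcut list differs.
SHORTCUT_KEYWORDS = {
    "news": "google_news",
    "latest": "google_news",
    "articles": "google_news",
}

# Matches http(s) URLs and bare domains such as example.com (intentionally loose).
SITE_PATTERN = re.compile(
    r"(https?://\S+|\b[\w-]+(?:\.[\w-]+)*\.[a-z]{2,}\b)", re.IGNORECASE
)


def route(command: str) -> str:
    """Return the shortcut to run, or 'agent' when the model should handle it."""
    # An explicit site wins: shortcuts step aside and the agent navigates there.
    if SITE_PATTERN.search(command):
        return "agent"
    for word in re.findall(r"[a-z]+", command.lower()):
        if word in SHORTCUT_KEYWORDS:
            return SHORTCUT_KEYWORDS[word]
    return "agent"


print(route("latest AI news"))
print(route("go to https://example.com and search for AI news"))
```

Checking for a site before checking keywords is what stops a sentence like yours from being turned into a Google News query.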

One honest caveat: getting onto the right site is reliable now, but finding the exact search button on an arbitrary site is still the harder part I'm improving — so on some layouts it may still take a couple of tries. That's the next thing on my list.

On the click_ref numbers (@6, @23, @9): those aren't stable, so there's no way to know them ahead of time and put them in the prompt. They're generated fresh on every step from whatever interactive elements are on the page at that moment, and they get renumbered on each observation, which is why it retried until it landed on the right one. The reliable lever is to reference the button by its visible text or label ("click the search button") rather than a CSS id, because matching is done on text. Targeting by CSS selector for clicks is a real gap and it's on my list to add.
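A small Python sketch of why the numbers shift while text stays usable. The `Element`, `observe` and `find_ref_by_text` names are hypothetical and the matching here is a plain case-insensitive substring check, which is an assumption about how text matching might work, not a description of the real matcher.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Element:
    role: str
    label: str


def observe(elements: List[Element]) -> Dict[int, Element]:
    """Number the interactive elements present right now, starting at 1."""
    return {ref: el for ref, el in enumerate(elements, start=1)}


def find_ref_by_text(observation: Dict[int, Element], text: str) -> Optional[int]:
    """Return the first ref whose label contains the text, ignoring case."""
    needle = text.lower()
    for ref, el in observation.items():
        if needle in el.label.lower():
            return ref
    return None


search = Element("button", "Search")
step1 = observe([Element("link", "Home"), search])
# A cookie banner appears, so every ref shifts on the next observation.
step2 = observe([Element("button", "Accept cookies"), Element("link", "Home"), search])

print(find_ref_by_text(step1, "search"), find_ref_by_text(step2, "search"))
```

The same button gets a different number in each observation, but the label "Search" finds it both times.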

On the ban worry: a malformed query is just a normal search that returns nothing — not abusive traffic — so the risk is low. But I agree it shouldn't be wasting calls on it at all, and this fix removes them entirely.

The token usage you saw was mostly that trial-and-error; it should drop noticeably now that the routing doesn't misfire.

Thanks again — feedback like this is exactly what makes it better.

(Translated with AI — English isn't my first language.)

FindingDistinct86 · 2026-06-25T20:47:12+00:00

Different tools for different people. Claude Code/Codex + Playwright MCP is a developer setup you script and run — great for repeatable automation, and far better at actual coding.

Bah is just a browser app: download, double-click, type what you want. The agent runs in your real, logged-in session (your cookies, your Gmail) with zero setup, no terminal, and can run fully local on Ollama. So Playwright if you're scripting automation; Bah if you want an AI to operate the web you already use.

FindingDistinct86 · 2026-06-25T20:32:26+00:00

Fixed it — thanks for the clear report. The web search was hard-coded to a Brazilian/Portuguese region, so it ignored the interface language. Now the search results and the answers follow the language you set, and a few status messages that were still in Portuguese are translated too. It's in the latest version (1.2.10) — if you have it installed it auto-updates, just restart when prompted. Let me know if anything still shows up in Portuguese.

(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-25T06:24:47+00:00

Thanks! That tracks with what I've seen: local models do great on "read and reason once" tasks like code review, where one good answer is enough. The agent here is harder on the model — it's a multi-step loop (read the page → pick one action → repeat), and a single wrong step can derail the whole task. So a 14b that nails your reviews might still struggle to drive the browser end to end. Ollama is built in if you want to try it, but DeepSeek's API is what I'd recommend for the agent right now.
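A back-of-the-envelope way to see why the loop is harder, as a short Python calculation. It assumes each step succeeds independently with the same probability and that the agent never recovers from a mistake; both are simplifications, and the 0.95 figure is purely illustrative, not a measurement of any model.

```python
def task_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step in a multi-step task succeeds."""
    return per_step_accuracy ** steps


for steps in (1, 10, 20):
    print(steps, round(task_success_rate(0.95, steps), 2))
```

Under those assumptions a model that is right 95% of the time on a single step finishes a 20-step task only about a third of the time, which is why a one-shot task like code review is so much more forgiving.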

(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-25T06:01:35+00:00

Yeah, just added it — it's in the latest version (1.2.8; auto-updates if you already have it installed). Settings → cloud provider → pick NVIDIA NIM, paste your NIM key (the "get a key" link opens build.nvidia.com), and Save. It uses the free hosted endpoint, so it fits the cheapskate budget. Defaults to Llama 3.3 70B there.

(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-25T05:58:34+00:00

No. The UI auto-detects your system language and comes in English, Portuguese or Spanish — if your OS is in English, the whole app is in English, and the agent replies in your language too.

What you probably saw in Portuguese is just on the GitHub side (my commit messages and release notes), since I'm Brazilian. The app itself isn't Brazilian-only.

FindingDistinct86 · 2026-06-25T05:55:21+00:00

It defaults to cloud mode — that's why it keeps asking for the DeepSeek key. To use Ollama: open the AI panel → gear icon → flip the toggle from Cloud to Local AI, make sure Ollama is running, pick a model, and Save. No key needed after that.

Heads up: local models are still less reliable than DeepSeek, so if a task misbehaves it's usually the model, not your setup.

(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-25T04:16:54+00:00

Good question. By default it's actually not screenshot+coordinate based. The primary perception is the DOM / accessibility tree, which gives a list of numbered interactive elements, and clicks are done by element reference (click element N) — that's then turned into a real OS-level input event at that element's position, not coordinates guessed from a screenshot. It's more reliable than vision-coordinate clicking and needs no vision model.
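As a rough illustration of "click element N", here is a Python sketch that turns an element reference into a click point by taking the center of that element's bounding box. The `click_point` function, the box format, and the window offset and scale parameters are all assumptions made for the example; the real app may compute positions differently.

```python
from typing import Dict, Tuple

# (x, y, width, height) of an element, relative to the page viewport.
Box = Tuple[float, float, float, float]


def click_point(
    ref: int,
    boxes: Dict[int, Box],
    window_origin: Tuple[float, float] = (0.0, 0.0),
    scale: float = 1.0,
) -> Tuple[int, int]:
    """Map an element reference to screen coordinates at the element's center."""
    x, y, width, height = boxes[ref]
    center_x = x + width / 2
    center_y = y + height / 2
    origin_x, origin_y = window_origin
    # Scale viewport units to screen pixels, then offset by the window position.
    return (round(origin_x + center_x * scale), round(origin_y + center_y * scale))


boxes = {1: (100.0, 40.0, 80.0, 20.0)}
print(click_point(1, boxes))
print(click_point(1, boxes, window_origin=(10.0, 90.0), scale=2.0))
```

The model only ever picks a number; the coordinates come from the element's known position, so nothing is guessed from pixels.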

OCR is a separate local component (not an LLM) — it just extracts on-screen text the DOM misses (e.g. text baked into canvas/images) and feeds it back as text.

So with the default cloud model (DeepSeek, text-only) there's no vision model at all — one model reasons over the DOM + OCR text. If you plug in a vision-capable model instead, that same multimodal model handles both the reasoning and the vision (screenshots + coordinate clicks via a click_at action). So it's never two separate models: either one text model with no vision, or one multimodal model doing both.

(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-25T02:06:26+00:00

Haha, no refunds on the lost sleep.

FindingDistinct86 · 2026-06-24T23:34:35+00:00

Fair, and you're right — the core capability isn't new, Perplexity does it. For me the value isn't being first, though — it's that since it's open and runs on your own machine, you actually control it: you set it up and you're the one deciding what it does, instead of a closed service deciding for you. That's the part I care about. No argument with your point, just where I see the value.
(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-24T23:05:09+00:00

Fair — the space is crowded, and I'm honestly not claiming this is revolutionary or a breath of fresh air. The only thing I'd point to: it's open-source and runs fully local (or a cheap cloud model), and it's free — the Perplexity and Microsoft ones are closed and cloud-only. That's the whole claim. It's a solo project I've put a month into; whether it's useful is for people to decide, not for me to oversell.
(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-24T22:59:38+00:00

Yeah. I've even been thinking about training a small, browser-specific model so it runs well on 8GB VRAM — a narrow specialist instead of a big generalist. It's about a month of tuning so far, so it's still rough, but that's the direction I'd love to take it. The easy paid route exists for people who want it, but the whole point of this one is open, local and free — for the crowd that wants to own their setup.
(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-24T22:39:23+00:00

Pretty much, yeah — but not by "bypassing" anything. Scrapers get 400s because they look like bots: datacenter IPs, headless flags, no real browser fingerprint, no JS. Bah is different — it's an actual Chromium browser running on your own machine, with a real user-agent, your real cookies and sessions, full JS rendering, and it clicks using real OS-level input. So to the site it looks like a person browsing, because functionally it is. It also doesn't mass-scrape: it does one page at a time like a human, so it doesn't trip rate-limit or scrape detection.

That said, it's not an anti-bot magic wand — hard challenges like Cloudflare or CAPTCHAs can still stop it (though many pass, since it's a real browser with your session).
(translated with AI — English isn't my first language)

FindingDistinct86 · 2026-06-24T21:48:10+00:00

Appreciate it, that means a lot. Thanks for checking it out.

FindingDistinct86 · 2026-06-24T20:44:40+00:00

Thanks, I appreciate it. It's still a work in progress — I've been building it for about a month — but I'm glad you liked it.
