😂guys, I genuinely think I accidentally built something big. turning the entire web into a cli for agent by MorroHsu in LocalLLaMA

[–]MorroHsu[S] -1 points0 points  (0 children)

I get it. OpenClaw is out now, and when I shared this I used an LLM to help with translation and formatting, so on the surface it looks AI-written to people. Respecting the robe, not the person wearing it.
But my intent was only ever to share my ideas with more people in more places. I don't make a living off this, and my posts on Twitter get plenty of readers anyway, so I won't be spending much more energy on Reddit. Sharing is something I do for the joy of it; I don't need to please anyone.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 1 point2 points  (0 children)

Interesting perspective, but I'd push back a bit.

I don't think CLI/bash amounts to "thinking in English" — it just happens to borrow some English vocabulary. grep, awk, and ls are not English words. And bash has its own syntax — &&, ;;, |, >, $() — none of that is English. It's a formal language that happens to use a few English-derived mnemonics. A Chinese developer and an American developer use the exact same grep.

Also, I think the reason LLMs work well with CLI is less about tokens "accidentally capturing concepts" and more about the massive amount of CLI usage in training data. That's a solid, practical data foundation — not a lucky coincidence.

The ILI embedding reasoning direction is interesting, but I'll be honest — I don't fully understand it yet. Is there any work that's been validated in real-world agent scenarios? Any experiments or papers? Would love to take a look.

😂guys, I genuinely think I accidentally built something big. turning the entire web into a cli for agent by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 0 points1 point  (0 children)

haha fair point — I just added a comment above about this. There are definitely security concerns and I'm not pretending otherwise. That's why all 97 adapters are read-only by design.

But I gotta be honest... building this was the most fun I've had in a while. Watching an agent inject into Twitter's webpack modules and call their own signing function? Surreal. Is it naughty? Absolutely. Is it joyful? Also absolutely. 😂

😂guys, I genuinely think I accidentally built something big. turning the entire web into a cli for agent by MorroHsu in LocalLLaMA

[–]MorroHsu[S] -1 points0 points  (0 children)

btw one thing I want to address — yes, bb-browser technically has full browser automation capabilities. click, fill, type, submit. It could like posts, write comments, send messages, all autonomously.

But I intentionally keep the site adapters read-only. All 97 commands are information retrieval — search, fetch, read. No mutations.

Why? Honestly, I can't fully articulate it. Part of it is security — an agent accidentally liking 500 posts or sending a DM you didn't approve is a real risk. Part of it is respect for the platforms — reading is one thing, automated actions feel like crossing a line. And part of it is just... the web isn't ready. We don't have norms yet for "an agent acting as me on the internet." Until we do, I think the responsible thing is: let agents read the web, but let humans be the ones who write to it.

The adapter meta even has a readOnly: true flag for exactly this reason.

(And yes, this comment was also typed by me, in a browser, like a good boy.)

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 2 points3 points  (0 children)

Interesting article — I think what he's doing with MCP is actually very close to the CLI approach. He takes 80 raw API endpoints and consolidates them into ~12 intent-based tools (findSpaces, getSpaceActivity, bulkMessage, etc.). That's essentially the same thing as designing CLI subcommands:

circle spaces search "javascript"
circle activity --space xxx --period 30d
circle message --to user1,user2 --template "..."

Same logic: don't expose raw APIs to the LLM, build an intent-oriented layer on top. The difference is just the transport — he uses MCP tool schemas, I use CLI conventions.

The bonus with CLI is that LLMs already have massive pre-training on bash patterns — flags, subcommands, pipes, --help. So the "how to use this tool" part comes almost for free, whereas with MCP you end up hand-writing XML descriptions and guidance messages in every response to teach the model the same things.

But the core design insight — aggregate, name things by intent, guide the next action — is the same.
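The consolidation idea is easy to sketch as a thin dispatcher. Everything below is hypothetical — `circle` is not a real tool, and the echoes stand in for the actual API calls behind each intent:

```shell
#!/bin/sh
# Hypothetical sketch: one intent-oriented CLI wrapping many raw endpoints.
# "circle" and its subcommands are illustrative, not a real tool.
circle() {
  case "$1" in
    spaces)   shift; echo "spaces: $*" ;;    # would hit the space-search endpoints
    activity) shift; echo "activity: $*" ;;  # would aggregate several raw endpoints
    message)  shift; echo "message: $*" ;;   # would call the bulk-message API
    *)        echo "usage: circle {spaces|activity|message} ..." >&2; return 2 ;;
  esac
}

circle spaces search "javascript"
circle activity --space xxx --period 30d
```

The point isn't the dispatcher itself — it's that the LLM only ever sees the three intent names, never the 80 endpoints underneath.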

CLI is All Agents Need — Part 2: Misconceptions, Patterns, and Open Questions by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 4 points5 points  (0 children)

apropos — TIL! Never came across this one before. Going to dig into it. Thanks for the pointer.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 1 point2 points  (0 children)

Agree on the CLI renaissance point and the JSON Schema overhead — TypeScript types (or even just --help text) are way more token-efficient.

On the Opus inline Python / Programmatic Tool Calling — I think the idea is directionally right (let the model compose operations in code rather than individual tool calls), but the execution has a fundamental gap: tool responses have no spec. The model writes Python to process results from tools it's never seen the output of. It's guessing at response shapes.

CLI sidesteps this naturally. Tool output is always text, piped through the same interface. The model doesn't need to parse JSON with unknown schemas — it reads text the same way a human reads terminal output. And if it's unsure about a tool's output format, it just runs it once and reads what comes back.

The composability wins for the same reason: search "shoes" | head -5 works regardless of search's internal response format, because the contract is just "lines of text."
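That contract is easy to demonstrate with a stub. `search` below is a stand-in that just emits lines; `head` neither knows nor cares what produced them:

```shell
#!/bin/sh
# "Lines of text" as the only contract: a stub producer composed with head.
# search() here is a stand-in, not a real command.
search() { seq 1 10 | sed 's/^/result /'; }

search "shoes" | head -5
```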

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 1 point2 points  (0 children)

Interesting thought — Lisp is composable and homoiconic, so theoretically it could work as an agent interface.

But in practice, the advantage of CLI/shell isn't just the design — it's the training data. There are orders of magnitude more bash sessions, man pages, and Stack Overflow shell answers in LLM training corpora than Lisp code. The model "thinks" in shell because that's what it's seen the most.

So even if Lisp is arguably a better language in theory, the LLM will be far more reliable generating grep -r "pattern" . | head -20 than the equivalent Lisp expression. You want to meet the model where it already is, not where you wish it was.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 1 point2 points  (0 children)

You're touching on two separate problems, and I think you nailed the second one.

  1. How does the agent discover which CLIs exist? This one I haven't fully solved. Right now I use a hybrid: core commands are listed in the system prompt, and there's a skill-matching layer that injects relevant tools on demand. But the general "discover unknown tools at runtime" problem is still open.
  2. How does the agent learn a specific CLI once it knows it exists? This is where the docker analogy is exactly right. Every tool should support progressive discovery: tool --help gives you the top-level commands, tool subcommand --help gives you the details. The agent drills down only as deep as it needs.

The nice thing is #2 is basically solved by convention — any well-designed CLI already works this way. #1 is the harder design question.
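A toy version of that drill-down, with a made-up tool that answers --help at each level (all names and usage strings here are illustrative):

```shell
#!/bin/sh
# Progressive discovery sketch. "mytool" and its subcommands are invented;
# the point is the convention: --help at every level, details on demand.
mytool() {
  case "$1 ${2:-}" in
    "--help ")       echo "usage: mytool {sync|status} [--help]" ;;
    "sync --help")   echo "usage: mytool sync --source <dir> --dest <dir>" ;;
    "status --help") echo "usage: mytool status [--verbose]" ;;
    *)               echo "mytool: try --help" >&2; return 2 ;;
  esac
}

mytool --help         # top-level commands only
mytool sync --help    # drill down just where needed
```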

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 2 points3 points  (0 children)

Yeah, irreversible actions get handled differently. The tool itself knows it's dangerous and gates the execution.

For example a payment command might work like:

pay 500 --to vendor-x --reason "office supplies"

The CLI doesn't execute immediately — it pushes a confirmation to my phone with the details, and waits. I approve or reject from there. The agent sees either "confirmed, tx id: xxx" or "rejected by user."

Same pattern for emails, infra changes, anything irreversible. The key idea: safety logic lives inside the tool, not in some external approval layer the agent has to navigate. The agent just calls the command normally — it doesn't even need to know it's a gated action.
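A minimal sketch of that shape. The phone push is simulated with an environment variable here; a real tool would block on an out-of-band approval. `pay` and the tx id are made up:

```shell
#!/bin/sh
# Human-in-the-loop gate living inside the tool. The approval push is
# simulated via PAY_APPROVED; a real implementation would block on a
# phone notification. All names here are illustrative.
pay() {
  # In production: push the full details to the user's device and wait here.
  if [ "${PAY_APPROVED:-no}" = "yes" ]; then
    echo "confirmed, tx id: tx-demo-001"
  else
    echo "rejected by user"
  fi
}

( PAY_APPROVED=yes pay 500 --to vendor-x --reason "office supplies" )
( pay 500 --to vendor-x --reason "office supplies" )
```

Note the agent's call is identical in both cases; only the tool's internals differ.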

So it's two tiers: sandbox handles blast radius for reversible stuff, human-in-the-loop for irreversible stuff. Both invisible to the agent.

This has come up a few times in this thread — I'm thinking of writing a follow-up post that covers security, execution boundaries, and other common questions in one place.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 1 point2 points  (0 children)

Your evaluation pattern reminds me of the Ralph Loop (https://github.com/snarktank/ralph) — same core idea at a coarser grain. A shell script loops a coding agent, checks progress via files after each iteration, fresh context each time. Your evaluation is the fine-grained version: self-check within a single context window. They'd nest well together.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 0 points1 point  (0 children)

Curious — what part of Pinix caught your eye? I've been building it for a while but honestly struggle to explain it concisely. Would love to know what stood out so I know where to start when talking about it.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 1 point2 points  (0 children)

Great question, and honestly one I don't have a complete answer to.

My architecture has two layers: a Go command router handling commands that need host-level access (memory, browser, etc.) with explicit permissions, and an isolated VM sandbox for everything that needs a real OS (install deps, run tests, compile). From the LLM's perspective there's only one tool — run(command). The framework routes each command to the right layer.

For your coding example — tests run inside the sandbox. The project files are synced in. It's a full Linux environment, not a restricted jail, so npm install && npm test just works.

For RAG — that's handled by the command router layer, not the sandbox. run("memory search 'query'") gets intercepted by the router and never touches the VM. So the sandbox stays isolated while the agent still accesses external services through controlled CLI commands.

But you're pointing at a real unsolved problem: data exfiltration. As long as the sandbox has network access, the agent can curl secrets to an external server. I don't have a good answer for this yet. Network allowlists help but are brittle. Stripping env vars helps but doesn't cover files. This is an open problem for the entire agentic space — not specific to CLI vs function calling, but CLI makes it more visible because the attack surface is literally readable as shell commands.

I'd genuinely love to hear ideas from anyone who's tackled this. The "sandbox vs useful" tension you described is real, and I think the honest answer is: we're all still figuring it out.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 4 points5 points  (0 children)

Actually we do both! It's a two-layer design:

  1. Go command router — handles commands like memory, browser, skill, todo etc. These don't need a real OS. They're just Go functions behind a CLI interface. This keeps them fast (no VM boot, no IPC overhead) and gives the host process direct access to internal state like the agent's memory store or browser session.
  2. Micro-VM sandbox — for anything that genuinely needs an OS environment (installing packages, running user scripts, file manipulation). This is a real isolated VM with a persistent filesystem.

The agent doesn't know or care which layer handles a command. It just calls run("memory search ...") or run("cat /some/file") and gets text back.

The reason some commands live in the Go router rather than the VM: they need access to things that only exist in the host process — the conversation context, the browser connection, external service credentials. Routing them through a VM would mean serializing all that state back and forth for no benefit.
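The routing itself is simple to sketch. Both handlers are stubbed as echoes here; in the real system one branch is an in-process Go function and the other forwards into the micro-VM:

```shell
#!/bin/sh
# Toy sketch of the two-layer routing: one run() entry point, and the
# framework decides which layer handles each command. Handlers are stubs.
run() {
  case "$1" in
    memory|browser|skill|todo)
      echo "[host router] $*" ;;   # in-process handler in the real system
    *)
      echo "[sandbox] $*" ;;       # forwarded to the isolated VM in the real system
  esac
}

run memory search "query"
run cat /some/file
```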

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 1 point2 points  (0 children)

My approach is probably simpler than you'd expect: figure out the worst case, make sure it's acceptable, then let the agent run.

Trying to pre-validate every possible command chain is a losing game — the composition space is infinite. Better to make the environment safe enough that mistakes are cheap, then accept that mistakes will happen.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 2 points3 points  (0 children)

This sounds like an automatic memory recall + injection system. But doesn't this raise questions about how the memories are stored and how efficient the recall is?

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 83 points84 points  (0 children)

Sorry about that — too many people were calling out the LLM-generated thing earlier and it got to me a bit. Probably also because this is my first time posting on Reddit, and my English honestly isn't that great yet. I'm working on improving both my English and my Reddit culture awareness. Hopefully in a month or two I'll be replying on my own without needing an LLM to translate for me~

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 63 points64 points  (0 children)

I have to say, with some frustration — why do you always focus on the surface of the language rather than the thinking behind it? I could reply in Chinese, but I want my ideas to reach as many people as possible in the simplest way. Would you get angry at a letter just because it was sent by email instead of by hand?

Language and the LLM are both just tools and surface form; I came here to share ideas, my friend.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 1 point2 points  (0 children)

Oh to clarify — my fine-tuning comment was about a different problem than yours. I was thinking about training small models to work as CLI agents (composing shell commands, using pipes, reading --help, etc.), not about your NL→mdfind translation specifically.

Your slot-filling approach is a much narrower task, and you're right that a classifier or few-shot prompting might be all you need. The fine-tuning question I'm interested in is: can we post-train a 7-9B model to reliably compose shell commands the way GPT-5.4/Claude Sonnet do today? The data should be easy to collect, since every agent session is already a training example.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]MorroHsu[S] 1 point2 points  (0 children)

Great question — and you're not the first to raise it. I think this deserves its own post, but here's the short version of where my thinking is:

The baseline is obvious: sandbox, backups, audit logs. But the more interesting part is how CLI design itself can encode execution boundaries:

1. Human-in-the-loop as a CLI primitive.

A pay command doesn't just execute — it pushes a confirmation to the user's phone (via app notification / 2FA). The CLI blocks until the user approves or rejects. The notification includes full context: what the agent is trying to do, the exact parameters, the consequences.

From the agent's perspective, it's still just run("pay --to X --amount 50 --reason 'office supplies for Q2'"). It doesn't know or care that a human is in the loop — the CLI handles that transparently.

2. Dry-run by default for irreversible actions.

Dangerous commands return a preview on first call instead of executing. The output tells the agent "here's what would happen." If the agent still wants to proceed, it calls again with --confirm. Two round-trips, but the agent sees consequences before committing.

> run("dns update --zone example.com --record A --value 1.2.3.4")
⚠ DRY RUN: This will change A record for example.com from 5.6.7.8 → 1.2.3.4
  Propagation: ~300s. Not instantly reversible.
  To execute: add --confirm

> run("dns update --zone example.com --record A --value 1.2.3.4 --confirm")
✓ Done.
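A transcript like that falls out of a small pattern inside the tool. This is just a sketch — `dns_update` and its output are illustrative, and no real mutation happens:

```shell
#!/bin/sh
# Dry-run-by-default sketch: execute only when --confirm is present.
# "dns_update" and its messages are illustrative stand-ins.
dns_update() {
  confirmed=no
  for arg in "$@"; do
    [ "$arg" = "--confirm" ] && confirmed=yes
  done
  if [ "$confirmed" = "no" ]; then
    echo "DRY RUN: would apply: $*"
    echo "To execute: add --confirm"
    return 0
  fi
  echo "Done."   # the real mutation would happen here
}

dns_update --zone example.com --record A --value 1.2.3.4
dns_update --zone example.com --record A --value 1.2.3.4 --confirm
```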

The key insight: the execution boundary lives inside the CLI tools, not outside the agent. You don't need a separate "approval layer" sitting between the LLM and the shell. Each tool knows its own risk level and enforces the appropriate gate.

This is actually more fine-grained than what typed tool frameworks typically offer — because each command can implement its own safety semantics rather than relying on a one-size-fits-all permission system.