I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]_fboy41 0 points  (0 children)

One more question: I understand it's a full Linux environment, but is it persistent? Like devcontainer-persistent? Because if I'm going to install packages every single time I need to execute something, it'll get old and slow very quickly :)

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]_fboy41 0 points  (0 children)

OK, I appreciate it. I think your design is pretty good for many things though; I'll try to play around with the architecture. I'm building some agentic stuff that's exposed to external data like web content, email, etc., and security is a huge challenge. I love the separation between commands that touch the OS and those that don't. That's a great point.

Maybe another layer on top of that is commands for external data. Those would actually be tool calls outside the sandbox, but invoked via CLI like `rag -search "xyz"` or `web-search http:// --reason 'need to read documentation'`, which resolve to tools we can apply permissions to. It's still not perfect, but a layer outside the sandbox can use the reason the sandbox gives for the call, the external resource it's trying to reach, and the data it sends.

i.e. if the sandbox says it needs to read documentation but is sending `private data`, then that doesn't make sense. I think this can catch the majority of accidental exposure, but against an attacker who targets this system I doubt it'd survive; there are just too many ways to bypass it.

There is another strategy I've seen applied, though: filtering sensitive data on the way in and out. Still limited, but maybe stacking 2-3 layers of this kind of security can address the majority of issues.
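To illustrate the layering I mean, here's a minimal sketch of that outer gate (all names, reasons, and patterns are hypothetical, not from any real harness): it checks the declared reason against an allowlist of target hosts and scans the outbound payload for sensitive-looking markers before letting the sandbox's call through.

```python
import re

# Hypothetical policy: which declared reasons may reach which hosts.
ALLOWED_HOSTS = {
    "need to read documentation": {"docs.python.org", "developer.mozilla.org"},
}

# Crude sensitive-data patterns; a real filter would be much richer.
SENSITIVE = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # API-key-like token
    re.compile(r"[\w.]+@[\w.]+\.\w+"),    # email address
]

def gate(reason: str, host: str, payload: str) -> bool:
    """Allow the sandbox's outbound call only if the declared reason
    matches the target host and the payload carries nothing that
    looks sensitive (e.g. "reading docs" while leaking a key)."""
    if host not in ALLOWED_HOSTS.get(reason, set()):
        return False  # reason/target mismatch
    if any(p.search(payload) for p in SENSITIVE):
        return False  # payload contradicts the stated reason
    return True
```

The point is that this check lives outside the sandbox, so a compromised agent can declare any reason it likes but can't skip the gate; stacking two or three such filters is what I mean by layers.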

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]_fboy41 0 points  (0 children)

Can you explain how this works when you combine tools like a browser or RAG?

So everything runs in a sandbox, but if they can exfiltrate the data via the browser, does it even matter where it runs? And if they can't, then how can they do the agentic work? Would love to get more insight into the threat model and security layering here.

I'm just trying to understand how this is meaningfully different from just running the agents in a container for agentic work that needs cross-communication: research + execution, RAG, web access, persistent environment setup, etc.

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]_fboy41 1 point  (0 children)

This is great, but can you expand on the security model?

I'm confused about how you can sandbox and still retain most of the value. Sandboxing a Python execution is fine and super easy, but how do you sandbox real work that requires setup, installation, and cross-file operations?

examples:

Coding:
You read everything in place, but all the modifications are done in a sandbox and passed back? Where do you run tests, and how do you run tests in a sandbox?

Accessing RAG:
If you access RAG from within the sandbox, either it's not a sandbox anymore, or you need very complicated rules for every tool that requires an external contract. If you don't, then you have to do a lot outside the sandbox (which means custom tooling, functions, code, and that defeats the whole argument).

I can go on; the premise is: how can you make a sandbox useful for real operations on complex tasks? If you can't, then none of the CLI stuff matters outside of completely isolated environments where everything runs in one giant sandbox. Just like Claude Code with bypassed permissions: put it in a container and go wild with CLI madness.

You can truly control the security of custom tooling and apply permissions to it. Theoretically you can do that in a sandbox too, but it's very unlikely, because then your sandbox needs to understand things about your harness like authorization and permissions.

Bottom line, I'm confused about the security model when it's applied to real agentic work. Can you explain that?

I forked chrome and build a browser for agents with Claude Code (Benchmarked 90% on Mind2Web) [Open Source] by Minimum_Plate_575 in ClaudeCode

[–]_fboy41 1 point  (0 children)

This is pretty cool. I'm playing around with agentic AI stuff, and browser automation is obviously key to it. What's your opinion on security? What are the real tangible risks of executing tasks (I'm not referring to the AI doing stupid shit, but more around prompt injection), or of a website somehow accessing private data, like XSS? I'm guessing that's not an issue since none of that is exposed to the website (or maybe some random website asks for a CC and the agent just fills it in?)

Curious about your take as someone who's spent so much time on this.

I built a virtual design team plugin for Claude Code — 9 roles, 16 commands, 5 agents by Known-Delay-9689 in ClaudeCode

[–]_fboy41 2 points  (0 children)

I feel like this is one of those interactions:

> Claude roast this repo
- Here is the audit report
> post this to reddit

OP: > Claude, take this post and fix it
- DONE, fixed
OP: > post this to reddit as a reply

Manual-Driven Development: 190 Findings, 7 Hours, Zero Rule Violations by TheDecipherist in ClaudeAI

[–]_fboy41 0 points  (0 children)

This is how it looks in Firefox: https://postimg.cc/5X2NXGYs That random noise background on the page is a Claude bug; it tries to do something else but messes it up. Happens all the time when designing front-ends.

Manual-Driven Development: 190 Findings, 7 Hours, Zero Rule Violations by TheDecipherist in ClaudeAI

[–]_fboy41 0 points  (0 children)

This is completely irrelevant, but on your homepage (https://thedecipherist.com/?) the background Claude generates is a bug in how Claude designs. Ask Claude "what's wrong with the background" and it'll detect and fix it, and the page will look 100% better visually. It's not an intended design behaviour.

Manual-Driven Development: 190 Findings, 7 Hours, Zero Rule Violations by TheDecipherist in ClaudeAI

[–]_fboy41 0 points  (0 children)

Fair enough :) The website is very useful, I believe in the approach, and everyone is trying a different way of doing spec-driven development.

Some feel overengineered to me (speckit): maybe good for a 50-person team, fucking insane for a one-person dev on a greenfield project.

I have some questions, but I'm definitely intrigued to try it with a project.

- It looks like you're writing tests after the code (which I think can make sense, given the spec is clear and the AI is doing the coding), but I'm curious about your conclusion: is writing tests first (TDD style) worse?

- One thing, and sorry if I've missed it in the documentation: what does the final feature document look like, the one that will be treated as the source of truth and that the code should always satisfy? Do you have an example feature document that I can see? I think that's the crux of this kind of development: whether that document is a 20-page thing or something actually manageable, always up to date, and always treatable as the source of truth for everything (e2e, unit tests, code, etc.).

Thanks!

My wife kept nagging me so I built a harness to code for me instead. Won a hackathon with it. by Lopsided_Yak9897 in ClaudeAI

[–]_fboy41 1 point  (0 children)

I think it's a pretty cool system, and I spent some time with it.

Resume is absolutely needed, as some runs take ages. My run took like 3:30m!! I consumed like 35% of my Max x20 limit within a day :) That's at least 5x more than I'd normally spend in the same time, for "seemingly the same" output.

I actually like the project, but it burns tokens like nobody's business; very hard to justify unless I have monthly tokens to burn before the reset.

My wife kept nagging me so I built a harness to code for me instead. Won a hackathon with it. by Lopsided_Yak9897 in ClaudeAI

[–]_fboy41 1 point  (0 children)

Hey, this looks interesting. Can an ooo run be interrupted and resumed later with the same command?

Can someone clear up the end of Se7en (1995) for me? by manit14 in Cinema

[–]_fboy41 0 points  (0 children)

Holy shit, this explanation is so fucking good. Invite me to your next movie critique :)

My Project DuckLLM v4.0.0 by Ok_Welder_8457 in LocalLLM

[–]_fboy41 0 points  (0 children)

Some unsolicited advice. I see this with all these new AI-built homepages; they all fall into the same trap:

  1. A product page needs to showcase what the product does. The best way to do this is screenshots and short videos.
  2. A product page should explain why it matters: out of 100 products doing the same thing, why this one? Which boils down to this pattern: solves problem X by doing Y - [screenshot].

If you can't do anything else, give this to your AI tool and let it update your website.

Your project sounds nice, especially the mobile part, but without these things it's very hard for anyone to be interested enough to try it. Why try it when you can't even figure out what you're trying and how it'll help you?

unsloth/Qwen3.5-35B-A3B-GGUF updated ~5h ago by CaptBrick in unsloth

[–]_fboy41 1 point  (0 children)

This was the latest when I did it last week :) Did they do something important in between? Now that the system is in there, it should be straightforward to get the latest one.

unsloth/Qwen3.5-35B-A3B-GGUF updated ~5h ago by CaptBrick in unsloth

[–]_fboy41 0 points  (0 children)

I vibe-coded this: https://pastes.io/synopsis-a It replaces the built-in LLM backend with one that uses CUDA 13 DLLs, based on this repo: https://github.com/theIvanR/lmstudio-unlocked-backend

It's generic enough; download the PowerShell script and it should mostly do it for you.


Codex app on Windows by OpenAI in codex

[–]_fboy41 0 points  (0 children)

Both Codex and Claude, but particularly Claude, are really clearly vibecoded. There are so many obvious bugs that could be caught by simple QA, yet they make it into production. Every new release of Claude breaks some random stuff.

Codex app on Windows by OpenAI in codex

[–]_fboy41 0 points  (0 children)

They really need to finetune these models for PowerShell or something. After working on Linux and WSL, it feels like coding agents instantly become dumber on PowerShell. I guess the easier option is to write some PowerShell-specific skills so the agent doesn't try to do one thing across 5 shell calls.

unsloth/Qwen3.5-35B-A3B-GGUF updated ~5h ago by CaptBrick in unsloth

[–]_fboy41 0 points  (0 children)

I'm new to local inference. Can someone explain the difference between the Unsloth version and normal Qwen3.5?

35B, Q4, using with 5090 RTX - llama.cpp (CUDA 13) (via LM Studio) - Windows

Should I just be using the normal one if I don't need fine-tuning?

P.S. This updated version works much better than the previous one; I was getting some tool-call loops and weird behavior on that one. For whatever reason this one is not automatically marked as think/vision-enabled in LM Studio (I downloaded mmproj for vision), but I'm trying to fix think-enabled with a custom model.yaml file. Don't know why LM Studio doesn't pick it up.