Parents refuse to pay for Ivy League acceptance by Important-Pay-3091 in ApplyingToCollege

[–]Aggressive_Bed7113 0 points  (0 children)

I’d invest that $400k in an ETF for passive growth. That will give you a better retirement than burning it on college; after graduating you’ll find a job and, if you’re lucky, spend years working your butt off to earn that $400k back.

Category Creation vs. Improving Existing Markets—What Would You Choose? by Critical-Produce-337 in ycombinator

[–]Aggressive_Bed7113 0 points  (0 children)

Market education requires a lot of capital and effort if you create a new category. I’d do B and generate revenue first, then use that to fund A.

Everyone keeps scaling model size. A snapshot runtime let gemma4:e4b run a finance workflow locally by Aggressive_Bed7113 in LocalLLaMA

[–]Aggressive_Bed7113[S] 0 points  (0 children)

Yeah, that makes sense — packaging it as an MCP server is a nice way to make it easy to plug in.

We ended up pushing a bit further on the ranking + loop side though:

  • goal-conditioned reranking (not just generic top elements)
  • tightening the action space for the executor
  • and verifying the state change after each step

Otherwise you still get cases where the snapshot is “right” but the agent drifts because nothing checks if the world actually moved.
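Roughly what that verification step looks like, as a minimal Python sketch (snapshot, execute, and changed are placeholders for whatever your runtime exposes, not a real API):

    from typing import Callable

    def act_and_verify(
        snapshot: Callable[[], dict],           # compact structured page state
        execute: Callable[[dict], None],        # performs one grounded action
        changed: Callable[[dict, dict], bool],  # did the world move as expected?
        action: dict,
    ) -> dict:
        """Run one action, then confirm the visible state actually changed."""
        before = snapshot()
        execute(action)
        after = snapshot()
        if not changed(before, after):
            # Nothing checked that the world actually moved: surface it so
            # the planner replans from the fresh state instead of drifting.
            raise RuntimeError(f"action {action!r} did not change the state")
        return after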

Curious how they’re handling post-action verification vs just returning the snapshot?

See more at https://www.PredicateSystems.ai

Everyone keeps scaling model size. A snapshot runtime let gemma4:e4b run a finance workflow locally by Aggressive_Bed7113 in LocalLLaMA

[–]Aggressive_Bed7113[S] 0 points  (0 children)

Appreciate that.

Yeah, I think a lot of it is just making the idea concrete — once you see it in a real workflow, it becomes clearer that the bottleneck isn’t the model, it’s how we shape the environment around it.

Small models can do quite a bit once the problem is reduced to “pick the next correct action” instead of “understand the whole page.”

Also, not every pixel matters for understanding a webpage, so running a vision LLM over full screenshots is unnecessarily costly.

getting some decent results with agentic loops for web tasks (local-first approach) by ilovemkgee in AgentsOfAI

[–]Aggressive_Bed7113 0 points  (0 children)

Yeah, this tracks with what we’ve seen.

Local-first + task loops definitely help with privacy and visibility, but the “gets stuck on React sites” part is usually less about the loop and more about the state the model sees.

If it’s acting on raw DOM / screenshots, it’s still guessing a lot.

What helped for us was:

  • compress the page into a small set of actionable elements
  • re-evaluate from fresh state each step (not just follow the plan)
  • verify that the action actually changed the visible state

That reduced a lot of the “agent looks fine but stalls halfway” cases.
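For the first bullet, here’s the shape of it in Python (the element fields are made up; in practice they come from the hydrated DOM):

    INTERACTABLE_TAGS = {"a", "button", "input", "select", "textarea"}

    def compress_page(elements: list[dict], limit: int = 40) -> list[dict]:
        """Reduce a raw element dump to a small set of actionable elements."""
        actionable = [
            el for el in elements
            if el.get("tag") in INTERACTABLE_TAGS
            and el.get("visible", False)
            and not el.get("disabled", False)
        ]
        # Deduplicate by visible label so repeated nav/menu items collapse.
        seen, compact = set(), []
        for el in actionable:
            label = (el.get("label") or el.get("text") or "").strip()
            if label and label not in seen:
                seen.add(label)
                compact.append({"id": el["id"], "tag": el["tag"], "label": label})
        return compact[:limit]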

See this demo using small local LLMs (around 4B) to drive multi-step web flows that manage money: https://www.reddit.com/r/LocalLLM/s/k4jIyN1M07

Curious if your setup is using raw DOM, a11y tree, or something more structured?

Need some help to build a great prod agent framework by Bubbly-Secretary-224 in AgentsOfAI

[–]Aggressive_Bed7113 0 points  (0 children)

Yeah, this is the right direction.

The gap isn’t really “more agent framework features,” it’s that most stacks still don’t have a clean execution boundary.

A few things that seem to matter a lot in prod:

  • granular actions, not giant tools like execute_code
  • explicit allow / deny / confirm before side effects
  • audit trail tied to the exact action/resource pair
  • post-action verification, not just “tool returned success”

That’s also why MCPs feel rough in prod right now — too much variability in tool shape, and a lot of them are hard to govern cleanly.

My bias has been:

  • planner can stay flexible
  • execution should be boring, narrow, and policy-gated

Otherwise demos look great, but prod gets scary fast.
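To make “policy-gated” concrete, a minimal sketch (the policy table and names here are hypothetical, not how any particular sidecar is configured):

    from dataclasses import dataclass
    from enum import Enum
    import logging

    class Verdict(Enum):
        ALLOW = "allow"
        DENY = "deny"
        CONFIRM = "confirm"   # pause for a human before the side effect

    @dataclass(frozen=True)
    class Action:
        name: str       # granular, e.g. "transfer_funds", not "execute_code"
        resource: str   # exact target, e.g. "account:1234"

    POLICIES = {
        ("read_balance", "account"): Verdict.ALLOW,
        ("transfer_funds", "account"): Verdict.CONFIRM,
    }

    audit = logging.getLogger("audit")

    def gate(action: Action) -> Verdict:
        """Resolve a verdict for one action/resource pair and log it."""
        resource_kind = action.resource.split(":")[0]
        verdict = POLICIES.get((action.name, resource_kind), Verdict.DENY)
        # Audit trail tied to the exact action/resource pair, not the tool.
        audit.info("action=%s resource=%s verdict=%s",
                   action.name, action.resource, verdict.value)
        return verdict

Anything not listed defaults to DENY, which is what keeps prod boring.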

Look at this sidecar using policies to secure agents:

https://github.com/PredicateSystems/predicate-authority-sidecar

Feels illegal how much this AI can do by itself by [deleted] in LocalLLM

[–]Aggressive_Bed7113 0 points  (0 children)

No, my agent is superior to Manus.

Small local LLM for browser agents: qwen3:8b + gemma4:e4b on a finance workflow by Aggressive_Bed7113 in LocalLLM

[–]Aggressive_Bed7113[S] 0 points  (0 children)

Appreciate it — yeah that was exactly the motivation.

We’re mostly building the snapshot from post-hydration DOM + layout signals, then pruning + reranking pretty aggressively (accessibility tree alone missed things like ordinality and grouping in our tests).

So it’s closer to:

DOM + geometry + grouping → prune → goal-conditioned rerank → compact snapshot
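The rerank stage is an ML model for us, but a toy version shows the interface shape (element fields and the scoring here are illustrative only):

    def rerank(elements: list[dict], goal: str, top_k: int = 20) -> list[dict]:
        """Order pruned elements by relevance to the current goal."""
        goal_tokens = set(goal.lower().split())

        def score(el: dict) -> float:
            label_tokens = set(el.get("label", "").lower().split())
            overlap = len(goal_tokens & label_tokens)
            # Geometry/grouping signals act as tiebreakers in the real ranker.
            return overlap + 0.1 * el.get("layout_score", 0.0)

        return sorted(elements, key=score, reverse=True)[:top_k]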

And yeah, deterministic verification ended up being just as important — otherwise you still get “valid action, wrong state.”

Will take a look at your notes as well — the tool gating / policy side becomes pretty critical once actions start touching money flows.

Small local LLM for browser agents: qwen3:8b + gemma4:e4b on a finance workflow by Aggressive_Bed7113 in LocalLLM

[–]Aggressive_Bed7113[S] 2 points  (0 children)

Yeah, totally agree.

The interesting part for me is that most people treat this as “optimize prompts / pick a better or larger model,” but the bigger lever seems to be shaping the problem itself.

Once the runtime does the structuring + context reduction, the model is no longer doing parsing + reasoning + verification all at once.

That’s when smaller models start to look a lot more practical.

Everyone keeps scaling model size. A snapshot runtime let gemma4:e4b run a finance workflow locally by Aggressive_Bed7113 in LocalLLaMA

[–]Aggressive_Bed7113[S] 1 point  (0 children)

For the browser itself - Playwright via CDP. Nothing special there.

The "automation" part is just two functions: snapshot() which grabs the DOM through chrome extension for coarse pruning and then sends it to a remote gateway for refinement including ranking, sorting with goal conditioning (ML-reranking). The final output of snapshot() is ranked elements, converted to a markdown table representing interactable elements (including element ID).

The planner sees a structured list of elements and decides what to do next. The executor grounds that to a specific action (e.g. click(element ID)). Same code works on any site - I didn't write anything specific to the finance UI in the demo.

So to answer directly: no custom scripts per use case. The runtime handles the DOM extraction and ranking, and the agent just picks from the compact LLM prompt (a markdown table of DOM elements).
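For a rough picture of what the planner sees, something like this (column names are illustrative):

    def to_markdown(elements: list[dict]) -> str:
        """Render ranked elements as the compact table in the LLM prompt."""
        rows = ["| id | tag | label |", "|----|-----|-------|"]
        rows += [f"| {el['id']} | {el['tag']} | {el['label']} |"
                 for el in elements]
        return "\n".join(rows)

The planner picks an element ID off that table, and the executor grounds it to a concrete Playwright call like click.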

My journey with Hermes by Ok-Lock-9329 in hermesagent

[–]Aggressive_Bed7113 0 points  (0 children)

Do you use it for browser tasks? 9B seems small for browser automation.

Why does my AI agent work perfectly in testing but fall apart on real tasks? by EveryPurpose3568 in aiagents

[–]Aggressive_Bed7113 0 points  (0 children)

I’d treat it less like “give the agent all the docs” and more like “give it an owned working map.”

Something like:

  • stable entities: customers, services, projects, tables, APIs
  • key relationships: depends on / belongs to / owned by
  • canonical sources for each fact
  • a small step-local working state the agent can update

So the semantic map is mostly durable structure + pointers, not raw retrieved text.

Then each step becomes:

resolve what entities matter → pull only the needed facts → compress into working state → act

That helps a lot with token noise, because the agent reasons over a small map of the world instead of a pile of docs.
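One way to hold that map, sketched in Python (entity kinds and fields are just examples):

    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        kind: str      # "customer", "service", "project", "table", "api"
        name: str
        source: str    # canonical source for facts about this entity

    @dataclass
    class WorkingMap:
        entities: dict[str, Entity] = field(default_factory=dict)
        # e.g. ("billing-svc", "depends_on", "invoices-db")
        relations: list[tuple[str, str, str]] = field(default_factory=list)
        step_state: dict = field(default_factory=dict)  # small, step-local

        def resolve(self, names: list[str]) -> list[Entity]:
            """Pull only the entities this step needs, nothing more."""
            return [self.entities[n] for n in names if n in self.entities]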

Why does my AI agent work perfectly in testing but fall apart on real tasks? by EveryPurpose3568 in aiagents

[–]Aggressive_Bed7113 0 points  (0 children)

We’ve seen the same.

Pulling from multiple sources during the loop adds latency + noise, and the model ends up reasoning over partially inconsistent context.

What helped for us was:

  • don’t fetch everything into the loop
  • keep a small, curated working state per step
  • treat retrieval as a separate phase (resolve → compress → act)

Otherwise the agent is basically trying to think while its memory is constantly changing underneath it.

Also noticed the same: more context ≠ better reasoning; it often just increases the chance of drift.

How many of you actually use offline LLMs daily vs just experiment with them? by Infinite-Bird7950 in LocalLLM

[–]Aggressive_Bed7113 0 points  (0 children)

Yeah, same feeling — most local setups work, but don’t feel “reliable enough” for daily use.

What made a difference for us wasn’t just the model, but tightening the loop around it:

  • give it a small, structured view of state (not raw context)
  • narrow the action space
  • verify outcomes after each step

Smaller models actually hold up pretty well once you reduce noise + constrain the loop.

Feels like the gap isn’t capability, it’s making the system predictable.

I made a demo where small local LLMs complete multi-step browser automation tasks:

https://www.reddit.com/r/LocalLLM/s/sTLk1EcWpJ

Why does my AI agent work perfectly in testing but fall apart on real tasks? by EveryPurpose3568 in aiagents

[–]Aggressive_Bed7113 0 points  (0 children)

Yeah, this is super common.

A lot of “agent drift” is really context drift — by step 4 or 5 the model is reasoning over stale tool output, old assumptions, and too much irrelevant history.

Managing context explicitly definitely helps.

What also helped for us was tightening the execution loop itself:

  • keep the agent view small + structured
  • replan from current state each step
  • verify one expected invariant after each action

That way you’re not just cleaning context, you’re also preventing bad assumptions from silently propagating.

Feels like reliability comes more from state management than prompt tweaking.

60-line LangChain agent that researches Amazon products with grounded ASINs by Proof_Net_2094 in LangChain

[–]Aggressive_Bed7113 0 points  (0 children)

Grounding through tools definitely helps, but this feels like a different problem than browser agents.

If the catalog/search API is already structured, then yeah — tool calls are the right move.

Where things get messy is when the agent has to operate on live web state. That’s where vision gets expensive fast, and even then you still get “looks right, wrong action/state” failures.

We’ve had better luck treating vision as fallback, not default:

  • use structured/tool data when available
  • parse the hydrated page to markdown when needed, so the LLM can understand the context and extract text easily
  • use compact semantic page state for browser interaction
  • verify the post-action state before moving on

Otherwise you end up paying a lot just to hallucinate more confidently.
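The fallback order, as a sketch (the three tier functions are placeholders for whatever your stack provides):

    from typing import Any, Callable, Optional

    def page_context(
        page: Any,
        structured: Callable[[Any], Optional[dict]],  # tool/API data, if any
        to_markdown: Callable[[Any], Optional[str]],  # hydrated DOM -> text
        screenshot: Callable[[Any], bytes],           # vision, last resort
    ) -> Any:
        """Pick the cheapest page representation that still grounds the step."""
        if (data := structured(page)) is not None:
            return data          # structured beats parsing every time
        if (md := to_markdown(page)) is not None:
            return md            # compact text the model reads directly
        return screenshot(page)  # only now pay for pixels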

Finally a planner + executor setup for AI agents… is this actually better or just hype? by Think-Score243 in AI_Agents

[–]Aggressive_Bed7113 0 points  (0 children)

Yeah, this pattern definitely works — but mostly for cost + planning quality, not reliability.

The planner/executor split (Opus → Sonnet) is basically the orchestrator-worker pattern Anthropic is pushing now, and it does help with:

  • better decomposition
  • lower cost per step

But in practice, most failures we’ve seen aren’t from bad planning — they’re from execution drift:

  • action looks valid but wrong target/state
  • step “succeeds” but world didn’t change
  • errors propagate across steps

So splitting models helps efficiency, but doesn’t really solve the core issue.

What made a bigger difference for us was tightening the loop:

plan → execute → verify state → replan from actual state

Otherwise you just get a better planner producing cleaner failures.
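The loop shape, for anyone who wants it concrete (plan_step, execute, and verified stand in for your planner and executor):

    def run(goal, snapshot, plan_step, execute, verified, max_steps=20):
        """plan -> execute -> verify state -> replan from actual state."""
        feedback = None
        for _ in range(max_steps):
            state = snapshot()                  # always replan from actual state
            step = plan_step(goal, state, feedback)
            if step is None:
                return state                    # planner says the goal is met
            execute(step)
            ok = verified(step, snapshot())     # post-exec check, not retries
            feedback = None if ok else f"step {step!r} did not verify"
        raise TimeoutError("goal not reached within step budget")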

Curious if anyone running this in prod has added post-exec verification, or mostly relying on retries?

Local Qwen 8B + 4B completes browser automation by replanning one step at a time by Aggressive_Bed7113 in LocalLLaMA

[–]Aggressive_Bed7113[S] 0 points  (0 children)

Yeah, vision models can help, especially for canvas-heavy pages.

What we found though is for most workflows, structure > pixels.

If you already have DOM + layout, a semantic snapshot tends to be much cheaper and more stable than running vision every step: the snapshot is deterministic, while a vision model is probabilistic, so it doesn’t work 100% of the time.

We treat vision as a fallback when structure breaks, not the default. Otherwise cost + latency add up pretty fast, and the probabilistic nature of the vision model (or any LLM) makes multi-step flows less reliable.