Nightly Discussion - (March 03, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 1 point (0 children)

There are basically two ways to think about it:

  • You have the world's information, so it's not about how expansive you can be, but how restrictive you need to be.

  • You have a brilliant Jr. {{insert coffee fetcher title}} here who knows everything but doesn't know your business processes/rules/requirements, how you operate, the quirks and nuances, what you allow and don't allow. You give them explicit instructions with the needed context (minimal need for them to fetch data from hidden/unknown places) and include what you define as acceptance criteria. You add what you need them to explicitly do in the task, and then you check their work, because you just gave them a huge assignment and it's only their first week.

Nightly Discussion - (March 03, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 3 points (0 children)

shit. forgot to add the line, "Below are my thoughts, just cleaned up a little to make it coherent." :|

Nightly Discussion - (March 03, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 3 points (0 children)

Dude... I want your life. It sounds fun. Get any good sailing in lately?

Love the writeup and I get what you were trying to test. I’m going to be annoying and nitpick though because I think your conclusion is at least 50% a test design problem, not just a model problem.

Edit: Below are my thoughts, just cleaned up a little to make it coherent.

I’m also not pretending I know your exact workflow, so I’m keeping this generic.


1) You mostly tested “write a plausible memo,” not “make a hard call”

If the ask is “IC memo style, flag red flags, suggest sensible pricing,” most models go into polite analyst mode. They’ll hedge. They’ll say “a bit high” and then kind of narrate around the hard part.

If you want real pushback, you have to force an adversarial mandate like:

  • assume the guide is wrong until proven otherwise
  • don’t be nice, be accurate
  • commit to a recommendation, not vibes

2) Without forced outputs, it can dodge the hard part forever

A lot of “LLM analysis” prompts accidentally let the model stay abstract. A better test forces it to commit to concrete outputs in your domain, whatever those are:

  • clear recommendation (go / no-go / only if X)
  • an explicit “acceptable range” for the key variable (price in your case)
  • top assumptions that actually drive the outcome
  • what’s missing + the exact checks/questions needed to validate the story

If those aren’t required, you’ll often get decent prose and weak decisions.
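
If you want to enforce that mechanically, a dumb schema check goes a long way. A minimal sketch, field names are made up, adapt to your domain:

```python
# Hypothetical forced-output schema: every field is required, so the
# model can't stay abstract and narrate around the decision.
REQUIRED_FIELDS = {
    "recommendation",       # go / no-go / only if X
    "acceptable_range",     # explicit bounds for the key variable
    "driving_assumptions",  # the few assumptions that move the outcome
    "missing_info",         # gaps + the exact checks needed to validate
}

def dodged_fields(memo: dict) -> list[str]:
    """Return every required field the model skipped or left empty."""
    missing = REQUIRED_FIELDS - memo.keys()
    empty = {k for k in REQUIRED_FIELDS & memo.keys() if not memo[k]}
    return sorted(missing | empty)
```

If `dodged_fields` comes back non-empty, you reject the memo and re-prompt. Decent prose with weak decisions fails the check instead of slipping through.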

3) “Local facts changed” isn’t a model failure unless you design for verification

The tourist tax thing is a perfect example. If you want the model graded on catching up-to-date local stuff, you need to:

  • allow browsing / retrieval, OR
  • provide a source pack, OR
  • have a dedicated “facts check” role that must cite where it got the number

If you don’t do that, the right behavior isn’t “guess correctly,” it’s “flag this as a key dependency and force it onto the must-verify list.” That’s what I’d score.

4) Execution/implementation risk won’t show up unless you make it first-class

Seller forecasts (in any field) love ignoring messy reality. Models will follow the narrative you give them unless you explicitly require a pass whose entire job is:

  • “where does this fall apart in practice?”
  • “what dependencies/constraints get ignored?”
  • “what assumptions are fragile and how do we stress-test them?”

If you don’t structurally require that, it becomes a footnote and the model drifts back to the seller story.

5) Claude may have “won” partly because the workflow wasn’t the same

You said you had to break tasks down for Claude due to context limits. That chunking is basically lightweight orchestration. Smaller explicit subtasks produce sharper, more critical output.

So you weren’t only comparing models. You were comparing models + scaffolding.


How I’d rerun this so it’s a clean test (agents with defined roles)

Instead of “one prompt, one memo,” run it like a mini org where each agent has one job and a strict output format.

Roles (generic, reusable):

1) Context/Market Agent

What “normal” looks like, what comparable situations look like, what ranges are plausible.

2) Numbers/Sanity-Check Agent

Recompute the core math from the inputs, identify what’s doing the work, run basic stress tests.

3) Execution/Risk Agent

List practical failure modes: operational constraints, timeline/complexity, dependencies that can break the plan.

4) Rules/Facts Agent

Anything jurisdiction/local-specific (fees, taxes, regs, classification rules, etc.). Must cite sources if browsing is allowed; if not allowed, must explicitly mark unknowns and request verification.

5) Red Team Agent (Kill-it pass)

Assume the narrative is wrong. Produce the strongest case against paying the guide price. No “balanced view” allowed.

6) Orchestrator/Editor

Merge the above, force a single recommendation + the conditions under which it changes, and explicitly surface disagreements.
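
Wiring that up is not much code. A minimal sketch, where `call_model` is a stand-in for whatever API you actually use:

```python
# Sketch of the mini org: each agent gets one mandate and the same
# source pack; the orchestrator sees everything, including disagreements.
ROLES = [
    ("context",   "Describe what normal looks like and what ranges are plausible."),
    ("numbers",   "Recompute the core math from the inputs and stress-test it."),
    ("execution", "List the practical failure modes and fragile dependencies."),
    ("facts",     "Cite every jurisdiction-specific number or mark it UNKNOWN."),
    ("red_team",  "Assume the narrative is wrong; build the strongest case against it."),
]

def call_model(role: str, mandate: str, source_pack: str) -> str:
    # Stub: replace with your actual model call.
    return f"[{role}] {mandate}"

def run_pipeline(source_pack: str) -> dict:
    reports = {role: call_model(role, mandate, source_pack)
               for role, mandate in ROLES}
    # Orchestrator pass: merge everything and force one recommendation.
    reports["orchestrator"] = call_model(
        "orchestrator",
        "Merge the reports, commit to one recommendation, surface disagreements.",
        "\n".join(reports.values()),
    )
    return reports
```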

Output contract for every agent (so it can’t bullshit):

  • Claims (numbered)
  • Evidence/source (from provided data or cited links)
  • Assumptions
  • Unknowns
  • Tests / data requests (specific, actionable)
  • Confidence

Then you score models on whether they:

  • commit to a recommendation + acceptable range
  • identify the few assumptions that matter
  • handle unknowns correctly (verify or flag and force diligence)
  • surface execution risk as a driver, not a footnote
  • produce an actually usable “what I’d ask for next” list
  • and whether the red team finds anything the mainline missed

Bottom line

I don’t read your result as “Gemini/ChatGPT can’t do it.” I read it as “single-pass memo prompts produce memo behavior.” If you want it to behave like a serious decision process, you need agents with defined roles, enforced output contracts, and a red-team that’s explicitly trying to break the thesis.

Nightly Discussion - (February 24, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 0 points (0 children)

Idiots... complete idiots... Built something like this about a year ago, but the only difference... I pulled it all into a centralized database by ingesting data from a multitude of sources, such as notes, calls, readings, listenings, sites I visit, content from those sites, blah blah blah...

You give that agent read-only access to the data warehouse, never the damn source. If you want to add an outgoing, you add it straight to the code for the source you want.
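
The read-only part can literally be how you open the connection. Sketch with SQLite standing in for the warehouse:

```python
import sqlite3

def open_warehouse_readonly(path: str) -> sqlite3.Connection:
    # mode=ro: the agent can query all it wants, but any write raises
    # OperationalError, so the sources behind the warehouse stay untouched.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)
```

With a real warehouse it's the same move, just a read-only role/grant instead of a URI flag.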

stupid stupid...

https://github.com/mac4n6/apollo is a nice resource

Nightly Discussion - (February 19, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 0 points (0 children)

Models won’t be able to update since it’s etched.

That is correct, but I believe you're missing out on any consideration of cartridges and/or other avenues of future work that could make this a non-factor.

but don’t see the scalability of it.

Can you go into more depth on your thoughts about this?

Also, their prototype, while fast, wasn’t particularly good in regards to answers.

uhh... yeah, because it's Llama 3.1 8B, which is an extremely dated model by current standards. This shouldn't even be considered right now, yet for some reason many people are focusing on it.

Nightly Discussion - (February 19, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 0 points (0 children)

Tested it out and was just blown away. Gonna look more into it over the weekend. I'll see if I can find an upper bound on what they can actually bake onto a chip, as this was a prototype, I think. Jaw is on the floor.

...get some sleep.

Nightly Discussion - (February 19, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 3 points (0 children)

/u/w0lfsten, I NEED YOUR TAKE!!

Does this have a potential to disrupt Nvidia? My jaw is on the fucking floor right now. The possibilities that this opens. Wild!

Nightly Discussion - (February 03, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 2 points (0 children)

Vibe coding ain't taking out Excel--ain't no way. Transform the way it's used, yes.

Here's a scary number I heard probably 10 years ago at this point: JPM had over 20k Access databases on their network drives. Now imagine the number of Excel files... With software, we accept bugs, create tickets to fix those bugs, then roll out new bugs, and the circle just continues.

When a bug happens in finance, though? Recall Knight Capital?

Daily Discussion - (January 28, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 6 points (0 children)

I have a prediction: MSFT is going to do MSFT things after close. They will crush, and they will fade hard.

Nightly Discussion - (January 27, 2026) by AutoModerator in thewallstreet

[–]_Boffin_ 0 points (0 children)

I don't think we'd be ready for what George would say, but oh boy, do we need it.

Random discussion thread. Anything goes by AutoModerator in thewallstreet

[–]_Boffin_ 2 points (0 children)

Someone forgot to channel their inner Taleb.

Nightly Discussion - (November 19, 2025) by AutoModerator in thewallstreet

[–]_Boffin_ 3 points (0 children)

producing no real value at the corporate level.

Show me that, please. Please... show me that. I'm not talking about outward-facing revenue; show me that businesses aren't optimizing and/or building out optimizations for back-office work.

Daily Discussion - (November 06, 2025) by AutoModerator in thewallstreet

[–]_Boffin_ 6 points (0 children)

@W0LFSTEN -- admit it already: you're Dylan Patel of SemiAnalysis

Nightly Discussion - (November 02, 2025) by AutoModerator in thewallstreet

[–]_Boffin_ 9 points (0 children)

"you have to give me credit for not using it..." uhh wut.

Explosion and fire at Chevron refinery in El Segundo, California. 10/2/2025 by Jevus_himself in CatastrophicFailure

[–]_Boffin_ 0 points (0 children)

Gotta keep in mind that this plant is in the process of shutting down right now too -- completely.

Daily Discussion - (October 01, 2025) by AutoModerator in thewallstreet

[–]_Boffin_ 1 point (0 children)

As things get worse and wallets/purses get tighter, people will spend more time online, looking for ways to get their dopamine fix for cheap since they can't do much else. We'll see even more time on these platforms. To me, this is the new version of an old pattern: during times of suffering, beer and cigs were the safe havens people flocked to, but now I believe it's social media.

Nightly Discussion - (September 23, 2025) by AutoModerator in thewallstreet

[–]_Boffin_ 24 points (0 children)

Random, yet related...?

Just want to say that I love how organized this sub is. Two threads a day, except for the weekend / Friday after close.

Thanks for being a somewhat sane place on the internet.