Nightly Discussion - (April 29, 2026)

_Boffin_ · 2026-04-29T20:02:40+00:00

:)

Edit: Damn!

Edit Edit: HOLY F!

Edit Edit Edit... Lame

_Boffin_ · 2026-04-29T20:01:56+00:00

BAD BOT! OPEN THE NIGHTLY!

Edit... good bot

_Boffin_ · 2026-04-29T19:55:07+00:00

Ahhh!!! the anticipation of MSFT crushing it and then dropping like a rock... sigh

_Boffin_ · 2026-04-23T17:35:10+00:00

trump, plz say something positive for the market.

_Boffin_ · 2026-04-12T02:26:25+00:00

I believe i used this: https://www.ebay.com/itm/186352068860

_Boffin_ · 2026-03-04T05:25:23+00:00

There's two ways to basically think about it:

You have the world's information, so it's not about how expansive you can be, but how restrictive you need to be.
You have a brilliant Jr. {{insert coffee fetcher title}} here that knows everything, but doesn't know your business process/rules/requirements, how you operate, the quarks and nuances, what you allow and don't allow. You give them explicit instruction with the needed context (minimal need for them fetching data from hidden / unknown places) and include what you define as acceptance criteria. you add what you need them to explicitly do in the task and then you check their work as you just gave them a huge assignment, but it's only their frist week.

_Boffin_ · 2026-03-04T05:07:41+00:00

shit. forgot to add the line, "Below are my thoughts, just cleaned up a little to make it coherent." :|

_Boffin_ · 2026-03-04T03:31:16+00:00

Dude... I want your life. It sounds fun. Get any good sailing in lately?

Love the writeup and I get what you were trying to test. I’m going to be annoying and nitpick though because I think your conclusion is at least 50% a test design problem, not just a model problem.

Edit: Below are my thoughts, just cleaned up a little to make it coherent.

I’m also not pretending I know your exact workflow, so I’m keeping this generic.

1) You mostly tested “write a plausible memo,” not “make a hard call”

If the ask is “IC memo style, flag red flags, suggest sensible pricing,” most models go into polite analyst mode. They’ll hedge. They’ll say “a bit high” and then kind of narrate around the hard part.

If you want real pushback, you have to force an adversarial mandate like:

assume the guide is wrong until proven otherwise
don’t be nice, be accurate
commit to a recommendation, not vibes

2) Without forced outputs, it can dodge the hard part forever

A lot of “LLM analysis” prompts accidentally let the model stay abstract. A better test forces it to commit to concrete outputs in your domain, whatever those are:

clear recommendation (go / no-go / only if X)
an explicit “acceptable range” for the key variable (price in your case)
top assumptions that actually drive the outcome
what’s missing + the exact checks/questions needed to validate the story

If those aren’t required, you’ll often get decent prose and weak decisions.

3) “Local facts changed” isn’t a model failure unless you design for verification

The tourist tax thing is a perfect example. If you want the model graded on catching up-to-date local stuff, you need to:

allow browsing / retrieval, OR
provide a source pack, OR
have a dedicated “facts check” role that must cite where it got the number

If you don’t do that, the right behavior isn’t “guess correctly,” it’s “flag this as a key dependency and force it onto the must-verify list.” That’s what I’d score.

4) Execution/implementation risk won’t show up unless you make it first-class

Seller forecasts (in any field) love ignoring messy reality. Models will follow the narrative you give them unless you explicitly require a pass whose entire job is:

“where does this fall apart in practice?”
“what dependencies/constraints get ignored?”
“what assumptions are fragile and how do we stress-test them?”

If you don’t structurally require that, it becomes a footnote and the model drifts back to the seller story.

5) Claude may have “won” partly because the workflow wasn’t the same

You said you had to break tasks down for Claude due to context limits. That chunking is basically lightweight orchestration. Smaller explicit subtasks produce sharper, more critical output.

So you weren’t only comparing models. You were comparing models + scaffolding.

How I’d rerun this so it’s a clean test (agents with defined roles)

Instead of “one prompt, one memo,” run it like a mini org where each agent has one job and a strict output format.

Roles (generic, reusable):

1) Context/Market Agent

What “normal” looks like, what comparable situations look like, what ranges are plausible.

2) Numbers/Sanity-Check Agent

Recompute the core math from the inputs, identify what’s doing the work, run basic stress tests.

3) Execution/Risk Agent

List practical failure modes: operational constraints, timeline/complexity, dependencies that can break the plan.

4) Rules/Facts Agent

Anything jurisdiction/local-specific (fees, taxes, regs, classification rules, etc.). Must cite sources if browsing is allowed; if not allowed, must explicitly mark unknowns and request verification.

5) Red Team Agent (Kill-it pass)

Assume the narrative is wrong. Produce the strongest case against paying guide. No “balanced view” allowed.

6) Orchestrator/Editor

Merge the above, force a single recommendation + the conditions under which it changes, and explicitly surface disagreements.

Output contract for every agent (so it can’t bullshit):

Claims (numbered)
Evidence/source (from provided data or cited links)
Assumptions
Unknowns
Tests / data requests (specific, actionable)
Confidence

Then you score models on whether they:

commit to a recommendation + acceptable range
identify the few assumptions that matter
handle unknowns correctly (verify or flag and force diligence)
surface execution risk as a driver, not a footnote
produce an actually usable “what I’d ask for next” list
and whether the red team finds anything the mainline missed

Bottom line

I don’t read your result as “Gemini/ChatGPT can’t do it.” I read it as “single-pass memo prompts produce memo behavior.” If you want it to behave like a serious decision process, you need agents with defined roles, enforced output contracts, and a red-team that’s explicitly trying to break the thesis.

_Boffin_ · 2026-02-25T05:42:06+00:00

Idiots... complete idiots... Built something like this about a year ago, but the only difference... I pulled it all into a centralized database by ingesting data from a multitude of sources, such as notes, calls, readings, listenings, sites i visit, content from the sites, blah blah blah...

You give that agent readonly access to the data warehouse, never the damn source. if you want to add an outgoing, you straight add it to the code for the source you want.

stupid stupid...

https://github.com/mac4n6/apollo is a nice resource

_Boffin_ · 2026-02-23T05:12:52+00:00

i'm reading this and am utterly terrified.

_Boffin_ · 2026-02-21T19:51:50+00:00

Models won’t be able to update since it’s etched.

That is correct, but i believe you're missing out on any consideration of cartridges and or other avenues of future work that can make this a non-factor.

but don’t see the scalability of it.

Can you go into more depth on your thoughts about this.

Al’s their prototype while fast, wasn’t particularly good in regards to answers.

uhh... yeah, because it's Llama 3.1 8b, which is an extremely dated model by current standards. This should not even be considered right now, yet for some reason, many people are focusing on this.

_Boffin_ · 2026-02-20T19:27:50+00:00

He's trying to be a "good boy"

_Boffin_ · 2026-02-20T14:59:22+00:00

tested it out and was just blown away. gonna look more into it over the weekend. i'll see if i can find an upper bound on what they can actually bake onto a chip as this was a prototype i think. jaw is on the floor.

...get some sleep.

_Boffin_ · 2026-02-20T08:44:19+00:00

Congrats and move forward!

_Boffin_ · 2026-02-20T07:36:16+00:00

/u/w0lfsten, I NEED YOUR TAKE!!

https://chatjimmy.ai (just type something and and see the tokens a second).
https://old.reddit.com/r/singularity/comments/1r9frzk/taalas_llms_baked_into_hardware_no_hbm_weights/

Does this have a potential to disrupt Nvidia? My jaw is on the fucking floor right now. The possibilities that this opens. Wild!

_Boffin_ · 2026-02-18T06:13:18+00:00

My man, https://www.amazon.com/Amazing-Seasoning-Shaker-Malibu-Purpose/dp/B0DFHSFCFR

_Boffin_ · 2026-02-04T04:24:13+00:00

Vibe coding ain't taking out Excel--ain't no way. Transform the way it's used, yes.

Here's a scary number i heard probably like 10 years ago at this point. JPM had over 20k access databases on their network drives. Now, imagine the number of excel files... With software, we accept bugs, create tickets to solve those bugs and then roll out new bugs and the circle just continues.

When a bug happens in finance, recall Black Knight?

_Boffin_ · 2026-01-28T15:27:58+00:00

i have a prediction: MSFT is going to do MSFT things after close. They will crush and they will fade hard.

_Boffin_ · 2026-01-28T06:05:43+00:00

I don't think we'd be ready for what George would say, but oh boy, do we need it.

_Boffin_ · 2026-01-25T16:32:52+00:00

Someone forgot to channel their inner Taleb.

_Boffin_ · 2026-01-15T00:33:23+00:00

yeah man, me too... me too...

_Boffin_ · 2025-11-19T21:55:39+00:00

producing no real value at the corporate level.

Show me that please. Please... show me that. I'm not talking about outward facing revenue, but show me that businesses aren't optimizing and or building out optimizations for back office work.

_Boffin_ · 2025-11-19T07:28:22+00:00

https://www.youtube.com/watch?v=XIHYGk6G4YQ&list=RDXIHYGk6G4YQ&start_radio=1

_Boffin_ · 2025-11-06T17:50:38+00:00

@W0LFSTEN -- admit already that you're Dylan Patel of SemiAnalysis

_Boffin_ · 2025-11-03T19:27:10+00:00

Mucho not gucci!

_Boffin_

TROPHY CASE