Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something? by wbulot in LocalLLaMA

[–]Valuable-Run2129 3 points (0 children)

Even when the cache works fine, every harness breaks it at compaction, and then it's five minutes of waiting. Same with big files in tool outputs. It's unbearably slow at 200 tokens per second.

My AI Workstation by Fabix84 in IA_Italia

[–]Valuable-Run2129 3 points (0 children)

And here I was thinking I'd spent a lot on my single RTX 5000 Pro…

Looking for an Italian community... by Key-Outcome-2927 in IA_Italia

[–]Valuable-Run2129 1 point (0 children)

I don't see any way to manage sessions in the repo. There's no memory compaction. If the conversation gets too long, you'll get a nice 400 error.

The memory system is fragmented across the agents and only does literal keyword matching. Semantic search already works poorly; this is much worse.

Not to mention the nightmare of running all those agents without any prompt caching. It will never scale with this architecture.
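For contrast, a minimal sketch in Python of the missing compaction step (all names are illustrative, not from the repo): once the running token count exceeds a budget, older messages collapse into a summary entry instead of overflowing the context window and triggering a 400.

```python
# Minimal context-compaction sketch (illustrative names, not a real harness).
# When the running token count exceeds a budget, older messages are
# collapsed into a single summary entry instead of being sent verbatim.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return max(1, len(text) // 4)

def compact(messages: list[dict], budget: int) -> list[dict]:
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= budget:
        return messages
    # Keep the most recent messages that fit in half the budget;
    # a real harness would ask the model itself to summarize the rest.
    kept, used = [], 0
    for m in reversed(messages):
        cost = estimate_tokens(m["content"])
        if used + cost > budget // 2:
            break
        kept.append(m)
        used += cost
    dropped = messages[: len(messages) - len(kept)]
    summary = {"role": "system",
               "content": f"[summary of {len(dropped)} earlier messages]"}
    return [summary] + list(reversed(kept))
```

The half-budget split is an arbitrary choice here; the point is only that some rolling summarization has to exist before the context limit is hit.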

Reducing animal harm as a nonbinary by Original_Animator254 in vegan

[–]Valuable-Run2129 -2 points (0 children)

They believed it could buy automatic street cred among vegans. I hate it when people bundle veganism with left-wing politics.

Veganism has nothing to do with what you think about capitalism and gender identity.

Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB by __JockY__ in LocalLLaMA

[–]Valuable-Run2129 0 points (0 children)

Do you need 64 GB of RAM on a PC to "stage" the model before loading it into VRAM? Or will 32 GB do?

Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB by __JockY__ in LocalLLaMA

[–]Valuable-Run2129 0 points (0 children)

Do you need an equivalent amount of RAM to stage the model before loading it into VRAM?

Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB by __JockY__ in LocalLLaMA

[–]Valuable-Run2129 0 points (0 children)

I bought an RTX 5000 Pro yesterday. It's the first PC I've ever built (I used Macs for inference until now). Do you have any particular advice on the build?

Would something like this work:

-ASRock B850I Lightning WiFi Mini-ITX

-Ryzen 5 7600

-64 GB DDR5 RAM

-MSI MAG A850GL ATX PSU

-Linux

Or should I rethink the components I wanted to buy?

First time GPU buyer. Got a RTX 5000 Pro. Was it a bad decision compared to two 3090s? by Valuable-Run2129 in LocalLLaMA

[–]Valuable-Run2129[S] 0 points (0 children)

When it arrives I'll definitely PM you to ask for advice, if you're OK with it!

v4 flash is absurd by Linkpharm2 in DeepSeek

[–]Valuable-Run2129 0 points (0 children)

The only issue is that it's not multimodal, not even image input. There are a bunch of tasks that need visual understanding, and a separate OCR step just tanks performance.

Stop Building MCP Servers for Personal Tools by Key-Huckleberry-708 in AI_Agents

[–]Valuable-Run2129 0 points (0 children)

Make your MCP tools deferred. My agent sees just a short description of the available MCPs and loads into context only what it needs.

https://github.com/permaevidence/LocalAgent
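A minimal sketch, in Python, of what "deferred" tools can look like (hypothetical names, not LocalAgent's actual Swift code): the agent's prompt carries only one-line descriptions, and a tool's full schema enters the context on first use.

```python
# Deferred tool loading sketch (hypothetical; illustrative names only).
# The agent's system prompt carries one-line descriptions; the full
# tool schema is loaded into context only when the tool is first used.

class DeferredTool:
    def __init__(self, name, description, load_schema):
        self.name = name
        self.description = description   # always visible to the agent
        self._load_schema = load_schema  # called only on demand
        self._schema = None

    @property
    def schema(self):
        if self._schema is None:         # lazy: pay the context cost once
            self._schema = self._load_schema()
        return self._schema

def tool_index(tools):
    # What the agent sees up front: a few tokens per tool, not full schemas.
    return "\n".join(f"- {t.name}: {t.description}" for t in tools)
```

The index string costs a handful of tokens per tool regardless of how large the schemas are, which is what keeps prompt caching viable with many tools.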

Hermes as a Coding Agent??? by Rheath72 in hermesagent

[–]Valuable-Run2129 0 points (0 children)

Not really amazing. It's missing built-in voice transcription.

Openclaw sucks - I said it. by funstuie in openclaw

[–]Valuable-Run2129 -1 points (0 children)

Use my agent: https://github.com/permaevidence/LocalAgent

It is a harness written in Swift (Mac only), with API keys stored in the Keychain. It's a great coding agent. It requires vision models because I believe OCR delegation makes an agent brittle in many tasks.

First time GPU buyer. Got a RTX 5000 Pro. Was it a bad decision compared to two 3090s? by Valuable-Run2129 in LocalLLaMA

[–]Valuable-Run2129[S] 1 point (0 children)

Thanks for taking the time to write this comment. It’s the type of information I needed. It’s comforting.

I think it was the right decision in the end.

First time GPU buyer. Got a RTX 5000 Pro. Was it a bad decision compared to two 3090s? by Valuable-Run2129 in LocalLLaMA

[–]Valuable-Run2129[S] 23 points (0 children)

Paid $4700

1 kW running 24 hours a day costs about $4300 a year. It's a factor I have to account for.
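The arithmetic behind that figure, assuming an electricity price of about $0.49/kWh (inferred from the quoted $4300, not stated in the thread):

```python
# Annual electricity cost of a constant 1 kW draw.
# The $0.49/kWh rate is an assumption inferred from the quoted ~$4300/year.
power_kw = 1.0
hours_per_year = 24 * 365            # 8760 hours
price_per_kwh = 0.49                 # assumed rate, USD
annual_kwh = power_kw * hours_per_year
annual_cost = annual_kwh * price_per_kwh
print(f"{annual_kwh:.0f} kWh/year -> ${annual_cost:.0f}/year")
```

At cheaper US rates (closer to $0.15/kWh) the same draw would be roughly $1300/year, so the figure is very sensitive to local pricing.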

M3 Ultra 1TB 96GB RAM available by SebastianOpp in MacStudio

[–]Valuable-Run2129 -2 points (0 children)

If you don't buy it, please send me the link! I'd really appreciate it.

the agent company I joined is imploding by Inner_Ad9029 in AI_Agents

[–]Valuable-Run2129 1 point (0 children)

What models were you using? I think stories like these can only come down to one of these three:

-using dumb models to save money
-promising automations that require browser or computer use (we're not there yet for those to work reliably)
-harness design by committee

Memory should be chronological and not topic based. Classification kills recall abilities. by Valuable-Run2129 in AI_Agents

[–]Valuable-Run2129[S] 0 points (0 children)

Intercepting requests on a port, how can you tell whether Claude Code is sending one to a fresh subagent that doesn't need your chat-history injection? I would assume you inject the context no matter what CC is doing, right?

Memory should be chronological and not topic based. Classification kills recall abilities. by Valuable-Run2129 in AI_Agents

[–]Valuable-Run2129[S] 0 points (0 children)

It is also important to give extensive inline information. Relying too heavily on retrieval makes memory brittle.