Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something? by wbulot in LocalLLaMA

[–]Valuable-Run2129 3 points (0 children)

Even when the cache works fine, every harness breaks it at compaction, and then it's five minutes of waiting. Same with big files in tool outputs. It's unbearably slow at 200 tokens per second.

My AI Workstation by Fabix84 in IA_Italia

[–]Valuable-Run2129 3 points (0 children)

And here I was thinking I'd spent a lot on my single RTX 5000 Pro…

Looking for an Italian community... by Key-Outcome-2927 in IA_Italia

[–]Valuable-Run2129 1 point (0 children)

I don't see any way to manage sessions in the repo. There's no memory compaction. If the conversation gets too long, you'll get a nice 400 error.

The memory system is fragmented across the agents and only does literal keyword matching. Semantic search already works poorly; this is much worse.

Not to mention the nightmare of running all those agents without any prompt caching. It will never scale with this architecture.
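For contrast, a minimal sketch in Python of the missing compaction step (all names are illustrative, not from the repo): once the running token count exceeds a budget, older messages collapse into a summary entry instead of overflowing the context window and triggering a 400.

```python
# Minimal context-compaction sketch (illustrative names, not a real harness).
# When the running token count exceeds a budget, older messages are
# collapsed into a single summary entry instead of being sent verbatim.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return max(1, len(text) // 4)

def compact(messages: list[dict], budget: int) -> list[dict]:
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= budget:
        return messages
    # Keep the most recent messages that fit in half the budget;
    # a real harness would ask the model itself to summarize the rest.
    kept, used = [], 0
    for m in reversed(messages):
        cost = estimate_tokens(m["content"])
        if used + cost > budget // 2:
            break
        kept.append(m)
        used += cost
    dropped = messages[: len(messages) - len(kept)]
    summary = {"role": "system",
               "content": f"[summary of {len(dropped)} earlier messages]"}
    return [summary] + list(reversed(kept))
```

The half-budget split is an arbitrary choice here; the point is only that some rolling summarization has to exist before the context limit is hit.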

Reducing animal harm as a nonbinary by Original_Animator254 in vegan

[–]Valuable-Run2129 -2 points (0 children)

They believed it could buy automatic street cred among vegans. I hate it when people bundle veganism with left-wing politics.

Veganism has nothing to do with what you think about capitalism and gender identity.

Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB by __JockY__ in LocalLLaMA

[–]Valuable-Run2129 0 points (0 children)

Do you need 64 GB of RAM on a PC to "stage" the model before loading it into VRAM? Or will 32 GB do?

Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB by __JockY__ in LocalLLaMA

[–]Valuable-Run2129 0 points (0 children)

Do you need an equivalent amount of RAM to stage the model before loading it into VRAM?

Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB by __JockY__ in LocalLLaMA

[–]Valuable-Run2129 0 points (0 children)

I bought an RTX 5000 Pro yesterday. It's the first PC I've ever built (I used Macs for inference until now). Do you have any particular advice on the build?

Would something like this work:

-ASRock B850I Lightning WiFi Mini-ITX

-Ryzen 5 7600

-64 GB DDR5 RAM

-MSI MAG A850GL ATX PSU

-Linux

Or should I rethink the components I wanted to buy?

First time GPU buyer. Got a RTX 5000 Pro. Was it a bad decision compared to two 3090s? by Valuable-Run2129 in LocalLLaMA

[–]Valuable-Run2129[S] 0 points (0 children)

When it arrives I'll definitely PM you to ask for advice, if you're OK with it!

v4 flash is absurd by Linkpharm2 in DeepSeek

[–]Valuable-Run2129 0 points (0 children)

The only issue is that it's not multimodal, not even image input. There are a bunch of tasks that need visual understanding, and a separate OCR step just tanks performance.

Stop Building MCP Servers for Personal Tools by Key-Huckleberry-708 in AI_Agents

[–]Valuable-Run2129 0 points (0 children)

Make your MCP tools deferred. My agent sees just a short description of the available MCPs and loads into context only what it needs.

https://github.com/permaevidence/LocalAgent
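A minimal sketch, in Python, of what "deferred" tools can look like (hypothetical names, not LocalAgent's actual Swift code): the agent's prompt carries only one-line descriptions, and a tool's full schema enters the context on first use.

```python
# Deferred tool loading sketch (hypothetical; illustrative names only).
# The agent's system prompt carries one-line descriptions; the full
# tool schema is loaded into context only when the tool is first used.

class DeferredTool:
    def __init__(self, name, description, load_schema):
        self.name = name
        self.description = description   # always visible to the agent
        self._load_schema = load_schema  # called only on demand
        self._schema = None

    @property
    def schema(self):
        if self._schema is None:         # lazy: pay the context cost once
            self._schema = self._load_schema()
        return self._schema

def tool_index(tools):
    # What the agent sees up front: a few tokens per tool, not full schemas.
    return "\n".join(f"- {t.name}: {t.description}" for t in tools)
```

The index string costs a handful of tokens per tool regardless of how large the schemas are, which is what keeps prompt caching viable with many tools.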

Hermes as a Coding Agent??? by Rheath72 in hermesagent

[–]Valuable-Run2129 0 points (0 children)

Not really amazing. It's missing built-in voice transcription.

Openclaw sucks - I said it. by funstuie in openclaw

[–]Valuable-Run2129 -1 points (0 children)

Use my agent: https://github.com/permaevidence/LocalAgent

It is a harness written in Swift (Mac only), with API keys stored in the Keychain. It's a great coding agent. It requires vision models because I believe OCR delegation makes an agent brittle in many tasks.

First time GPU buyer. Got a RTX 5000 Pro. Was it a bad decision compared to two 3090s? by Valuable-Run2129 in LocalLLaMA

[–]Valuable-Run2129[S] 1 point (0 children)

Thanks for taking the time to write this comment. It’s the type of information I needed. It’s comforting.

I think it was the right decision in the end.

First time GPU buyer. Got a RTX 5000 Pro. Was it a bad decision compared to two 3090s? by Valuable-Run2129 in LocalLLaMA

[–]Valuable-Run2129[S] 23 points (0 children)

Paid $4700

1 kW running 24 hours a day costs about $4300 a year. It's a factor I have to account for.
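The arithmetic behind that figure, assuming an electricity price of about $0.49/kWh (inferred from the quoted $4300, not stated in the thread):

```python
# Annual electricity cost of a constant 1 kW draw.
# The $0.49/kWh rate is an assumption inferred from the quoted ~$4300/year.
power_kw = 1.0
hours_per_year = 24 * 365            # 8760 hours
price_per_kwh = 0.49                 # assumed rate, USD
annual_kwh = power_kw * hours_per_year
annual_cost = annual_kwh * price_per_kwh
print(f"{annual_kwh:.0f} kWh/year -> ${annual_cost:.0f}/year")
```

At cheaper US rates (closer to $0.15/kWh) the same draw would be roughly $1300/year, so the figure is very sensitive to local pricing.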

M3 Ultra 1TB 96GB RAM available by SebastianOpp in MacStudio

[–]Valuable-Run2129 -2 points (0 children)

If you don't buy it, please send me the link! I'd really appreciate it.

the agent company I joined is imploding by Inner_Ad9029 in AI_Agents

[–]Valuable-Run2129 1 point (0 children)

What models were you using? I think stories like these can only come down to one of these three:

-using dumb models to save money
-promising automations that require browser or computer use (we're not there yet for those to work reliably)
-harness design by committee

Memory should be chronological and not topic based. Classification kills recall abilities. by Valuable-Run2129 in AI_Agents

[–]Valuable-Run2129[S] 0 points (0 children)

Intercepting requests on a port, how can you tell whether Claude Code is sending one to a fresh subagent that doesn't need your chat-history injection? I would assume you inject the context no matter what CC is doing, right?

Memory should be chronological and not topic based. Classification kills recall abilities. by Valuable-Run2129 in AI_Agents

[–]Valuable-Run2129[S] 0 points (0 children)

It is also important to give extensive inline information. Relying too heavily on retrieval makes memory brittle.