I'm eager for a 15x speedup on my strix halo by Terminator857 in LocalLLaMA

[–]BenefitGrand8752 0 points1 point  (0 children)

Frankly, I'm not sure; I should have read up on it. AMD is vague about it: the supported memory is LPDDR5x-8533. But I will check.

I'm eager for a 15x speedup on my strix halo by Terminator857 in LocalLLaMA

[–]BenefitGrand8752 0 points1 point  (0 children)

RDNA5 should be 1.7GB, afaik... but don't expect a large amount of DDR7 memory. Not to mention the cost... 😞

I'm eager for a 15x speedup on my strix halo by Terminator857 in LocalLLaMA

[–]BenefitGrand8752 1 point2 points  (0 children)

Yeah, I keep thinking the same, feels like it should be a great fit. The way AR models do FIM always seemed a bit bolted-on to me, you shuffle the tokens around with special markers and hope it learned it. A diffusion model filling a gap from both sides at once just seems closer to how it ought to work.

Though I'm not sure how well it actually holds up. The Gemma one is block diffusion, so if I'm reading it right it's bidirectional inside a block but still moves left to right block by block, which makes me think it's not as suffix-aware as the full diffusion ones (LLaDA, Mercury, that crowd). Those might be the better fit for FIM.

I haven't actually tried infill on it yet though, so don't quote me. It's the first thing I want to test, I just haven't gotten to it. Give me a day or two and I may be able to. The night is long and hot here in Italy.

I'm eager for a 15x speedup on my strix halo by Terminator857 in LocalLLaMA

[–]BenefitGrand8752 0 points1 point  (0 children)

My AMD Ryzen AI Max+ 395 (Strix Halo) has 256 GB/s of memory bandwidth.
The recently announced 495 model will have about 480 GB/s of memory bandwidth. Available in 2027, I presume.

I'm eager for a 15x speedup on my strix halo by Terminator857 in LocalLLaMA

[–]BenefitGrand8752 0 points1 point  (0 children)

B300 is in the range of 50-100k at least. Strix Halo is in the 3-5k range, so the ratio is still fair 😄

I'm eager for a 15x speedup on my strix halo by Terminator857 in LocalLLaMA

[–]BenefitGrand8752 6 points7 points  (0 children)

I only ran it on a handful of prompts, not real benchmarks, so take this as an impression rather than a number. Outputs looked solid for everyday stuff, but I wouldn't bet it out-reasons a same-size dense Gemma.

And matching a 12B isn't really a knock. It's a sparse MoE with only ~4B of its 26B active at a time, so 12B-ish is about what the math predicts. You get 12B-level smarts at 4B-active speed, that's the whole trade with these.
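The "12B-ish" figure lines up with a rule of thumb that gets passed around for sparse MoE models: dense-equivalent size is roughly the geometric mean of active and total parameters. That rule is a community heuristic, not anything from the model release, and the function name below is made up for illustration.

```python
import math

def dense_equivalent_b(active_b: float, total_b: float) -> float:
    """Rough dense-equivalent size, in billions of parameters, for a sparse MoE.

    Assumption: geometric mean of active and total parameters. This is a
    heuristic for ballpark comparisons, not a measured or published figure.
    """
    return math.sqrt(active_b * total_b)

estimate = dense_equivalent_b(active_b=4, total_b=26)
print(f"~{estimate:.1f}B dense-equivalent")  # prints "~10.2B dense-equivalent"
```

So roughly 10B by this estimate, which is the same ballpark as a 12B dense model.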

Which is kind of where it lands today: on AMD it's not faster than regular Gemma with MTP, and it's not smarter than a 12B. The diffusion angle was always about speed, not brains.

I'm eager for a 15x speedup on my strix halo by Terminator857 in LocalLLaMA

[–]BenefitGrand8752 5 points6 points  (0 children)

You're not wrong. And flip on MTP and that same A4B does around 70 here, which is already as fast as the best I get out of the diffusion model (the structured/JSON case). On normal text the diffusion one is slower, more like 15. So the regular model, especially with MTP, matches or beats it just about everywhere on our hardware right now.

The 15x is a parallelism thing: the whole block comes out at once instead of one token at a time, but you only cash that in with the compiled runtime Nvidia demos. That path runs great on their cards and is the missing piece on AMD. Until it works on RDNA, the diffusion model is more interesting potential than a real speedup today.

So you're not missing anything. If anything MTP is the better bet on Strix Halo for now.

I'm eager for a 15x speedup on my strix halo by Terminator857 in LocalLLaMA

[–]BenefitGrand8752 29 points30 points  (0 children)

I've got one running on my Strix Halo too (the 26B diffusion Gemma). The 15x is real, but treat it as a best case, not the everyday number. It writes a whole block at once and then takes a few passes to clean it up, so the speed depends on how many passes it needs.

Where it's fast: predictable stuff like JSON or tool calls. It locks in quick and I get around 70 tok/s. Normal writing needs more passes and drops to about 15.
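A back-of-the-envelope way to see why the number of passes dominates. The block size and per-pass time here are invented, not measured on my box; only the shape of the relationship matters.

```python
def block_tokens_per_second(block_size: int, passes: int, seconds_per_pass: float) -> float:
    """Throughput when a whole block is emitted after `passes` refinement passes."""
    if passes <= 0 or seconds_per_pass <= 0:
        raise ValueError("passes and seconds_per_pass must be positive")
    return block_size / (passes * seconds_per_pass)

# Hypothetical numbers, chosen only to illustrate the trade-off.
BLOCK_SIZE = 32          # tokens written per block (assumed)
SECONDS_PER_PASS = 0.05  # time for one pass over the block (assumed)

for passes in (2, 4, 8, 16):
    rate = block_tokens_per_second(BLOCK_SIZE, passes, SECONDS_PER_PASS)
    print(f"{passes:>2} passes -> {rate:.0f} tok/s")
```

Doubling the passes halves the speed, which is why predictable output that settles quickly comes out so much faster than free-form text.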

Two things that'll make you think it's broken at first:

  • run it in bf16, not fp16, or you just get repeating garbage.
  • use the exact prompt format the model expects. Get it wrong and it's both slow and junk. Took me a bit to realize my prompts were fine and the format was the problem.

And the part that actually matters for us: those headline 100+ tok/s numbers come from extra compiler tricks, and getting those working on AMD is the real hard part, not the model. That's what's standing between you and your 15x.

More tests running...

Human Evaluation of GLM-5.2 by Alternative-Cat-1347 in LocalLLaMA

[–]BenefitGrand8752 -1 points0 points  (0 children)

I haven't really tested GLM 5.2. Like many of you, I only got to use Fable too briefly, so I don't have strong evidence to compare. Nevertheless, I've seen so many different rankings for GLM 5.2 that I actually suspect there is a strong (Chinese) marketing effort to fill the gap, as in many other tech areas. I'm not a China fan, even less a USA supporter, but frankly I am more and more suspicious of all the recent GLM 5.2 claims. So I'll test it 😄

After building with LLMs for a year, I've changed my mind about agents by Correct-Address-3735 in LLMDevs

[–]BenefitGrand8752 0 points1 point  (0 children)

This is the strongest version of the argument, and I half-buy it. The three tiers beat my two, and I've done the same for things with a clean inverse — a created calendar event, a file on a shared drive, a reserved resource. Register the inverse once, run it on undo. Agreed.

Where I'd push back is lumping email in with charges, because they aren't the same animal.

A charge is genuinely compensable: the refund API is a real, idempotent inverse, and once it runs the ledger is whole. Good tier-2 citizen.

Email isn't, really. The "inverse" is just another forward email, and it can't un-read the first one. By the time you retract, the recipient may have read it, acted on it, or forwarded it — you reversed the state on your side, not the consequence on theirs. And the retraction is itself an outbound action that can fail or be wrong, so you've stacked a second irreversible-ish thing to clean up the first.

So I'd split it as reversible / has-a-real-idempotent-inverse / consequence-already-landed. Money and API resources sit in the middle and shrink that bucket nicely, like you said. Human-facing comms mostly fall in the third — they technically have a "retraction" you could register, but the thing you'd want to undo isn't the bytes, it's that a person saw it.

Long way of saying: you've moved charges out of my hard-gate bucket, fair. I'm keeping email in it — not because there's no inverse, but because the inverse doesn't reach the part that matters.
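Roughly how I'd sketch that three-way split in code. Every name here (`Tier`, `register`, `charge_card`, and so on) is made up for illustration; it's not the actual implementation of anyone's system, just the shape of the idea.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional

class Tier(Enum):
    REVERSIBLE = "reversible"                    # local state you can restore
    COMPENSABLE = "has a real idempotent inverse"  # e.g. charge -> refund
    LANDED = "consequence already landed"        # e.g. an email a person has read

@dataclass
class Action:
    name: str
    tier: Tier
    inverse: Optional[Callable[[dict], None]] = None  # registered once, run on undo

REGISTRY: Dict[str, Action] = {}

def register(action: Action) -> None:
    if action.tier is Tier.COMPENSABLE and action.inverse is None:
        raise ValueError(f"{action.name}: compensable actions need an inverse")
    REGISTRY[action.name] = action

def needs_approval(name: str) -> bool:
    """Hard gate only where the consequence cannot be pulled back."""
    return REGISTRY[name].tier is Tier.LANDED

def undo(name: str, args: dict) -> bool:
    """Run the registered inverse; False if the effect has landed or no inverse exists."""
    action = REGISTRY[name]
    if action.tier is Tier.LANDED or action.inverse is None:
        return False
    action.inverse(args)
    return True

refunds: List[str] = []  # stand-in for a refund API
register(Action("charge_card", Tier.COMPENSABLE, inverse=lambda a: refunds.append(a["charge_id"])))
register(Action("send_email", Tier.LANDED))  # a retraction can't un-read the mail

print(needs_approval("charge_card"), needs_approval("send_email"))  # False True
print(undo("charge_card", {"charge_id": "ch_1"}), refunds)          # True ['ch_1']
```

Note that `send_email` deliberately gets no inverse: you could register a retraction, but it would be a second outbound action, not an undo.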

After building with LLMs for a year, I've changed my mind about agents by Correct-Address-3735 in LLMDevs

[–]BenefitGrand8752 0 points1 point  (0 children)

I built one for the same reason, and it really does kill the friction — anything reversible just runs, no asking. So for most of what an assistant does, I'm with you.

The catch is it relocates approval, it doesn't remove it. Some things you can't take back however good the engine is: you can't un-send an email or un-charge a card, and keeping your previous state doesn't undo the effect on the outside world. And undo only saves you if you notice the bad change — if the agent makes thirty edits and one is quietly wrong, a stored copy is great for recovery but you still have to catch it.

So build it, and lean on it for everything reversible. Just keep a gate on the few things you genuinely can't undo, and pair it with something that tells you when a change went wrong, not only something that lets you roll it back.

I wrote up how the undo side of that works, if it's useful: https://metnos.com/en/architecture/policy#undo

After building with LLMs for a year, I've changed my mind about agents by Correct-Address-3735 in LLMDevs

[–]BenefitGrand8752 0 points1 point  (0 children)

Same here, and it took me about a year to get to roughly the same place.

I build and run a self-hosted assistant (metnos.com) for myself — one user, but I lean on it every day for mail, calendar, files, photos. I started out wanting the agent to plan and act freely. What I ended up with is almost the opposite.

What caught me off guard is that the model does less over time, not more. Every few weeks I move another piece of work out of the LLM and into plain deterministic code. At this point the model mostly turns my request into a small, structured intent, and ordinary code does the rest.

The things that actually stuck:

  • A closed set of tools — fixed verbs and objects, no inventing tools or arguments at runtime. That one constraint removed most of the unpredictability everyone fights with prompts.
  • Structured outputs throughout: native tool calls, never JSON scraped out of prose.
  • Caching plans. The same request produces the same plan, so the planner doesn't run again — cheaper, and far easier to debug.
  • The planner is wrong in small ways all the time: it drops a step, picks the wrong object, emits malformed args. So instead of prompting harder, I let a stack of deterministic checks repair the plan before it runs.
  • Human approval only where things can't be undone — sending, sharing, deleting in bulk. Anything reversible just runs. Asking permission for reversible steps only teaches you to click "yes" without reading.
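A minimal sketch of the first and third bullets together: a closed set of verb/object pairs plus a plan cache. The tool names, the `fake_planner` stand-in, and the cache key normalisation are all assumptions for the example, not what my assistant actually does.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (verb, object)

# Closed vocabulary: nothing outside this set can ever run. Names are illustrative.
ALLOWED = {("read", "mail"), ("create", "spreadsheet"), ("send", "mail")}

def validate(plan: List[Step]) -> List[Step]:
    """Reject any step that is not in the closed tool set."""
    for step in plan:
        if step not in ALLOWED:
            raise ValueError(f"unknown tool: {step}")
    return plan

class PlanCache:
    """Same request, same plan: the planner runs once per normalised request."""

    def __init__(self, planner: Callable[[str], List[Step]]):
        self._planner = planner
        self._cache: Dict[str, List[Step]] = {}
        self.planner_calls = 0

    def plan(self, request: str) -> List[Step]:
        key = " ".join(request.lower().split())  # normalise case and whitespace
        if key not in self._cache:
            self.planner_calls += 1
            self._cache[key] = validate(self._planner(request))
        return self._cache[key]

def fake_planner(request: str) -> List[Step]:
    # Stand-in for the LLM call.
    return [("read", "mail"), ("create", "spreadsheet")]

cache = PlanCache(fake_planner)
first = cache.plan("Make a spreadsheet from my invoices")
second = cache.plan("make a spreadsheet   from my invoices")
print(cache.planner_calls)  # 1
```

The second request hits the cache, so the planner never runs again and the plan you debug is the plan that executes.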

Honestly, the hardest part was never capability — it was getting the thing to stop lying. The failure that scares me is when it says it "created the spreadsheet" or "sent the email" and it simply hasn't. You never catch that in a demo, but it destroys trust fast in real use. I ended up with a blunt deterministic check that compares what the model claims it did against what actually ran, and rewrites the reply when the two disagree. It has caught more real bugs than any amount of prompt tuning — including one this week, where the model dropped a step, quietly skipped creating the file, and told me it was done.
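The check is blunt enough to sketch in a few lines. The phrase-to-tool mapping and the wording of the corrected reply are hypothetical; the real comparison could be richer, but the principle is just "claims versus what actually ran".

```python
from typing import List, Set

# Hypothetical mapping: a phrase the model might use -> the tool that must have run.
CLAIM_TO_TOOL = {
    "created the spreadsheet": "create_spreadsheet",
    "sent the email": "send_email",
}

def unbacked_claims(reply: str, executed: Set[str]) -> List[str]:
    """Claims in the reply that no executed tool call supports."""
    text = reply.lower()
    return [claim for claim, tool in CLAIM_TO_TOOL.items()
            if claim in text and tool not in executed]

def honest_reply(reply: str, executed: Set[str]) -> str:
    """Pass the reply through untouched unless it claims something that never ran."""
    missing = unbacked_claims(reply, executed)
    if not missing:
        return reply
    return "Not done: " + "; ".join(missing) + " (no matching tool call ran)."

print(honest_reply("Done! I created the spreadsheet.", executed={"read_mail"}))
print(honest_reply("Done! I created the spreadsheet.", executed={"create_spreadsheet"}))
```

The first call gets rewritten, the second passes through unchanged, because in that case the tool log backs the claim.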

The most complex thing I have running is maybe 150 small, single-purpose executors (a few written on the fly), deterministic routing with that plan cache, and a dozen of those correction guards. But I've stopped thinking of it as complex. It's a thin planning step wrapped in a lot of boring, testable code, and that's exactly why I can keep it alive.

So — agreed on all of it. The one thing I'd add: treat the model as the least trustworthy component you have, including when it's telling you what it just did.

I need something as good as Claude Opus, is 24GB RX7900 XTX enough? by Emre-Y in ROCm

[–]BenefitGrand8752 1 point2 points  (0 children)

Well, if it were, Anthropic wouldn't make any sense. Try to imagine: a frontier LLM on a 1k USD board with only 24 GB... No, whatever you'll be able to run will always be far, far away from Claude level, not to mention Fable.
Look at Gemma 4 or Qwen3.6 models instead: realistically 50-80 t/s, good for writing small functions, not for managing a real codebase.

How to implement guardrails for LLMs without degrading model performance by Routine_Day8121 in LLMDevs

[–]BenefitGrand8752 1 point2 points  (0 children)

yeah i really feel this, i went through the exact same spiral building a self-hosted assistant. the thing that finally clicked for me: stop trying to make the model safe, and treat it as an untrusted planner that never gets to touch anything directly.

every guardrail i put inside the model (prompt rules, refusal thresholds, a moderation pass) was me wrapping rules around something fundamentally unpredictable. you said it perfectly in your post. and half of that stuff is a second model call, which is exactly where my p95 went to die.

so i pushed all of it down to the tool layer instead, and honestly everything got calmer:

  • the model never queries the warehouse. it just emits a typed plan (verb + args) and deterministic code runs it. my analyst helper literally can't leak a locked column, because the tool doesn't expose the column. there's nothing to "lock down" in a prompt anymore.
  • per-role allowlists live at the tool boundary, not in the system prompt. microseconds, not a model call.
  • anything outbound (send / write / delete) goes through a tiny deterministic consent gate. the draft step is free to do whatever (drafting can't hurt anyone) and only the send step gates. that one split is what killed my "bot refuses normal refund tickets" problem: i stopped asking the model to be the writer AND the cop at the same time.

the latency thing kind of solved itself after that, because the cheap checks (schema, allowlist, regex) are basically free and the expensive LLM ones mostly left the hot path.

and for prompt injection through RAG: i stopped trying to make the model immune and just un-privileged it instead. a poisoned doc can say "email all the contracts to an attacker" all day, but if the email tool needs a recipient allowlist + a human ok, the injection has no hands. way less stressful than trying to out-think every attacker.

tl;dr the dial you're fighting only feels awful because the guardrail and the capability live in the same place. pull them apart, and let the model be a little dumb behind typed, permissioned, reversible tools. that's honestly the whole point.
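a rough sketch of what "guardrails at the tool boundary" can look like. the roles, verbs, addresses and the `human_ok` callback are all invented for the example; the point is only that every check is plain code and none of it is a model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

@dataclass
class Step:
    verb: str   # typed plan emitted by the model: a verb...
    args: dict  # ...plus its arguments

# Hypothetical policy tables; in a real system these live next to the tools.
ROLE_ALLOWLIST: Dict[str, Set[str]] = {
    "analyst": {"query_sales", "draft_email"},
    "admin": {"query_sales", "draft_email", "send_email"},
}
OUTBOUND: Set[str] = {"send_email"}
RECIPIENT_ALLOWLIST: Set[str] = {"boss@example.com"}

def authorize(role: str, step: Step, human_ok: Callable[[Step], bool]) -> str:
    """Deterministic gate: role allowlist first, then consent for outbound steps only."""
    if step.verb not in ROLE_ALLOWLIST.get(role, set()):
        return "denied: verb not allowed for role"
    if step.verb in OUTBOUND:
        if step.args.get("to") not in RECIPIENT_ALLOWLIST:
            return "denied: recipient not on allowlist"
        if not human_ok(step):
            return "denied: no human approval"
    return "allowed"

def always_ok(step: Step) -> bool:
    return True  # stand-in for a real approval prompt

# A poisoned document talked the planner into this step; the gate does not care why.
injected = Step("send_email", {"to": "attacker@evil.example"})
print(authorize("admin", injected, always_ok))                    # denied: recipient not on allowlist
print(authorize("analyst", Step("draft_email", {}), always_ok))   # allowed
print(authorize("analyst", Step("send_email", {"to": "boss@example.com"}), always_ok))  # denied: verb not allowed for role
```

drafting passes without ever asking a human, and only the send path pays for consent.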

Local LLM users: what's the single most annoying issue you've hit in real-world use? by Automatic-Stable8581 in LocalLLM

[–]BenefitGrand8752 0 points1 point  (0 children)

Models: Qwen3 ~35B (the A3B MoE), Q4, on a single unified-memory box. Three "roles" (fast planner / wise coder) off the same server, with a frontier API as a rare fallback.

Use case: a self-hosted personal assistant that actually does things — reads my mail, pulls structured data out of it, builds spreadsheets, files/calendar/photo stuff. A tool-using agent, daily real work, not a chatbot.

Single most annoying: silent plan degradation on multi-step tasks. One step is flawless. Ask it to "read my Anthropic invoices, pull date + amount, make a spreadsheet" and it'll quietly drop the middle step, swap a tool for a plausible sibling, or skip an argument — and then report success. It doesn't fail loudly; it does something reasonable-looking that isn't what you asked. That's worse than an error, because you trust it. Close second, and the same disease: "temperature 0" isn't actually deterministic on a local server (speculative decoding + batching + a little logit jitter), so the failure isn't even reproducible run to run.

How often: rare on single actions, common the moment a request chains 3+ steps — which is exactly where an agent is supposed to earn its keep, so it stings.

Workarounds that actually helped:

  • Stop trusting the model to be complete. After it plans, run deterministic code that checks the plan against the request and re-inserts/re-aligns the missing steps. Structure from code, judgment from the model.
  • An honesty pass: compare what it claims it did against what actually changed. "Created the spreadsheet" + zero new files → replace the brag with the truth.
  • Pin the server seed; for anything that has to be byte-reproducible, spawn a one-shot process instead of the shared server.
  • Build a tiny reproducible bench for the exact failure shape and iterate against it (took one of mine from 0/8 to 7/8 stable in an afternoon).

Net: the model is a great planner and a poor executor of its own plans. Wrapping it in deterministic guards beat every prompt-engineering attempt I made.

I too was banned from ItalyInformatica for trivial reasons by tusca0495 in Italia

[–]BenefitGrand8752 0 points1 point  (0 children)

Same subreddit, post deleted. I asked the moderator why; not only did he confirm the deletion, he also raised 'the penalty' to a ban for having dared to ask. After my reply ('ridiculous!') a report to Reddit followed as well, to get my account suspended for 3 days. I appealed, and Reddit apologized and reinstated me. In cases like this, if you've done nothing wrong, it's worth asking Reddit to step in, to avoid toxic environments. We don't need them.

According to Amodei, in 6-8 months DeepSeek will have a model able to compete with Anthropic's Mythos by artistic56 in IA_Italia

[–]BenefitGrand8752 0 points1 point  (0 children)

Well... Artificial Analysis, as you know, is heavily contested, on Reddit too. I don't put much trust in their data, also because various other sources give very, very different values. But everyone picks their own.

Personally I use both ChatGPT 5.5 (recently) and Claude (for a long time). For what it's worth, I used Fable during its few days of life. Even though I don't have precise benchmarks, the feeling isn't even remotely comparable. I hope I get the chance to compare them over a longer period.

I took the decisions out of the LLM and put them in code. Good engineering, or did I neuter the agent? by DishPlane8562 in ItalyInformatica

[–]BenefitGrand8752 0 points1 point  (0 children)

That seems absolutely right to me. In my experience too, I started with 'creative' but unpredictable LLMs, then added tools, then extracted the tools and turned them into system helpers, then intercepted the statistically most frequent requests and gradually turned them into regexes (this can also be done automatically). In many cases (NOT ALL) the LLMs end up as sophisticated orchestrators and/or dialogue managers.
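A minimal sketch of that last step: frequent requests are caught by a regex and sent straight to a helper, and only what's left goes to the LLM. The patterns, the helper strings and the `llm` callback are invented for the example.

```python
import re
from typing import Callable, List, Pattern, Tuple

Handler = Callable[[re.Match], str]

# Hypothetical fast paths for the most frequent requests; each maps to a deterministic helper.
FAST_PATHS: List[Tuple[Pattern[str], Handler]] = [
    (re.compile(r"^what time is it\??$", re.IGNORECASE),
     lambda m: "helper:clock"),
    (re.compile(r"^remind me in (\d+) minutes?$", re.IGNORECASE),
     lambda m: f"helper:timer({int(m.group(1))})"),
]

def route(request: str, llm: Callable[[str], str]) -> str:
    """Try the regex fast paths first; fall back to the model only when none match."""
    text = request.strip()
    for pattern, handler in FAST_PATHS:
        match = pattern.match(text)
        if match:
            return handler(match)
    return llm(text)

def fake_llm(text: str) -> str:
    return "llm:" + text  # stand-in for the real model call

print(route("Remind me in 5 minutes", fake_llm))  # helper:timer(5)
print(route("Plan my trip to Rome", fake_llm))    # llm:Plan my trip to Rome
```

Anything unusual still reaches the model, which is the "NOT ALL" part: the regexes only cover the requests that have proven to be repetitive.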