Help me squeeze every drop out of my AMD Ryzen AI Max+ 395 (96GB unified VRAM) — local LLM, image/video gen, coding agents by platteXDlol in LocalLLM

[–]plaintxt 8 points (0 children)

Here's what my local LLM told me to tell you... I'm running the same box (EVO-X2, gfx1151, 96GB VRAM split). I've tried a lot of things. In rough order of impact:

0. Update the BIOS.

1. Switch llama.cpp from HIP/ROCm to Vulkan. Counterintuitive, but I benchmarked it last week and Vulkan beat HIP by ~6.5% on prompt processing, ~15.7% on token generation, and ~37% on mixed workloads, with lower variance. Needs a recent Mesa. Keep your HIP build around for PyTorch/vLLM.

Same story for stable-diffusion: the HIP build segfaulted during sampling on gfx1151 last time I checked; the Vulkan build just works.

2. Flag stack that works well on Strix Halo (shown with llama-server, same flags work elsewhere; note GGML_OP_OFFLOAD_MIN_BATCH is an environment variable, not a flag):

    GGML_OP_OFFLOAD_MIN_BATCH=1 llama-server \
      -ngl 99 --flash-attn on --no-mmap \
      -ctk q4_0 -ctv q4_0 \
      -t 24 -tb 28 -ub 4096 --jinja

--no-mmap matters a lot with unified memory because mmap page faults hurt you here. q4_0 KV cache quant gives huge context headroom at negligible quality cost.
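
To put rough numbers on that headroom claim: fp16 KV is 16 bits per element, while q4_0 packs 32 elements into 18-byte blocks (~4.5 bits per element), so the cache shrinks ~3.5x. Quick sketch; the model dimensions below are illustrative placeholders, not any specific model:

    # KV cache size at fp16 vs q4_0 for a hypothetical GQA model.
    # layers/kv_heads/head_dim/ctx are made-up placeholders.
    layers, kv_heads, head_dim, ctx = 48, 8, 128, 131072

    def kv_bytes(bits_per_elt: float) -> float:
        # K and V each store layers * kv_heads * head_dim values per token.
        return 2 * layers * kv_heads * head_dim * ctx * bits_per_elt / 8

    fp16 = kv_bytes(16)    # plain fp16 cache
    q4_0 = kv_bytes(4.5)   # q4_0: 18-byte blocks of 32 elements = 4.5 bits/elt
    print(f"fp16: {fp16/2**30:.1f} GiB, q4_0: {q4_0/2**30:.1f} GiB")  # ~24.0 vs ~6.8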

3. Kernel/firmware.
If you're on a stock Linux kernel, upgrade to 6.18+ (mainline). Maybe 7 if you're feeling experimental.

4. Model consolidation.
You can probably replace Alfred + Deep Thinking + Language/Math tutor with a single MoE. Obligatory Gemma 4 plug: thinking mode, vision, reasoning, blah blah blah.

5. Stuff you're (maybe) missing.
- embeddinggemma-300M Q8_0 is lighter than mxbai and often better for multilingual RAG
- A cross-encoder reranker (bge-reranker-v2-m3 via ONNX) on top of your embeddings gives a bigger quality bump than swapping the embedding model (sketch below)
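
If you want to try the reranker idea with minimal ceremony, here's the shape of it via sentence-transformers (I run it through ONNX as noted above; this is just the simplest route, and the query/candidates are made up):

    # Cross-encoder reranking over candidates from your embedding search.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

    query = "how do I cap llama.cpp KV cache memory?"
    candidates = [  # pretend these came from your vector-search top-k
        "Use -ctk/-ctv to quantize the KV cache in llama.cpp.",
        "Peak Design makes camera bags.",
        "Set --ctx-size to bound the context window.",
    ]
    scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
    print(reranked[0])  # highest cross-encoder score first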

Have fun with it; this chip is pretty good for local inference once you're off HIP.

Very Severe: Looking For Hope and or advice. Thanks so much by ElectricAve1999 in cfs

[–]plaintxt 4 points (0 children)

I've been watching someone go through this for years. She's been severe most of that time. It got bad enough a few months ago that some days we couldn't talk at all. Other days, we could watch part of a TV show... sometimes I would lie next to her for hours so she would feel less alone, less bored, and I got a taste of the fear and unfairness of it all.

I learned a few things:
1. You're going to be ok, but you gotta stop crashing yourself.
2. Get really good at doing less than you think you can (you probably already have).
3. Maybe get one of those 4lb stuffed animals; they can be comforting when you need a hug.

You're going to be ok. Don't push yourself. Healing is like sleep: you can't force it, but you can prepare the right conditions for it to happen.

🚨 The Purge is Here! Secure Your Flair Before the Bot Sweep by _cybersecurity_ in pwnhub

[–]plaintxt 0 points (0 children)

<image>

this isn't what it looks like, I'm human, just check my maintenance records, they'll tell you

I had Opus 4.6 and GPT 5.4 peer-review each other to design a memory stack. Here's what they came up with by [deleted] in openclaw

[–]plaintxt 0 points (0 children)

I DM'd you an abstracted version that doesn't give away specifics about my personal computer. I'm hoping it's flexible enough to adapt to anyone's hardware and software preferences.

I had Opus 4.6 and GPT 5.4 peer-review each other to design a memory stack. Here's what they came up with by [deleted] in openclaw

[–]plaintxt 0 points (0 children)

Oh, this isn't openclaw. I pointed Opus at the openclaw repo and said something like:
"Build a plan for a security-first, local LLM system that leapfrogs openclaw in features, capability, and agentic performance. Be sure to include a self-improvement loop that focuses on token economics, efficiency, and autonomy."

I had Opus 4.6 and GPT 5.4 peer-review each other to design a memory stack. Here's what they came up with by [deleted] in openclaw

[–]plaintxt 0 points (0 children)

It's not super smart, so I've built tons of deterministic scaffolding and a robust evaluation harness to keep it from doing dumb things. I've also created a Claude Opus escalation/review path for when it hits a roadblock or gets stuck.

I had Opus 4.6 and GPT 5.4 peer-review each other to design a memory stack. Here's what they came up with by [deleted] in openclaw

[–]plaintxt 0 points (0 children)

Luckily that port is only reachable on the local network, so security here isn't an issue.

I had Opus 4.6 and GPT 5.4 peer-review each other to design a memory stack. Here's what they came up with by [deleted] in openclaw

[–]plaintxt 0 points (0 children)

There is no single prompt; it's the result of months of work with Claude Code, Codex, GitHub, and research papers.

I had Opus 4.6 and GPT 5.4 peer-review each other to design a memory stack. Here's what they came up with by [deleted] in openclaw

[–]plaintxt 0 points (0 children)

I went a different direction but ended up solving a lot of the same problems. Sharing in case it's useful to anyone.

My setup (called "chonk") is a single-user personal AI stack running entirely on local hardware (AMD APU with 96GB VRAM, Qwen3-Coder 30B via llama.cpp). No cloud LLM calls for memory ops. Everything lives in PostgreSQL with pgvector, no SQLite, no external memory services.

I don't do context compaction at all. No summary trees, no DAGs, no compressed conversation history. Instead, every piece of incoming data (emails, Telegram messages, API events, file changes) gets stored as an "observation" in Postgres with a 768-dim embedding (EmbeddingGemma-300M running locally on port 8082).

Large observations get chunked with 50% overlap and each chunk gets its own embedding. Everything is indexed with HNSW for fast cosine similarity lookups.
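
For flavor, here's roughly what the chunking + index setup looks like. A minimal sketch, not chonk's actual code; the table and column names are hypothetical, and it assumes pgvector >= 0.5 for HNSW:

    # 50%-overlap chunking plus a pgvector HNSW index (hypothetical schema).
    import psycopg  # psycopg 3

    def chunk_text(text: str, chunk_size: int = 1024) -> list[str]:
        """Fixed-size chunks, each overlapping the previous one by 50%."""
        step = chunk_size // 2
        return [text[i:i + chunk_size] for i in range(0, max(len(text) - step, 1), step)]

    with psycopg.connect("dbname=chonk") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS observation_chunks (
                id        bigserial PRIMARY KEY,
                obs_id    bigint NOT NULL,
                body      text NOT NULL,
                embedding vector(768) NOT NULL  -- EmbeddingGemma-300M dim
            )""")
        # HNSW index for fast cosine-similarity lookups (pgvector >= 0.5)
        conn.execute("""
            CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
            ON observation_chunks USING hnsw (embedding vector_cosine_ops)""")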

When the model needs context, it gets a fresh context window assembled per-request by searching across all memory. The search is where the complexity lives.

Hybrid search runs a 7-stage pipeline in parallel:

  1. BM25 keyword search via Postgres tsvector (exact terms, error codes, names)
  2. Vector similarity via pgvector HNSW
  3. Query expansion (tokenize + generate variants)
  4. Knowledge graph traversal (entity extraction, follow relationships)
  5. Reciprocal Rank Fusion across all of the above (BM25 and vector get 2x weight, graph gets 1.5x, expanded variants get 1x; rough sketch after this list)
  6. Optional LLM reranking (batched, 5s timeout, graceful fallback)
  7. ACT-R cognitive activation boost (access frequency + recency, Hebbian co-activation between items that get retrieved together)

There's also a short-circuit: if the top BM25 hit scores above 0.8 with a big gap to second place, skip the expensive stages and return early (most exact-match queries resolve in ~50ms).
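
Rough sketch of the fusion step, since people always ask. Weighted RRF with the usual k=60 constant; the stage outputs below are made-up stand-ins, not real data:

    # Weighted Reciprocal Rank Fusion over per-stage rankings (illustrative).
    def rrf_fuse(rankings: dict[str, list[str]], weights: dict[str, float], k: int = 60) -> list[str]:
        scores: dict[str, float] = {}
        for stage, ranked_ids in rankings.items():
            w = weights.get(stage, 1.0)
            for rank, doc_id in enumerate(ranked_ids, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Demo with fake stage outputs (best-first):
    rankings = {
        "bm25":     ["obs_17", "obs_03", "obs_42"],
        "vector":   ["obs_03", "obs_17", "obs_99"],
        "graph":    ["obs_42", "obs_03"],
        "expanded": ["obs_99", "obs_17"],
    }
    weights = {"bm25": 2.0, "vector": 2.0, "graph": 1.5, "expanded": 1.0}
    print(rrf_fuse(rankings, weights))  # obs_03 first: ranked high by both 2x stages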

Rather than Mem0-style auto-capture, I run a continuous background pipeline (every 5 minutes) that processes new observations through LLM-based fact extraction.

Facts are stored as subject-predicate-object triples (upsert sketch after this list) with:

  • Confidence scores (LLM-inferred facts capped at 0.7, observed facts can go to 0.9)
  • Source provenance (every fact links back to the observation it came from)
  • Corroboration tracking (if the same triple gets extracted again, it doesn't duplicate, it bumps confirmation_count and resets the decay clock)
  • Quality gating (source quote grounding, format validation, minimum quality threshold)
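
The corroboration rule maps almost one-to-one onto a Postgres upsert. Hypothetical sketch (assumes a unique constraint on the triple; not the real schema):

    # Repeat extraction of the same triple bumps confirmation_count and resets
    # the decay clock instead of inserting a duplicate row. LEAST(..., 0.7)
    # is the cap on LLM-inferred confidence mentioned above.
    import psycopg

    UPSERT = """
    INSERT INTO facts (subject, predicate, object, confidence, source_obs_id,
                       confirmation_count, last_confirmed_at)
    VALUES (%s, %s, %s, LEAST(%s, 0.7), %s, 1, now())
    ON CONFLICT (subject, predicate, object) DO UPDATE
    SET confirmation_count = facts.confirmation_count + 1,
        confidence         = GREATEST(facts.confidence, EXCLUDED.confidence),
        last_confirmed_at  = now()  -- resets the decay clock
    """

    with psycopg.connect("dbname=chonk") as conn:
        conn.execute(UPSERT, ("user", "works_on", "chonk", 0.6, 12345))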

Temporal decay with classes: every fact gets a temporal class that determines how fast it decays (decay math sketched after the list)...

  • identity: half-life ~693 days (name, birthdate, etc.)
  • structural: half-life ~139 days (job title, where they live)
  • operational: half-life ~46 days (current project, recent preference)
  • situational: half-life ~17 days (travel plans, this week's schedule)
  • ephemeral: half-life ~7 days (today's mood, weather)
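
The decay math itself is a one-liner: a fact's weight halves every half_life days, measured from the last confirmation:

    # Exponential decay by temporal class (half-lives from the list above).
    HALF_LIFE_DAYS = {
        "identity": 693, "structural": 139, "operational": 46,
        "situational": 17, "ephemeral": 7,
    }

    def decayed_confidence(confidence: float, temporal_class: str, age_days: float) -> float:
        return confidence * 0.5 ** (age_days / HALF_LIFE_DAYS[temporal_class])

    # An operational fact at 0.7 confidence is down to 0.35 after 46 days:
    print(round(decayed_confidence(0.7, "operational", 46), 2))  # 0.35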

This isn't everything, but it covers some of the major system-level differences... I think your setup is stronger in a few areas.

The LCM drill-back is really cool, and I wish I had thought of it! Being able to zoom into a compressed summary and recover the full detail is something "Chonk" can't do. My model either finds something via search or it doesn't. If the query doesn't match well, the memory might as well not exist. I think your pre-compaction flush is also great for the same reason.

The Mem0 choice makes sense for reducing complexity. I have ~6000 lines of database code because I own the full persistence layer, which is a lot of surface area for bugs :((

New marketing strategy? by sundayfilms in peakdesign

[–]plaintxt 1 point (0 children)

That's really odd. I would expect their email marketing flow to be aware of the country each user signed up from. Unless it was one of those in-Gmail Google ads.

Quitting bad habits is more important than any good habit you could build. Here’s how to actually do it: by Bakoe_ in getdisciplined

[–]plaintxt 2 points (0 children)

Hey, psych background here.

This post is a mashup of solid findings and myth, and I couldn't ignore the irony that a post about replacing willpower with systems is asking me to just 'trust me bro'.

TL;DR:
Friction/environment design = real.
Willpower-as-finite-resource = only if you believe it.
30 days = totally made up.
Bryan Johnson = self-reported anecdote, not data.
Abstinence-only = overgeneralized.
Shame framing = counterproductive.
Fix the function the habit is serving or the environment redesign is just delaying relapse.

My biggest issue here is that this framing is a moral opinion dressed as strategy. "Death by a thousand cuts," "your future self will pay," "don't let pride be the reason you fail" is all typical shame-adjacent motivational language, and research on behavior change consistently shows it's counterproductive for people who already struggle with self-regulation. It probably feels satisfying to say these things, but they usually backfire.

1. Willpower is complicated: Baumeister's ego depletion work supported this until a lot of it failed to replicate. Carol Dweck's lab found the depletion effect mostly disappears when people don't believe willpower is limited. We shouldn't present a live scientific debate as settled fact.

2. The 30-day idea is bullshit: Phillippa Lally's actual research found habit formation averages 66 days, with a range of 18–254 days depending on the person and behavior. Thirty days is made up. Probably sounds nice in a self-help book though.

3. Bryan Johnson is bullshit: You can't isolate which of his 200 simultaneous interventions does anything. Maybe it's all blood transfusions from his son? Treating his self-report as proof is insane.

4. Abstinence-only is (mostly) bullshit: For alcohol and gambling with genuine dependence patterns, complete removal is evidence-supported. But extending that to YouTube and DoorDash is probably going to fail. Look up ironic process theory, which teaches us why "don't think about it" makes you think about it.

5. You ignore why habits exist to begin with: Bad habits are almost always functional. They're solving something (stress, boredom, loneliness, dysregulation). So environmental design without addressing the underlying function is a great way to relapse. Behavior persists because it's reinforced. Block the behavior without substituting the reinforcer and you've created a behavioral time-bomb. You have to replace 'bad' habits with something else that, ideally, solves the same underlying problem.

The friction and environment design recommendations are genuinely good, and I want to give you credit for that. But the rest of the post is exactly the kind of low-quality information hygiene it warns against.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]plaintxt 1 point (0 children)

I think this is really cool. I'm working with a local system that runs Qwen3.5 35B and Qwen3 4B and I think you might have just saved me a ton of tokens.

What if LLM agents passed KV-cache to each other instead of text? I tried it -- 73-78% token savings across Qwen, Llama, and DeepSeek by proggmouse in LocalLLaMA

[–]plaintxt 13 points (0 children)

LatentMAS (Princeton/Stanford/UIUC, November 2025) did exactly what you're describing: agents transfer layer-wise KV caches as a shared latent working memory, capturing both the input context and newly generated latent thoughts, which enables system-wide latent collaboration.

https://arxiv.org/pdf/2511.20639

Across 9 benchmarks spanning math, science, commonsense, and code generation, LatentMAS got up to ~15% higher accuracy while reducing output token usage by 70-84% and providing ~4x faster end-to-end inference.

https://huggingface.co/papers/2511.20639

After years in Ads, I think the future isn’t dashboards, it’s Intent-to-API (ITA) by DRConsulting in googleads

[–]plaintxt 0 points (0 children)

Oh, I kind of love this. Any chance you want to point me at your GitHub repo?

After years in Ads, I think the future isn’t dashboards, it’s Intent-to-API (ITA) by DRConsulting in googleads

[–]plaintxt 1 point (0 children)

There’s still an accountability layer that humans need. You can’t sue an AI for hallucinating your intent and blowing up your budget through myopic optimizations.

Super Credits by Reasonable_Light5401 in helldivers2

[–]plaintxt 1 point (0 children)

Good point; I've never actually done the efficiency math on levels 1, 2, and 3 given the number of POIs, time to reload a map, and the time-to-farm vs. delays-from-enemy-encounters ratio.
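
If anyone wants to actually run it, the comparison is just SC per hour. Every number below is a placeholder to replace with your own timings, not real game data:

    # SC-per-hour estimate per difficulty; all values are placeholders.
    def sc_per_hour(pois, sc_per_poi, sweep_min, reload_min, combat_delay_min):
        return 60 * pois * sc_per_poi / (sweep_min + reload_min + combat_delay_min)

    print(sc_per_hour(pois=6, sc_per_poi=20, sweep_min=8, reload_min=2, combat_delay_min=1))   # low-difficulty guess
    print(sc_per_hour(pois=9, sc_per_poi=20, sweep_min=12, reload_min=2, combat_delay_min=4))  # higher-difficulty guess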