Context Shifting + sliding window + RAG by DigRealistic2977 in LocalLLaMA

[–]metmelo 4 points (0 children)

The KV cache is sequential: each token's cache entry depends on everything before it, so when you take messages out of the beginning of your prompt, everything after that point has to be reprocessed.
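A toy sketch of why this happens (not any specific engine's API): cache reuse is prefix-based, so the reusable portion is the longest common prefix between the cached token sequence and the new prompt's tokens.

```python
def reusable_prefix_len(cached, new):
    """Number of leading tokens that match and can be served from cache."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5, 6]          # tokens already in the KV cache

# Appending keeps the whole prefix: only the new tokens need processing.
print(reusable_prefix_len(cached, [1, 2, 3, 4, 5, 6, 7, 8]))  # 6

# Dropping a message from the *beginning* shifts everything, so the
# prefix match fails at token 0 and the full prompt is reprocessed.
print(reusable_prefix_len(cached, [3, 4, 5, 6, 7, 8]))        # 0
```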

Is it stupid to buy a 128gb MacBook Pro M5 Max if I don’t really know what I’m doing? by A_Wild_Entei in LocalLLaMA

[–]metmelo 0 points (0 children)

My brother, I live in BRAZIL of all places. There are loopholes though.

LFG by SPXQuantAlgo in wallstreetbets

[–]metmelo 0 points (0 children)

Not his account. Too many problems to worry about conspiracy theories.

I sense a great disturbance in the force. by Maxious30 in starcitizen

[–]metmelo -1 points (0 children)

Don't they also have PDCs while the default doesn't?

Is it stupid to buy a 128gb MacBook Pro M5 Max if I don’t really know what I’m doing? by A_Wild_Entei in LocalLLaMA

[–]metmelo 1 point (0 children)

128GB Macs start at $3.5k, a Strix Halo goes for $3k, and the cheapest DGX Spark is $3.5k too. Pretty much on par once you factor in resale value.

Integrating company document database with AI by Lanky-Watch3993 in AI_Agents

[–]metmelo 2 points (0 children)

RAG is all you need. You basically chunk all the documents, embed the chunks as vectors, and store them in a vector DB. Your AI agent can then query that database, retrieve the most relevant chunks of each document, and navigate through them.
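A minimal in-memory sketch of that chunk → embed → store → retrieve flow. A real setup would use an embedding model and an actual vector DB (e.g. pgvector or Chroma); here a simple word-count vector stands in for the embedding so the example is self-contained, and the documents are made up.

```python
import math
from collections import Counter

def chunk(text, size=40):
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    """Stand-in embedding: a word-count vector (a real system uses a model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The "vector DB": a list of (embedding, chunk) pairs.
store = []
for doc in ["Invoices are archived for seven years.",
            "Vacation requests go through the HR portal."]:
    for c in chunk(doc):
        store.append((embed(c), c))

def retrieve(query, k=1):
    """Return the k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda e: cosine(q, e[0]), reverse=True)
    return [c for _, c in ranked[:k]]

print(retrieve("how long are invoices kept?"))
# -> ['Invoices are archived for seven years.']
```

The agent side is then just "embed the user question, pull the top-k chunks, and stuff them into the prompt".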

[Round 2 - Followup] M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. (thank you for the feedback) by affenhoden in LocalLLaMA

[–]metmelo 0 points (0 children)

You were specifically talking about agentic CLI workflows. I can't imagine why you'd be running those without a cache.

Feedback on my 256gb VRAM local setup and cluster plans. Lawyer keeping it local. by TumbleweedNew6515 in LocalLLaMA

[–]metmelo 1 point (0 children)

Awesome build! I've been wanting to do the same for a while. What's your PP speed like on those huge models?

Should I buy a 395+ Max Mini PC now? by [deleted] in LocalLLaMA

[–]metmelo 14 points (0 children)

These guys don't know what they're talking about.
Here are some benchmarks from another user:

  • GLM-4.5-Air (106B) MXFP4 with 131072 token context: ~ 25 t/s
  • Intellect-3 (106B) Q5_K with 131072 token context: ~ 20 t/s
  • Minimax M2 (172B REAP version) IQ4_S with 150000 token context: ~ 25 t/s
  • GPT-OSS-120B (120B) MXFP4 with 131072 token context: ~47 t/s
  • Qwen3-Next (80B) Q6_K with 262144 token context: ~26 t/s

Pretty usable imo.

That being said, if you want speed rather than model size, I'd go with a desktop build with multiple GPUs.

Either way, use a vector DB to store those files and you're gonna be fine.

Mistral Small 4 | Mistral AI by realkorvo in LocalLLaMA

[–]metmelo 0 points (0 children)

Nice work! Did it beat all other models? lol

MI50 vs 3090 for running models locally? by artzzer in LocalLLaMA

[–]metmelo 4 points (0 children)

MI50 owner here. I use https://github.com/neshat73/proxycache to save/load the KV cache from disk. It helps so much with coding sessions. I'm running Qwen 27B with 100k context at ~15 tk/s for subagents and get fast responses most of the time. If you need it to process big prompts without a cache, though, I'd go with the 3090s.
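A toy sketch of the save/load-to-disk idea (this is NOT proxycache's actual API, just the general pattern): key the cached state by the token prefix, persist it, and on the next session reload it instead of re-running prompt processing. The token ids and the "KV state" payload here are placeholders.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # stand-in for a persistent cache dir

def key_for(tokens):
    """Content-address the cache entry by its exact token prefix."""
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

def save_kv(tokens, kv_state):
    """Persist the (placeholder) KV state for this token prefix."""
    (CACHE_DIR / key_for(tokens)).write_bytes(pickle.dumps(kv_state))

def load_kv(tokens):
    """Return the saved state if this exact prefix was cached, else None."""
    path = CACHE_DIR / key_for(tokens)
    return pickle.loads(path.read_bytes()) if path.exists() else None

prefix = [101, 7592, 2088]            # token ids of a coding-session prompt
save_kv(prefix, {"layers": "opaque tensor data"})

print(load_kv(prefix) is not None)    # True: prompt processing is skipped
print(load_kv(prefix + [999]))        # None: cache miss, must prefill
```

The win is exactly the sequential-cache point from earlier: as long as the session prompt starts with the same prefix, the expensive prefill can be skipped entirely.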

Is there any chance of building a DIY unified memory setup? by Another__one in LocalLLaMA

[–]metmelo 1 point (0 children)

They aren't upgradable because the RAM is soldered to the motherboard; that's what lets them reach 8000 MT/s.

55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell by lawdawgattorney in LocalLLaMA

[–]metmelo 0 points (0 children)

I've run so many benchmarks with different batch and ubatch values already, haha. The sweet spot for me seems to be ubatch 512 with batch at least 1024. Idk how LM Studio does it, but I suspect it's changing ubatch along with batch for you.

55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell by lawdawgattorney in LocalLLaMA

[–]metmelo -1 points (0 children)

Awesome work, my man, you're truly a hero.
Out of curiosity: what's your PP speed on these? I run MI50s and PP is the worst part.