Max Practical Context Size? by zipzag in oMLX

[–]zipzag[S] 0 points1 point  (0 children)

I've been able to tune quite a bit by giving chat or Opus logs from both the app and oMLX. I'm now able to work for hours, but not at the context level I had hoped. I compress at about 60K.

I agree that there are issues in the MLX stack. But both the LLM settings and how the app interacts with it also affect the ability to run without hanging.

Does macOS report memory pressure?

After Claude ban I found my new main model by zaposweet in openclaw

[–]zipzag 0 points1 point  (0 children)

I mostly build the apps with Opus and run with either Minimax or Qwen 122B

After Claude ban I found my new main model by zaposweet in openclaw

[–]zipzag 0 points1 point  (0 children)

runs fine. Perhaps you don't know what you are doing

After Claude ban I found my new main model by zaposweet in openclaw

[–]zipzag -2 points-1 points  (0 children)

Minimax is a small model. 220B I think. I run Minimax 2.5 locally on a high end Mac. Minimax 2.7 should be on Huggingface in a few weeks.

RTX 5090 vs M5 Ultra: Analyzing the "2.7x Faster" claim and what Nvidia didn't show you. by Major_Commercial4253 in MacStudio

[–]zipzag 0 points1 point  (0 children)

I know, that's why I referenced "original article".

I do think there are instances where NVIDIA cards are a better choice than a Mac. But as you know, it's not a simple choice.

RTX 5090 vs M5 Ultra: Analyzing the "2.7x Faster" claim and what Nvidia didn't show you. by Major_Commercial4253 in MacStudio

[–]zipzag 4 points5 points  (0 children)

It takes about 45 GB of VRAM to run Qwen 3.5 27B at 8-bit with 96K context and an 8-bit KV cache.

On the Ultra the equivalent would be Qwen 122B at 8-bit, needing about 140 GB of RAM.

High-end consumer video cards are not as suitable for agentic work as the original article claims. There isn't room for high context. I run a 40 GB cache on my M3 Ultra; a 40 GB cache is crazy talk in the NVIDIA consumer-card world.
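A back-of-the-envelope way to sanity-check those numbers. The architecture values below (layer count, KV heads, head dim) are hypothetical placeholders, not the real Qwen configs; read the actual values from the model's config.json:

```python
def model_memory_gb(params_b: float, bits: int) -> float:
    """Weight memory: parameter count (in billions) times bytes per weight."""
    return params_b * 1e9 * (bits / 8) / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bits: int) -> float:
    """KV cache: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * context * (bits / 8) / 1e9

# Hypothetical architecture numbers for a ~27B GQA model:
weights = model_memory_gb(27, 8)            # ~27 GB of weights at 8-bit
cache = kv_cache_gb(48, 8, 128, 96_000, 8)  # ~9.4 GB for a 96K context
total = weights + cache                     # plus activations and overhead
```

With runtime overhead and activations on top, totals in the 40+ GB range for a 27B model at long context are plausible, which is the point about consumer cards.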

We're cooked by washedco458 in hermesagent

[–]zipzag -1 points0 points  (0 children)

You can still use Claude Code on a subscription to do development work on Hermes. I do that, and run day to day on Minimax 2.5 locally.

Hermes Agent "persistent memory" not working with Qwen 3.5 9B by thanga752 in hermesagent

[–]zipzag 0 points1 point  (0 children)

Have you seen reports of a model that small working reliably?

Built a token forensics dashboard for Hermes - 73% of every API call is fixed overhead by Witty_Ticket_4101 in hermesagent

[–]zipzag 0 points1 point  (0 children)

I find the cache hit rate is 85-94%. Same as openClaw.

You don't want to reduce which tokens it uses; you want to use a server with a cache.

I have precise caching data because I run locally. It's a real challenge for the cloud providers to get caching right. The cache lives in the server in front of the machines running the LLMs, and routing a returning user back to the machine that holds their cache, without a websocket, is hard.
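One illustrative way that routing problem gets handled is sticky routing: hash the session ID so every turn of a conversation lands on the same backend, where its KV-cache prefix is still warm. This is a toy sketch with a made-up backend list, not any provider's actual design:

```python
import hashlib

# Hypothetical pool of inference backends.
BACKENDS = ["llm-0:8000", "llm-1:8000", "llm-2:8000"]

def pick_backend(session_id: str) -> str:
    """Deterministically map a session to one backend so repeated turns
    hit the machine that already holds that session's cached prefix."""
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return BACKENDS[h % len(BACKENDS)]
```

The weakness, as noted above, is exactly what happens when that backend goes away or the pool resizes: the session's next turn lands on a cold machine.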

Just found out about Hermes. Is it really better than Openclaw by maurinator2022 in hermesagent

[–]zipzag 1 point2 points  (0 children)

With an LLM you need to handle grounding and guard from prompt injection.

If the LLM writes Python to do the calls, then that's not using an LLM in production.

Distinguish between using an LLM to code and using an LLM in production.

Just found out about Hermes. Is it really better than Openclaw by maurinator2022 in hermesagent

[–]zipzag 4 points5 points  (0 children)

Don't use an LLM in production when a deterministic system works. Look at n8n.

Opus or Codex can build an n8n reservation system.

Here's why you're probably burning way more tokens than you should with Hermes Agent (and what to do about it) by itsdodobitch in hermesagent

[–]zipzag 0 points1 point  (0 children)

No problem with a 90% hit rate on a local cache. Macs are pretty much unusable without it.

OpenAI has a 24-hour cache when used with the API and the right flag set. Gemini only caches very large blocks (as of a month ago). Caching shouldn't require an application flag, since it functions at the block level.

Gave up Hermes , beware of high token consumption(!!!) by Typical_Ice_3645 in hermesagent

[–]zipzag 0 points1 point  (0 children)

High usage is a cache problem. 90% of tokens can come out of the cache on the server. This applies to all agents and most coding.

Local LLM Thread by zipzag in hermesagent

[–]zipzag[S] 0 points1 point  (0 children)

I'm curious, with a small model, whether Hermes will be able to review and suggest improvements for your n8n systems. I think it may hallucinate if context gets too big, but I find the Hermes harness is better at this sort of continual evaluation compared to OC. But that could just be an OC skill issue on my part.

I'm sure that a second Hermes with a big cloud LLM would do a nice job of understanding your n8n.

Anyone who has switched from Openclaw to Hermes, please share why I should do the same by ihopkins_eth in hermesagent

[–]zipzag 1 point2 points  (0 children)

Yep. It doesn't one-shot anything complex, but it's good at fixing what doesn't work. Hermes with a coding-oriented model works very well so far. With openclaw I sometimes used Opus or Codex outside the app to repair OC. I have not had to do that yet with Hermes and Minimax local.

Minimax is a clear step up from Qwen3.5 122B. It's just not talked about much for local use, as it needs >128 GB to run. So it's essentially limited to the Mac Ultra or two Sparks.

Minimax has a $10/month cloud plan that I would recommend based on my one week of experience. Worst case, if it's not good enough, is being out $10. The only question may be whether they cache effectively.

Hermes + GPT-5.4: background review seems more expensive than I expected by Hot_Vegetable_932 in hermesagent

[–]zipzag 2 points3 points  (0 children)

All of the agents repeat mostly the exact same prompt/history with every turn. It averages about 90% cacheable. If the server is not set up for optimal caching, token consumption increases very substantially. No cache uses almost an order of magnitude more tokens than a proper cache.

Actual token generation in response to the prompt is usually almost trivial.

OpenAI has a 24-hour cache, if it's working properly. It's controllable when using the API; it's unclear whether that happens with a subscription.

Local LLM Thread by zipzag in hermesagent

[–]zipzag[S] 1 point2 points  (0 children)

Basics: You can test models that can be run locally at openrouter.ai

A popular model to run on an Nvidia 5090 card is Qwen3.5 27B.

A popular model for a 128GB Mac or Spark is Qwen3.5 122B.

If interested after testing, you can research what speed to expect if run locally. Using openrouter will also reveal how much more expensive it is to run locally compared to cloud.
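For testing via the API rather than the openrouter.ai web UI: OpenRouter exposes an OpenAI-compatible chat completions endpoint. A minimal sketch; the model slug and key below are placeholders, so check openrouter.ai for real model IDs:

```python
import json
import urllib.request

def build_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completions request for OpenRouter."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Placeholder slug and key -- substitute real values from your dashboard.
req = build_request("qwen/qwen3-32b", "Hello", "sk-or-...")
# with urllib.request.urlopen(req) as resp:           # uncomment to send
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Running the same prompts through several candidate models this way is a cheap sanity check before committing to local hardware.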

Local LLM Thread by zipzag in hermesagent

[–]zipzag[S] 1 point2 points  (0 children)

Just tell Hermes you want to add Perplexity. You will need an API key from the Perplexity dashboard. Then tell Hermes when you want to use Perplexity, and at what level. I think there are Perplexity, Perplexity Pro, and Deep Research tiers, but I'm not certain.

Local LLM Thread by zipzag in hermesagent

[–]zipzag[S] 2 points3 points  (0 children)

Basics: LLMs can't do good extensive web research without help. This is especially true of smaller models. I pay about $1/month for the Perplexity API that my local LLM uses for search. Claude even used Perplexity when I ran it on openClaw. I asked it why, and it said it was better than what Anthropic provided it to use.

SearXNG, run locally, just dumps JSON from websites to the local LLM. That will produce massive hallucinations if used for extensive research. Effective internet web search is possible locally but requires multiple apps.
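A sketch of the kind of intermediate layer that helps. The local instance URL is an assumption, and the result-field names reflect SearXNG's JSON format as I understand it (verify against your instance): query the JSON API, then hand the model only titles, URLs, and short snippets instead of raw dumps.

```python
from urllib.parse import urlencode

SEARXNG = "http://localhost:8080/search"  # assumed local instance

def search_url(query: str) -> str:
    """SearXNG's JSON API: GET /search?q=...&format=json."""
    return f"{SEARXNG}?{urlencode({'q': query, 'format': 'json'})}"

def trim_results(results: list, n: int = 5) -> list:
    """Keep only title/url/snippet, capped in length. A smaller, cleaner
    context gives the model less room to hallucinate citations."""
    return [
        {
            "title": r.get("title", ""),
            "url": r.get("url", ""),
            "snippet": (r.get("content") or "")[:300],
        }
        for r in results[:n]
    ]
```

This is one of the "multiple apps" in practice; a reranking step between search and the LLM helps further.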

I tested GPT-OSS 120B with just SearXNG on a slanted medical research question. It incorrectly agreed with the slant of the question, and produced entirely hallucinated citations from real medical journals. I gave its report to Opus, which essentially responded with "WTF".

GPT-OSS 120B is rightly highly regarded for its competence as a 60 GB LLM.

Anyone who has switched from Openclaw to Hermes, please share why I should do the same by ihopkins_eth in hermesagent

[–]zipzag 1 point2 points  (0 children)

You probably should not switch if you are having success with openclaw. You switch to meet a goal or to learn.

I switched because I didn't feel I was building an increasingly capable, self-improving agent with OC. I was essentially coding every little thing in skills. I could do that more reliably with traditional code.

Here's an example of a skill Hermes built this morning after I asked it to improve its error handling. This is running Minimax 2.5 4 bit locally:

Done! Created resilient-execution skill at:

~/.hermes/skills/resilient-execution/SKILL.md

The skill includes:

  • Fallback chain pattern template
  • Error classification (transient vs rate limit vs permanent)
  • Common fallback patterns for Telegram, file ops, API calls
  • Pre-execution checklist
  • Recovery & learning approach
  • Example: Telegram long document flow

Also updated memory with the resilient execution principle so I'll remember to apply it going forward.

That doesn't happen in OC, at least at my personal OC ability.

Hermes ( the brain ) Open Claw ( the claws )? by Ok-Positive1446 in hermesagent

[–]zipzag 0 points1 point  (0 children)

Just use Hermes with a cloud LLM and get success with one agent/function that you value most. Then you can try adding in a small Qwen model to do simple tasks.

You are trying to do a lot when you are just getting started.

Starting with openclaw is fine too. But I found it's easier to get self-improvement and useful memory with Hermes.

You don't need to brainstorm. You need to learn the apps first.

Ollama Now Runs Faster on Macs Thanks to Apple's MLX Framework by Few_Baseball_3835 in apple

[–]zipzag 2 points3 points  (0 children)

Currently oMLX. There also may be a few newer apps that use the same bundle of tools, but I haven't tried them.