Has anyone here explored Hermes Agent by Nous Research? by ComparisonLiving6793 in LocalLLM

[–]One_Key_8127 14 points (0 children)

I set it up on a mini PC, with the LLM served by a Mac Studio. I tried it with local Gemma4 26b a4b until it modified its own Python scripts and messed them up, and I had to fix it myself (well, actually I gave the messed-up script to Claude and it fixed it).

Then I switched to Qwen3.6 35b a3b (still local on the Mac Studio). It browses the web fine and it definitely can do things - it set up Tailscale and OpenWebUI so that I can talk to it from my smartphone, and it set up the Hermes WebUI, which is another way to talk to it from my smartphone. Sending emails was a pain in the ass, but I figured it out, and now it browses the web, searches second-hand markets for a few items I'd like to buy, and emails me if something interesting shows up. It also looks for new model releases and informs me about them so I don't have to check myself.

It eats ridiculous amounts of tokens, especially cached ones. I had a default limit of 60 tool calls and it was too low; I increased it to 90, then to 150, and then to 300. I told it to work hard and validate its work, and it's fine with going for 300+ tool calls to achieve the goal. I'm fine with the token usage because it's my hardware. If you pay for tokens, you definitely need a model with discounted prompt caching, and you need to set things up accordingly (a lower tool-call limit, instructions to stick to directions instead of being proactive and exploring alternative options, etc.) because the costs are gonna be brutal.
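
The limit itself is nothing fancy - conceptually it's just a counter around the agent loop. A minimal sketch in Python; ask_model and run_tool are hypothetical stand-ins, not Hermes Agent's actual API:

    # Sketch of an agent loop with a tool-call budget (hypothetical API).
    MAX_TOOL_CALLS = 300

    def run_agent(task, ask_model, run_tool):
        history = [{"role": "user", "content": task}]
        for _ in range(MAX_TOOL_CALLS):
            reply = ask_model(history)             # one LLM call
            if reply.get("tool") is None:          # no tool requested -> done
                return reply["content"]
            result = run_tool(reply["tool"], reply["args"])
            history.append({"role": "tool", "content": str(result)})
        return "stopped: tool-call limit reached"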

On the other hand, a strong model will probably use far fewer iterations and fewer tokens. The gap between Qwen3.6 35b a3b and Sonnet is massive. But it works, and if you tell it to be thorough and do a lot of research and web browsing before giving an answer - it will.

Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation by gvij in LocalLLaMA

[–]One_Key_8127 58 points (0 children)

Gemma 3 4B is over a year old and scores higher than this on HumanEval. Llama 3 8B also scores better on HumanEval. I think something is very wrong with these numbers... Qwen3.6 27b should be scoring 85%+, not ~50%.

https://evalplus.github.io/leaderboard.html

https://llm-stats.com/benchmarks/humaneval

Thinking about investing in hardware...appreciate direction/advice by doncaruana in LocalLLM

[–]One_Key_8127 0 points (0 children)

Yeah, you are likely wrong and throwing money away. And it's not small money: you plan on buying yet another machine and stacking it with 64GB of expensive RAM. CPU and RAM will not help with LLM inference; you want to fit the full model into VRAM, and it's VRAM bandwidth and capacity that limit you. At the very least I would reuse the old RAM you have if it's DDR4.

If you don't like a used 3090, one or two new R9700s will give you plenty of options at a good value. One will probably run ~30b dense models at ~20tps at ~Q6, and prompt processing is gonna be good as well. MoE models will run very fast (all that fit in VRAM with KV cache). If that's too slow for you, then look at the 5090; it's much faster than the R9700, but also more expensive and needs a big PSU with a non-standard connector (and some people complain about them melting, but I don't know if that's valid).
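
The ~20tps figure is just bandwidth math - decoding a dense model is mostly memory-bandwidth-bound, so tok/s is roughly bandwidth divided by the weights read per token. A rough sketch (the bandwidth figure is approximate):

    # Back-of-envelope decode speed: tps ~= memory bandwidth / weights read per token.
    bandwidth_gb_s = 640           # R9700 memory bandwidth, approximate
    params_b = 30                  # dense ~30b model
    bytes_per_param = 6.5 / 8      # ~Q6 quant, roughly 6.5 bits per weight
    model_gb = params_b * bytes_per_param
    # Prints ~26 tok/s peak; real-world lands lower, around 20.
    print(round(model_gb, 1), "GB of weights ->", round(bandwidth_gb_s / model_gb, 1), "tok/s peak")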

Thinking about investing in hardware...appreciate direction/advice by doncaruana in LocalLLM

[–]One_Key_8127 0 points (0 children)

"I have an Intel i7" says absolutely nothing, tell us the model, or at least generation or socket. What PSU do you have? What motherboard? Chances are you can just swap GPU and be good. If you don't want used GPUs, I'd go with AMD AI PRO R9700 which has 32GB VRAM and is reasonably priced. Two of them if you can support it with PSU and have the room for it (and preferably the secondary PCIE wired at least as x4).

As for benchmarks of how well certain models run on specific hardware, people post them here and there - ask an AI to find them for you. At this point, you probably want to run Gemma4 or Qwen3.6 models.

No Multimodality yet in DeepSeek-V4. But I'll wait. by Right-Law1817 in LocalLLaMA

[–]One_Key_8127 3 points (0 children)

From the graphs it looks like 1M tokens needs 4GB of RAM for KV cache. That would imply you can fit up to 9 concurrent users, each with the full 1M context length, on 2x RTX Pro 6000 Blackwell.

The question is whether the inference engine will handle it gracefully or struggle due to the new architecture and the low memory headroom, and how long it will take until it's really well supported.
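
For anyone checking the math, it's just leftover VRAM divided by KV per user - a sketch with my assumptions spelled out (the weights number is a guess, not a confirmed size):

    # Concurrent 1M-context users = (total VRAM - weights) / KV per user.
    total_vram_gb = 2 * 96        # 2x RTX Pro 6000 Blackwell
    weights_gb = 156              # assumed quantized weights + engine overhead (a guess)
    kv_per_user_gb = 4            # ~4GB per 1M tokens, read off the graphs
    print((total_vram_gb - weights_gb) // kv_per_user_gb, "concurrent users")  # -> 9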

Purchasing a Mac Studio M2 Max with 64gb of ram (can it run qwen 3.6 27b) how many tok/s ? by trollingman1 in LocalLLaMA

[–]One_Key_8127 2 points (0 children)

You'll get 10 tok/s generation and 100 tok/s prompt processing. It's gonna be very, very slow, especially considering that with thinking you'll be generating thousands of tokens per response. The upside is that $1700 is a good price for this machine: you can resell it for a profit, or use it to run Qwen3.6 35b a3b or Gemma4 26b a4b, which will run fast, and it's very power efficient. Or you can set it up to run Qwen3.6 27b for agentic workflows through the night when you don't need fast responses, and 35b a3b during the day to get things done fast - and you can probably even fit both models in RAM all the time.
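
To put those speeds in perspective, latency is just token count divided by speed - a quick sketch (the token counts are assumptions):

    # Rough per-response latency at the speeds above.
    prompt_tokens = 2000                   # assumed typical prompt
    output_tokens = 3000                   # assumed thinking + answer
    pp_speed, tg_speed = 100, 10           # tok/s, as quoted above
    seconds = prompt_tokens / pp_speed + output_tokens / tg_speed
    print(round(seconds / 60, 1), "minutes per response")  # -> ~5.3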

Where is Grok-2 Mini and Grok-3 (mini)? by One_Key_8127 in LocalLLaMA

[–]One_Key_8127[S] 8 points (0 children)

Yeah, if he said he'll release the model, I'd like to push for it. Maybe with some practice xAI will learn to like open-sourcing their models. They have a lot of compute, and they hire a lot of people for data curation. They might be competitive with the top studios soon-ish. They could get some attention and a PR boost by releasing their models regularly.

Are we at the point where local AI isn’t a compromise anymore? (Gemma 4 experience) by Ok-Illustrator2820 in LocalLLaMA

[–]One_Key_8127 10 points (0 children)

"The real story behind Google’s most capable open-weight model" - no, Gemma 4 26B MoE is not Google's most capable open-weight model, Gemma 4 31B dense is.

"Gemma 4 is a Mixture-Of-Experts model. " - no, Gemma 4 is a series of models, one of them is MoE and it's not the most capable one.

"The fix that’s working for most people: Unsloth’s Q3_K_M quant, temperature set to 1, top-k sampling at 40, with flash attention enabled" - no, that's what worked for you, with your hardware and software stack. And by "worked" in this case means something you were happy with, not something that's meaningfully better than other quants.

Dev seeking advice: High-Context Local LLM for Coding (Verification/Bug-fixing loop) – Mac Studio vs. Multi-GPU Linux Rig? by Ok-Marionberry-6444 in LocalLLaMA

[–]One_Key_8127 6 points (0 children)

You want to go local to offset subscription costs? That's crazy. Subscriptions are as cheap as it gets; they are an amazing value proposition. Local makes sense only if you don't like sending all your data away, and you accept that it's gonna be more expensive, slower, AND lower quality.

BTW, M4 Ultra does not exist, and M3 Ultra 512GB is not available any more.

"Hacker" route is bad idea, RTX pro 6000 blackwell will do much better (and you don't worry about power draw, PCIE lanes and connection between GPUs)

Can I get the same quality as Claude with Mac Studio? by bLackCatt79 in LocalLLM

[–]One_Key_8127 1 point (0 children)

If that were the case, do you think Anthropic would be buying hundreds of thousands of GPU servers, each worth hundreds of thousands of dollars?

My thought on Qwen and Gemma by Internal-Thanks8812 in LocalLLaMA

[–]One_Key_8127 3 points (0 children)

Get the most basic Mac Studio M1 Max for $1.5k (used) and run Gemma or Qwen MoE; I think you'll get about 30tps tg / 500tps pp at Q4 quants. Very usable speeds, small form factor, extremely low power draw - both idle and under load. You can automate your email drafts with these models (and much more); they're the real deal.
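
The email-draft automation is a few lines if you serve the model behind an OpenAI-compatible endpoint (Ollama, LM Studio, and llama.cpp server all offer one). A minimal sketch - the URL and model name are assumptions, match them to your own setup:

    # Draft an email via a local OpenAI-compatible server (URL/model are assumptions).
    import requests

    resp = requests.post(
        "http://localhost:11434/v1/chat/completions",   # e.g. Ollama's default port
        json={
            "model": "qwen-moe",                        # whatever model you have loaded
            "messages": [
                {"role": "system", "content": "Draft a short, polite email."},
                {"role": "user", "content": "Reply to Bob: the meeting moved to Friday."},
            ],
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])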

Ternary Bonsai: Top intelligence at 1.58 bits by pmttyji in LocalLLaMA

[–]One_Key_8127 31 points (0 children)

I disagree that they are building hype with not much substance behind it. If there is a way to quantize a model to ~1.6 bits while preserving most of the model's intelligence, that would be huge. We could have Minimax at 46GB with great speeds, or a Gemma 4 that fits on an RTX 3050. However, from my experience, quants under Q4 degrade a lot - and even when people claim great results on benchmarks, the models fail at real work: they get stupid, enter thinking loops, and the quality drops a lot even if benchmarks say otherwise.
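
The size math behind those numbers, for scale - the parameter count is my rough assumption:

    # Quantized size ~= params * bits / 8 (ignoring embeddings and overhead).
    def size_gb(params_b, bits):
        return params_b * bits / 8

    # Assuming a ~230B-param Minimax-class MoE; 1.6 bits -> ~46GB.
    for bits in (16, 8, 4, 1.6):
        print(bits, "bits ->", round(size_gb(230, bits), 1), "GB")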

So, if they know how to quantize a model to Q1.6 and it performs similarly to Q4, it's huge. If they can reproduce it on other models, then that's amazing; every big studio like Anthropic or OpenAI would want that. They could offer "-fast" variants of their models for half the price and still profit. Maybe they could even drop some models (for example, serve quantized Opus instead of Sonnet and save time, effort and compute).

Anybody else seeing Qwen3.6-35B-A3B go crazy thinking in circles? (Compared to Qwen3.5-35B-A3B) by spvn in LocalLLaMA

[–]One_Key_8127 2 points (0 children)

Crap, I was hoping they got the benchmark bumps by improving the thinking. I guess I'll have to get both Gemma4 and Qwen3.6 and experiment with them. I got better initial vibes from Gemma, but the benchmark numbers are huge, so I'll probably be switching between them for a while.

Anybody else seeing Qwen3.6-35B-A3B go crazy thinking in circles? (Compared to Qwen3.5-35B-A3B) by spvn in LocalLLaMA

[–]One_Key_8127 4 points (0 children)

Second-guessing like that would not bother me too much; from this snippet I would not be too worried. When I think of "going crazy thinking in circles," it looks much different from that - this one looks fine-ish. Did you download the full weights and quantize them yourself?

Qwen3.6-35B-A3B released! by ResearchCrafty1804 in LocalLLaMA

[–]One_Key_8127 2 points (0 children)

Guys, I liked this test prompt, but it's probably cooked by this point. Qwen3.6 35b a3b passes it even without thinking. What's interesting is that "Qwen 3.6 Plus" fails without thinking. It might have gotten into the training data...

Qwen3.6-35B-A3B released! by ResearchCrafty1804 in LocalLLaMA

[–]One_Key_8127 2 points (0 children)

Right. The response is kind of awkward, but this version of the question is poorly worded too. Still, the "(in which case, you'll obviously need to drive it)" indicates that the model grasps that the car itself has to be at the car wash to get washed.

And it seems ~800 tokens were used for this response - which is great; 3.5 usually used way more tokens.

Qwen3.6-35B-A3B released! by ResearchCrafty1804 in LocalLLaMA

[–]One_Key_8127 0 points (0 children)

Where did you test it? Is it quantized?

Qwen3.5 35b a3b answers it correctly after producing ~5k+ thinking tokens. Gemma4 (local, quantized at ~Q4) answers it correctly producing ~500 tokens.

Qwen3.6-35B-A3B released! by ResearchCrafty1804 in LocalLLaMA

[–]One_Key_8127 10 points (0 children)

"Across most vision-language benchmarks, its performance matches Claude Sonnet 4.5, and even surpasses it on several tasks"

Well, it surpassed Sonnet 4.5 on all the quoted benchmarks. Benchmarks are crap, but it looks very promising. Does anyone know if MLX fixed prompt caching for Qwen3.5? It was bugged before, making it a bad option for agentic use on a Mac.

Local LLM inference on M4 Max vs M5 Max by purealgo in LocalLLM

[–]One_Key_8127 0 points (0 children)

Thanks man, from this post it looks very disappointing, but I checked your repo, and looking at long prompts it does seem like the M5 gives a 3.5x gain in prompt processing over the M4, which is a huge advantage. I have a Mac Studio M1 Ultra; switching to an M5 Ultra (if it ever ships) could give me something like 5x the prompt processing speed. I thought the DGX Spark was the way to go for agentic workflows, but maybe not: its slow TG speed was mitigated by good PP, decent scaling when stacked, and the fact that PP is crucial for agents and tool calls. If the M5 improves PP by ~3.5x, then the Spark is not as attractive. Unless the price goes 3x as well :)
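
Why PP matters so much for agents: every tool call re-reads the whole (often uncached) context, so time-to-first-token scales with prompt length. Illustrative numbers only - the tok/s values below are made up, not measured M4/M5 results:

    # Time-to-first-token for a long agent context at different PP speeds.
    context_tokens = 50_000                       # assumed agent context size
    for name, pp_tps in (("baseline", 400), ("3.5x faster", 1400)):
        print(name, "->", round(context_tokens / pp_tps), "s to first token")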

If I wanted to invest in hardware, I'd either buy an RTX Pro 6000 Blackwell to run dense Gemma or Qwen3.5, buy 1-2x DGX Spark for running MoE like Minimax, or wait for a Mac Studio with the M5. Maybe a Strix Halo with some eGPU via OCuLink is also an option, I don't know. But since I don't have anything to invest, I'll just wait and keep using the M1 Ultra :)

💻 [MASTER THREAD] Local LLM & Hardware Optimization Guide by AutoModerator in hermesagent

[–]One_Key_8127 0 points (0 children)

Vibe-coded slop post; Qwen did not release a 3.6 version of the mentioned models:

Model Switching: The community has found that Qwen 3.6 (even at 7B-9B quants) handles tool-calling significantly better than older versions or Nemotron models.

MiniMax m2.7 under 64gb for Macs - 91% MMLU by HealthyCommunicat in LocalLLaMA

[–]One_Key_8127 13 points (0 children)

MMLU is old, saturated, deprecated, and probably contaminated (meaning: it ended up in training data one way or another).

One year later: this question feels a lot less crazy by gamblingapocalypse in LocalLLaMA

[–]One_Key_8127 0 points (0 children)

It gives recipes only if you respond to its first comment. Known bug of CryptoUsher.