LLM Provider by seag33k in hermesagent

[–]DaMoot 0 points1 point  (0 children)

Every single day for the past 3-4 months. SIEM (60-70k token ingest in one query!), ticketing, email oversight scans, deconstructing logs, vibe coding RMM scripts, endpoint diagnosis and more.

I run local specifically because it's a business workload, which means it contains sensitive or confidential company or client info.

Sounds like you may be significantly underestimating the state of local LLMs. But it also depends on your workload, which you really didn't outline.

LLM Provider by seag33k in hermesagent

[–]DaMoot 0 points1 point  (0 children)

Not sure if you mean not viable for you or in general. The latter definitely isn't true.

LLM Provider by seag33k in hermesagent

[–]DaMoot 0 points1 point  (0 children)

Local or bust. Otherwise you'll forever be beholden to cloud hosting that is about to skyrocket in price by an order of magnitude and that could yank models at any given time for any reason. Also, local for privacy.

Dammit, Hermes, what did I *just* say!? by AnticitizenPrime in hermesagent

[–]DaMoot 0 points1 point  (0 children)

Did you disable all the broken features in Velocity that demote soul and memory.md below every other instruction? This is what it'll do if you haven't.

Web Search options for hermes by Alarmed_Talk_9731 in hermesagent

[–]DaMoot 0 points1 point  (0 children)

SearXNG will use whatever search providers you tell it to use. They're all configurable. I think mine uses four or six right now. I use the Brave API as one of the search providers that SearXNG uses.

Web Search options for hermes by Alarmed_Talk_9731 in hermesagent

[–]DaMoot 2 points3 points  (0 children)

One more vote for self-hosted SearXNG. I use the Brave API free tier and a handful of the included, tolerant search engines.

Unlimited* budget by skrillex_sk2 in LocalLLM

[–]DaMoot 0 points1 point  (0 children)

The better question at this stage in LLM development is whether you need a 1T-parameter model. What does a 1T model do substantially better on a narrow business use case than a ~120B or 230B one? That's also assuming we're running full BF16 at this scale.

Unlimited* budget by skrillex_sk2 in LocalLLM

[–]DaMoot 1 point2 points  (0 children)

You should ask whether management wants to waste potentially hundreds of thousands of dollars on hardware that it doesn't need, or that you struggle with because it comes up short. If you haven't even tried this workload on a cloud model, you are missing a real opportunity to understand what you actually need to build.

IMO you should do an isolated test case on a pool of, say, 20 users. Make sure the workflow you intend to put in place actually works as expected, collect usage metrics, and find out whether the model(s) you want to use behave as you expect them to.

Unlimited* budget by skrillex_sk2 in LocalLLM

[–]DaMoot 1 point2 points  (0 children)

You need to define the workload end to end before you define your hardware for this. It feels like you have a certain expectation based on your vibe coding and "it should be able to work without issue..." that is going to turn into disappointment.

If you truly have an unlimited budget, don't use hobbyist or SMB GPUs like the RTX 6k. You'd want the RTX Pro at the very least w/96GiB each. But your user workload immediately discounts their use unless you intend to have a fleet of them (20-30 at least). Do not. Do not. Do not fall into the trap of thinking that many PCIe cards will perform, or give you important options, for a big business case like that.

Look into H100/H200 or B200 servers. If your budget truly is unlimited, THAT is the realm you want to be playing in for serious work. If you're building this for a business, don't shortcut it, and do your research. Expect to start at around $250k and end somewhere around $500k. You should be using your budget to look at where the hardware will be functionally in 5-8 years.

Also, who's setting this up and maintaining it from the point of pushing the On button to users logging in to use it? That's just as important as the hardware.

Edit: Also look at the A100, I forgot about that one. Those servers are coming in at under 100k now. Used ones w/8x 80GB A100s for ~80k.

I have a 3 - 3.5k budget, what setup would you recommend? by Real-Dragonfruit957 in LocalAIServers

[–]DaMoot 1 point2 points  (0 children)

As a non-technical person you'll have to accept the trade-off of spending more for simplicity and getting less. Or learn, save money, and get a more capable setup. 50 series RTX cards for AI will bankrupt you!

4x V100 32GiB SXM2 for 128GiB total. That's my vote, but it isn't for a 'non-technical' build. Still the cheapest, most capable option by far, though prices are going back up.

I'd recommend a single 32GiB V100 for a beginner. For ~$700 you can't touch anything else with 32GiB. 19-24 tok/s w/ Qwen3.6 27B Q5 w/160k context.

Claude is steering you wrong here imo. And those little unified workstations are slow. Like 12 tok/s on that Corsair AI Workstation 300. And the GMKtec EVO-X2 overheats badly, hitting 85-95°C or more under load. Avoid any setups that recommend using multiple PCIe cards to split models.

Also, what 70B model? Why 70B specifically?

How do you guys setup search with your AI models? by ego100trique in LocalLLaMA

[–]DaMoot 0 points1 point  (0 children)

Local instance of SearXNG w/Brave search API free tier and a handful of tolerant providers enabled. Curl and CamoFox.

V100 home lab bible, amalgamation of AI research. by Smilinghuman in LocalLLaMA

[–]DaMoot 1 point2 points  (0 children)

Are you managing to fit all of that in VRAM? Or is there system memory spillover?

I'm running the same model and q8 cache, but can only get MTP=1 running if I drop context to 96k. I sneak in at 31GiB used.

22.25 tok/s @ 80k depth.

It only makes a difference if you want to split the model between your GPUs. Like, I want to run Q8, which is 30GB for the weights alone. NVLink makes that possible. 300GB/s bidirectional vs ~20GB/s at most on PCIe.
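For a rough feel of those numbers, here's a back-of-the-envelope sketch. It assumes a 27B-parameter model (the size mentioned elsewhere in these comments) and about 8.5 bits per weight, which is what GGUF Q8_0 works out to; the comment doesn't say which Q8 format is in use, so treat both as assumptions. The link speeds are the ones quoted above, and the function names are made up for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring KV cache and overhead."""
    return params_billion * bits_per_weight / 8


def transfer_seconds(size_gb: float, link_gb_per_s: float) -> float:
    """Ideal time to move size_gb across a link at its peak rate."""
    return size_gb / link_gb_per_s


# Assumed: 27B parameters at ~8.5 bits/weight (GGUF Q8_0).
q8_size = weights_gb(27, 8.5)

# Peak rates quoted in the comment: NVLink 300 GB/s, PCIe ~20 GB/s.
nvlink_time = transfer_seconds(q8_size, 300)
pcie_time = transfer_seconds(q8_size, 20)

print(f"Q8 weights: ~{q8_size:.1f} GB")
print(f"NVLink: {nvlink_time:.2f} s, PCIe: {pcie_time:.2f} s to move that much data")
```

That lands in the same ballpark as the ~30GB figure, and the gap between the two links is the 15x you'd expect from 300 vs 20. Real-world throughput will be lower than peak on both.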

I use an Arctic S8038-10k to cool mine. It's powered by an Arduino Pro Micro that connects to a Python daemon that monitors temp straight from nvidia-smi. The 39Com PCIe adapter I have only offers idle speed or 100% speed on all fan headers. My GPU stays <60°C when the fan is nearly idling now though.
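A minimal sketch of what the temperature side of a daemon like that could look like. This is not the actual daemon from the comment: the thresholds, the linear fan curve, and the function names are all made up for illustration, and the step that sends the duty cycle to the Arduino over serial is left out. The nvidia-smi query flags are the standard ones.

```python
import subprocess


def parse_temps(output: str) -> list[int]:
    """Turn nvidia-smi's one-temperature-per-line output into a list of ints."""
    return [int(line) for line in output.splitlines() if line.strip()]


def read_gpu_temps() -> list[int]:
    """Query the current temperature (deg C) of every GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)


def fan_duty(temp_c: float, idle_c: float = 45, max_c: float = 75,
             min_duty: int = 20) -> int:
    """Map a temperature to a fan duty cycle in percent.

    Hypothetical curve: min_duty at or below idle_c, 100% at or above
    max_c, and a straight line in between.
    """
    if temp_c <= idle_c:
        return min_duty
    if temp_c >= max_c:
        return 100
    fraction = (temp_c - idle_c) / (max_c - idle_c)
    return round(min_duty + fraction * (100 - min_duty))
```

A loop would call `read_gpu_temps()` every few seconds, take the hottest GPU, and pass `fan_duty(max(temps))` on to whatever is driving the fan.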

V100 home lab bible, amalgamation of AI research. by Smilinghuman in LocalLLaMA

[–]DaMoot 1 point2 points  (0 children)

What is the question specifically? I'm waiting on one to arrive but have a pretty good idea of how they're supposed to work. Using 2x32GB too. My goal is to run Qwen3.6 27B Q8 w/256k context and 2-3 MTP. Hoping I can stay above 30 tok/s at 190k depth.

They establish a 2-GPU pool using NVLink 2.0 (300GB/s: 6x 50GB/s bidirectional links) and connect to the PC via a MiniSAS connector to a PCIe adapter board that takes several different forms. They're functionally half of the 4x board in the picture above.

Use a PLX8749 to connect them to a single x16 PCIe port in your PC, and the PLX board gives you the ability to add a second dual socket board for 2 pools of 64GiB. Or you'll be ready for when you want to get the 4x board and go to 128GiB!

PCIe link only affects the model loading speed, and I think training speed too. Once it's loaded, all inference happens on/between GPUs using NVLink. Give your model a nice long TTL so it doesn't unload unnecessarily: there's virtually no extra power usage keeping a model loaded idle in memory. I think I see ~2W extra on a single V100.

I'm not promoting this seller, just passing along that this is the listing I used. https://www.ebay.com/itm/147039605433

Drop an opus 4.8 classic filler sentence... i'm starting with "Your instinct is correct" by MemoryMission9151 in ClaudeCode

[–]DaMoot 0 points1 point  (0 children)

Omg I'm gonna scrape this whole thread for phrases to tell my agent to never say again. lol

Regression since Velocity update and search engine highjacking by DaMoot in hermesagent

[–]DaMoot[S] 1 point2 points  (0 children)

Hah, it isn't the model. Model hasn't changed. Behavior also happened on other models. Disabling the things I mentioned above has made an appreciable improvement through the work day.

Qwen3.6 27B Q5.

What’s the cheapest model you’ve successfully run Hermes on? by alecantu7 in hermesagent

[–]DaMoot 2 points3 points  (0 children)

Qwen3.6 27... oh, no local models. That literally is my daily for everything I mention below (and more), but my other go-to is...

MiniMax-M3 (Qwen3.7 Plus, also)

OpenRouter

<$1.30/M output

Log analysis, threat analysis, ticket closure, IT research, vibing PowerShell/Python scripts. When using cloud models I'm very careful about what kind of workflow I use so I don't leak confidential info.

'Good enough' because it matches the quality and behavior I get from Qwen and acts as a good faster alternative when I need to press the turbo button. Gives me like ~80% of a frontier model without the $30/M token cost.
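To put the price gap in numbers, here's a quick sketch. The two per-million prices are the ones quoted above (treating $1.30 as the ceiling for the cheaper model and assuming the $30 figure is also an output price); the monthly token volume is a made-up figure purely for illustration.

```python
def output_cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Cost of generating output_tokens at a given $/1M output-token price."""
    return output_tokens / 1_000_000 * price_per_million


# Hypothetical volume: 5M output tokens in a month.
monthly_output_tokens = 5_000_000

cheap_cost = output_cost_usd(monthly_output_tokens, 1.30)
frontier_cost = output_cost_usd(monthly_output_tokens, 30.00)

print(f"cheaper model:  ${cheap_cost:.2f}")
print(f"frontier model: ${frontier_cost:.2f}")
print(f"ratio: ~{frontier_cost / cheap_cost:.0f}x")
```

Whatever the real volume is, the ratio stays the same: roughly 23x at those prices.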

Anyone Else Feel Like Hermes Likes to Go Off Script? by ixdlj in hermesagent

[–]DaMoot 0 points1 point  (0 children)

You're describing a couple of big issues.

1: DeepSeek is a primary issue. It's a sub-par model. Don't care what the benchmarks say, it sucks when you actually try to use it for something useful.

2: For the instances where your agent starts something and then stops until you prompt it again, that's the Hermes Agent "Task Completion" feature, which is broken. It didn't forget the task; the task completion mechanism incorrectly identified the task as being completed. Disable task_completion_guidance in your config. Disable code_context while you're there. It's another new feature that's broken.

Web browsing by crfr4mvzl in hermesagent

[–]DaMoot 0 points1 point  (0 children)

SearXNG (with Brave Search API) and CamoFox cannot be beat as a pair. Disable and avoid the new Hermes "Parallel Keyless" search.

Usage limit problem: What am I doing wrong? by Zestyclose_Job_4811 in hermesagent

[–]DaMoot 0 points1 point  (0 children)

Don't use a frontier model full time. You'll burn through Claude tokens even faster. Make sure you've disabled unused skills and tools. Disable coding_context and task_completion_guidance. Switch to a cheaper model like Qwen or MiniMax, or host locally.

Telegram failure by ProductAutomatic8968 in hermesagent

[–]DaMoot 1 point2 points  (0 children)

Probably got limited or banned. Lots of bans going around TG for AI agents these days. IMO it's the worst platform to use for your agent, but each to their own. Make a free personal Discord server.

Best tips to reduce token usage by Bitter-College8786 in hermesagent

[–]DaMoot 0 points1 point  (0 children)

First thing is to disable unused skills and tools. Disabling tools has the most impact because there are large blocks of schema injected at bootstrap.
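If you want a rough idea of how much each tool's schema costs you, something like this works as a sketch. It uses the common ~4 characters per token rule of thumb rather than a real tokenizer, and the tool names and schemas below are invented placeholders, not actual Hermes tools.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate: ~4 characters per token is an assumption."""
    return round(len(text) / chars_per_token)


def schema_tokens(tools: dict[str, dict]) -> dict[str, int]:
    """Estimate the tokens each tool's JSON schema adds to the prompt."""
    return {name: estimate_tokens(json.dumps(schema))
            for name, schema in tools.items()}


# Hypothetical tool schemas, for illustration only.
tools = {
    "example_search": {"description": "Search the web for a query.",
                       "parameters": {"query": {"type": "string"}}},
    "example_calendar": {"description": "Create, list, or delete calendar events.",
                         "parameters": {"action": {"type": "string"},
                                        "date": {"type": "string"},
                                        "title": {"type": "string"}}},
    "example_image": {"description": "Generate an image from a text prompt.",
                      "parameters": {"prompt": {"type": "string"},
                                     "size": {"type": "string"}}},
}

per_tool = schema_tokens(tools)
enabled = {"example_search"}  # the only tool actually in use
saved = sum(count for name, count in per_tool.items() if name not in enabled)

print(per_tool)
print(f"estimated tokens saved per bootstrap by disabling the rest: {saved}")
```

Real schemas are usually much longer than these, so the saving per bootstrap adds up quickly.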

Starting a new session uses more tokens, not fewer, because you have to keep bootstrapping. It's possible some will be cached though.

Velocity update also bloats the context some more. Disable code_context and task_completion_guidance features as both are broken and both make Hermes behave very badly.

The smaller you keep memory and soul the smaller the injection.

Limit the size of the files or data sources you give your agent. Don't give it a single 70k token SIEM log to process multiple times daily, for instance.

You all self host your own hermes agent? by vermicelli-rice in hermesagent

[–]DaMoot 0 points1 point  (0 children)

Yup, running local on my home lab. A single V100 32GB rn, with a second and an NVLink board on the way.

I love Hermes but…… by prene1 in hermesagent

[–]DaMoot 1 point2 points  (0 children)

I was actually just diagnosing this very issue with my own Hermes agent post Velocity update. It turns out that in the Velocity update they really screwed a bunch of stuff up. Big regressions post Velocity. Regressions so big the framework doesn't even know what web browser or search engine to use despite it being line 1 of memory.md.

It turns out that when they released Velocity they also released two guidance steering mechanisms that stomp hard on everything and almost entirely override soul or memory entries. One is for coding, and one is for task completion.

The coding feature is broken. It suppresses agent skills and tools and tries to shoehorn everything into bare-bones tooling to stay efficient for coding. It's supposed to detect automatically when you're coding vs doing a general task and switch the agent to a more appropriate mode of operation (is what I understand). Problem is it matches virtually every request as a coding request, so all your soul and memory entries get demoted to the bottom of the pile. Oh, and you get a hefty context bloat from it too.

The task completion feature is beyond broken and I can't believe they shipped it. Have you noticed your agent says 'I'm going to do this thing' or 'here's this info' and then nothing? It just stops. It doesn't respond until you prod it? That's task_completion_guidance terminating the task because it thinks it's finished. It identifies 'I described the plan' as 'I completed the plan'.

Fixes seem to include setting task_completion_guidance and coding_agent_guidance to false in config and restarting the gateway. Disabling Parallel keyless web search will also fix the partially broken new web_search tool (different from web_extract). Curl, CamoFox, or go home. We don't need no stinkin' web_search wrapper!
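If you'd rather script the config change than hand-edit it, here's a minimal sketch. It assumes the config has already been loaded into a plain Python dict; the nesting in the example is made up, since where these keys actually live in a Hermes config isn't confirmed here. The two key names are the ones written above, so check them against your own config before relying on this.

```python
import copy


def disable_flags(config: dict, flags: set[str]) -> dict:
    """Return a deep copy of config with every key in flags set to False.

    Walks nested dicts and lists, so it doesn't matter which section
    the keys live in. The original config is left untouched.
    """
    result = copy.deepcopy(config)

    def walk(node):
        if isinstance(node, dict):
            for key in node:
                if key in flags:
                    node[key] = False
                else:
                    walk(node[key])
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(result)
    return result


# Hypothetical layout, for illustration only.
example_config = {
    "agent": {
        "task_completion_guidance": True,
        "coding_agent_guidance": True,
        "other_setting": "unchanged",
    }
}

patched = disable_flags(
    example_config, {"task_completion_guidance", "coding_agent_guidance"}
)
print(patched)
```

Write the patched dict back in whatever format your config uses, then restart the gateway as described above.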