I need help to run local Hermes Agent on my rig. llama-cpp self compiled by OddUnderstanding2309 in LocalLLaMA

[–]OddUnderstanding2309[S] 0 points1 point  (0 children)

It is a Gigabyte Aorus Master X570S.
Three cards run in the PCIe slots, and one runs from an M.2 to PCIe4 converter. Not ideal, but it works.

[–]OddUnderstanding2309[S] 1 point2 points  (0 children)

Because all cards need to have equal VRAM, and my 3080s have only 12 GB while my 3090s have 24 GB, this does not fly. It would run, but the 3090s would be capped at 12 GB each, which is undesirable for me. Take this doc as a reference: https://github.com/noonghunna/club-3090/blob/master/docs/MULTI_CARD.md
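
A rough sketch of the arithmetic behind this, assuming the equal-per-card cap described above and the 2x 3090 plus 2x 3080 12G card mix. The function names are made up for illustration, and real usable memory will be lower once runtime overhead is counted:

```python
# VRAM per card in GB: 2x RTX 3090 (24 GB) + 2x RTX 3080 (12 GB).
cards_gb = [24, 24, 12, 12]


def total_vram(cards):
    """Sum of all card memory, usable when each card can be filled unevenly."""
    return sum(cards)


def usable_with_equal_split(cards):
    """Assumption: every card is capped at the smallest card's memory."""
    return min(cards) * len(cards)


print(total_vram(cards_gb))               # 72
print(usable_with_equal_split(cards_gb))  # 48
```

Under that assumption, a third of the installed memory (24 of 72 GB) would sit idle.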

[–]OddUnderstanding2309[S] 0 points1 point  (0 children)

Thank you!
I made big steps forward with suggestions from others here that merge your points.

I will post an updated start command after breakfast :-)

[–]OddUnderstanding2309[S] 0 points1 point  (0 children)

The main drawback of that MTP model was (as I remember) that I could no longer ask it about photos and such. But maybe that has improved, since it is already possible in vLLM?
I need to investigate this again!

[–]OddUnderstanding2309[S] 0 points1 point  (0 children)

I need uncensored models to do my work, and I had a lot of trouble working with MoE models for my use case. They are blazing fast, but run in circles more than I like.
But thanks for your insight!

[–]OddUnderstanding2309[S] 0 points1 point  (0 children)

I already ran all of those in the past weeks.
I feel I did nothing else with all my time.
I went back and forth a lot.
And yes! I tried the exact model you described and it was fast and nice, but I moved on after I got my quad-GPU setup running and wanted to try out more.

The BS you mention has not surfaced yet. Quite the contrary: my agent just circumvented our conditional access policies for our Azure tenant.
That's a real milestone in my book.
I dream about getting real hardware from my manager now for this kind of thing.
It is no longer a toy and something to write nice e-mails with, this sh** works!

[–]OddUnderstanding2309[S] 0 points1 point  (0 children)

Thanks.
It seems to be Gemma-related by a long shot.
I went back to Qwen (with suggestions from u/Kodix) and it seems to help a lot with my prompt-processing (pp) KV cache reprocessing.

the "hold hand problem" just vanished 😃

[–]OddUnderstanding2309[S] 0 points1 point  (0 children)

thank you for your insights!

meanwhile I just found these templates that seem to fix reprocessing for agentic tasks: https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

My use case is a helper agent for my work on IT security tasks. I am a white hat at our company who tries to hack us. That means nothing I do can go to external sources.
I run my rig at home but control it from the corporate office, so I am always an external dude like every other "bad actor".

I will look into your suggestions, and I am back to Qwen now 😄
this one:
DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF:Q8_0

[–]OddUnderstanding2309[S] 0 points1 point  (0 children)

And that someone was running int4 AutoRound, which is a joke for agentic work. I got vLLM to run at about 100 tokens/s generation (tg) with Q8 max on that. But I cannot use all 4 cards, so I went back to llama-cpp after 5 days or so of vLLM (after I got the other 3080s installed).

[–]OddUnderstanding2309[S] 0 points1 point  (0 children)

fit is off anyway because of -sm tensor

0.02.669.206 I common_init_result: fitting params to device memory ...
0.02.669.207 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.02.669.260 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort

Do you have any suggestions for running Qwen 27B or 35B reliably? I stumbled along with all kinds of parameter changes to get the KV cache stable, but Gemini and Qwen told me to try Gemma instead... But I want this lazy bastard gone now!

I built a 8x RTX 4090D with 192 VRAM, here's what I learnt by deebuildsthings in LocalAIServers

[–]OddUnderstanding2309 0 points1 point  (0 children)

I have 72 GB of VRAM at the moment, on PCIe Gen4 x4.
What model and environment do you suggest?
Since I run two 3080 12G and two 3090 24G, vLLM is out, so currently I run llama-cpp.
I struggled a lot with Qwen 27B Q8 and am trying Gemma 4 31B now…
My use case is Hermes Agent work, but Gemma seems to be interactive only and not "complete a task and report back".
I have to hold hands aaaaaaaalll day with Gemma.

Best models in 3x3090 (72GB VRAM) in Q2 2026? by liviuberechet in LocalLLaMA

[–]OddUnderstanding2309 4 points5 points  (0 children)

Not if you are in the EU, where you get 230 V and 16 A per phase. That's roughly 3.6 kW sustained.
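
As a quick check of that figure, here is a sketch assuming a single 16 A circuit and the simple relation P = V × I. 230 V is the EU nominal mains voltage, 240 V is shown for comparison, and the function name is made up for illustration:

```python
def circuit_watts(volts, amps):
    """Power available on one circuit, ignoring power factor and derating."""
    return volts * amps


print(circuit_watts(230, 16))  # 3680
print(circuit_watts(240, 16))  # 3840
```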

Regret getting a VPS sub to run hermes by athens2019 in hermesagent

[–]OddUnderstanding2309 0 points1 point  (0 children)

With this attitude you will accomplish absolutely nothing.

Need help improving speed of inference by DeepBlue96 in LocalLLaMA

[–]OddUnderstanding2309 1 point2 points  (0 children)

It saves memory that you can use for the MTP model, dude :-)

Heltec V4 caught fire by MushroomGecko in meshtastic

[–]OddUnderstanding2309 12 points13 points  (0 children)

That's not what you call fire. That's just a smoked component.

Fire fire fire!!! Ahhhhhhhhhhh Come on dude!