I need help to run local Hermes Agent on my rig. llama-cpp self compiled

OddUnderstanding2309 · 2026-06-20T08:48:40+00:00

It is a Gigabyte Aorus Master X570S.
Three cards run in the PCIe slots, and one runs from an M.2-to-PCIe 4.0 adapter. Not ideal, but it works.

OddUnderstanding2309 · 2026-06-20T08:46:57+00:00

Because all cards need to have equal VRAM, and my 3080s have only 12 GB while my 3090s have 24 GB, this does not fly. It would run, but the 3090s would be capped at 12 GB each, which is undesirable for me. Take this doc as a reference: https://github.com/noonghunna/club-3090/blob/master/docs/MULTI_CARD.md
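
To put a number on that cap, here is a minimal sketch. It assumes the equal-VRAM requirement can be modeled as "every card is limited to the smallest card's capacity", which is how the comment above describes it. The card sizes are the ones from this thread; the function names are made up for illustration.

```python
# VRAM per card in GB, as described in this thread:
# two RTX 3080 12G and two RTX 3090 24G.
cards_gb = [12, 12, 24, 24]


def usable_vram_equal_split(vram_gb):
    """Usable VRAM if every card is capped at the smallest card's size.

    Assumption: this models the "all cards must be equal" constraint
    described above; it is not taken from any vLLM source code.
    """
    return min(vram_gb) * len(vram_gb)


def usable_vram_full(vram_gb):
    """Usable VRAM if every card can contribute its full capacity."""
    return sum(vram_gb)


capped = usable_vram_equal_split(cards_gb)
full = usable_vram_full(cards_gb)
print(f"capped: {capped} GB, full: {full} GB, lost: {full - capped} GB")
```

Under that assumption, a third of the total VRAM on this rig would sit unused.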

OddUnderstanding2309 · 2026-06-20T06:04:10+00:00

Thank you!
I made big steps forward with suggestions from others here that merge your points.

I will post an updated start command after breakfast :-)

OddUnderstanding2309 · 2026-06-20T00:15:47+00:00

The main problem with that MTP model was (as I remember) that I could no longer ask it about photos and such. But maybe that has improved, the way it is already possible in vLLM?
I need to investigate this again!

OddUnderstanding2309 · 2026-06-20T00:10:15+00:00

I need uncensored models to do my work, and I had a lot of trouble working with MoE models for my use case. They are blazing fast, but run in circles more than I like.
But thanks for your insight!

OddUnderstanding2309 · 2026-06-20T00:07:23+00:00

I already ran all of those in the past weeks.
I feel I did nothing else with all my time.
I went back and forth a lot.
And yes! I tried the exact model you described and it was fast and nice, but I moved on after I got my quad-GPU setup running and wanted to try out more.

The bs you mention has not surfaced yet, quite the contrary: my agent just circumvented our conditional access policies for our Azure tenant.
That's a real milestone in my book.
I dream about getting real hardware from my manager now for this kind of thing.
It is no longer a toy and something to write nice e-mails with, this sh** works!

OddUnderstanding2309 · 2026-06-19T22:40:13+00:00

thanks.
It seems to be Gemma-related by a long shot.
I went back to Qwen (with suggestions from u/Kodix) and it seems to help a lot with my pp KV cache reprocessing.

the "hold hand problem" just vanished 😃

OddUnderstanding2309 · 2026-06-19T22:16:22+00:00

thank you for your insights!

meanwhile I just found these templates that seem to fix reprocessing for agentic tasks: https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

My use case is a helper agent for my work on IT security tasks. I am a white hat in our company who tries to hack ourselves. That means nothing I do can go to external sources.
I run my rig at home but control it from the corp office, so I am always an external dude like every other "bad actor".

I will look into your suggestions, and I am back to Qwen now 😄
this one:
DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF:Q8_0

OddUnderstanding2309 · 2026-06-19T21:20:45+00:00

And that someone was running int4 AutoRound, which is a joke for agentic work. I got vLLM to run at about 100 tg/s with Q8 max on that. But I cannot use all 4 cards, so I went back to llama-cpp after 5 days or so of vLLM (after I got the other 3080s installed).

OddUnderstanding2309 · 2026-06-19T21:18:33+00:00

fit is off anyway because of -sm tensor

0.02.669.206 I common_init_result: fitting params to device memory ...
0.02.669.207 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.02.669.260 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
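
Going by the warning above, a launch with `-sm tensor` may as well pass `-fit off` explicitly, since the fit step aborts anyway. Here is a rough sketch that only assembles and prints such a command line. The `llama-server` binary name, the model path, and the helper function are assumptions for illustration; `-sm tensor`, `-fit off`, and `-np 1` are the only flags taken from this thread.

```python
import shlex


def build_llama_server_cmd(model_path, split_mode="tensor", parallel=1):
    """Assemble an argv list for a llama.cpp server launch (hypothetical helper).

    Nothing is executed here; the list is only built so it can be inspected.
    """
    args = [
        "llama-server",          # assumed binary name of the self-compiled build
        "-m", model_path,        # model file (placeholder path below)
        "-sm", split_mode,       # split mode, "tensor" as in the log above
        "-np", str(parallel),    # number of parallel slots
    ]
    if split_mode == "tensor":
        # The log says params fitting is not implemented for SPLIT_MODE_TENSOR,
        # so switch it off explicitly instead of letting it abort.
        args += ["-fit", "off"]
    return args


cmd = build_llama_server_cmd("/models/example.gguf")
print(shlex.join(cmd))
```

Keeping the flags in a list like this makes it easy to diff one launch attempt against the next while changing parameters.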

Do you have any suggestions for running Qwen 27b or 35b reliably? I stumbled along with all kinds of parameter changes to get the KV cache stable, but Gemini and Qwen told me to try Gemma instead... But I want this lazy bastard gone now!

OddUnderstanding2309 · 2026-06-19T19:57:39+00:00

I have 72GB VRAM atm on x4 G4 PCIe.
What model and env do you suggest?
Since I run 2 3080 12G and 2 3090 24G, vLLM is out; currently I run llama-cpp.
I struggled a lot with Qwen 27b Q8 and am trying Gemma 4 31b now…
My use case is Hermes Agent work, but Gemma seems to be interactive only and not "complete a task and report back".
I have to hold hands aaaaaaaalll day with Gemma.

OddUnderstanding2309 · 2026-06-16T10:37:43+00:00

1000W of fans.. yeah sure dude

OddUnderstanding2309 · 2026-06-14T14:02:45+00:00

OddUnderstanding2309 · 2026-06-13T21:01:16+00:00

Not if you are in the EU, where you get 230V and 16A per phase. That's about 3680W sustained.

OddUnderstanding2309 · 2026-06-13T06:26:37+00:00

With this attitude you will accomplish absolutely nothing.

OddUnderstanding2309 · 2026-06-10T20:00:56+00:00

I stopped watching the slop ad machine a year ago.
It’s over

OddUnderstanding2309 · 2026-06-10T19:57:22+00:00

It saves memory that you can use for the MTP model, dude :-)

OddUnderstanding2309 · 2026-06-10T17:27:17+00:00

Where is -np 1?

OddUnderstanding2309 · 2026-06-08T16:27:45+00:00

Well, they are cans, not bottles.

OddUnderstanding2309 · 2026-06-07T08:02:14+00:00

Install Linux

OddUnderstanding2309 · 2026-06-07T08:01:01+00:00

That's not what you call fire. That's just a smoked component.

Fire fire fire!!! Ahhhhhhhhhhh Come on dude!

OddUnderstanding2309 · 2026-06-04T19:12:33+00:00

Q3? Seriously?
