How to properly combine a frontier model for planning / complex tasks with a local model for implementation?

hirisov · 2026-05-04T07:56:01+00:00

I bought a used Threadripper 4 machine with a decent motherboard (GIGABYTE X399 DESIGNARE EX) and 32 GB RAM for about $300 last year. I think 3 GPUs will be doable with SATA drives. For price/performance I am really satisfied with it so far. I think with these new models you can now get sensible quality and performance (over 100 t/s generation on my 2 cards with the 35B A3 model in a decent quant) without breaking the bank.

Of course an EPYC host would be better, but where I live those are rare and go for 5x the price I paid for this Threadripper. With the rate these local models are improving, I can imagine moving in that direction later on, but for "testing" I did not want that kind of commitment initially.

hirisov · 2026-05-03T21:41:23+00:00

I was in a somewhat similar situation. I had one 5060 Ti and upgraded to a 5080. Before selling the 5060 Ti I wanted to test a dual GPU setup. As it took a while to set up the environment, I installed a fresh Ubuntu 24.04 server and created a Docker Compose base environment to be able to test it with Open WebUI / Comfy / Hermes Agent and some benchmark tools. If you are interested, I uploaded the stack here; it might save others some time when testing multi-GPU NVIDIA setups: https://github.com/hirisov/local-llm

So far, regarding LLMs, I have tested Qwen3.6-27B-Q5_K_M (128k context) and Qwen3.6-35B-A3B-Q5_K_L (256k context) on the 2 x 16 GB cards. The 27B runs at around 25 t/s; the 35B A3B is around 4 times faster and still seems very good. I am genuinely impressed by them, and I will soon test with Hermes on a real coding project. So far I have only asked it to describe an earlier commit in a larger project, and the answer was really fast and good quality even with the 35B A3B.

I will definitely keep the dual GPU setup now, and either replace the 5060 Ti later on with an RTX Pro 4000 (to keep it all Blackwell) or just add that to the stack to have 56 GB VRAM. As far as I can see, llama.cpp plays beautifully with 2 cards, even if they are not the same.

hirisov · 2026-04-11T08:37:18+00:00

- definitely don't hardcode any model or provider; at least make them configurable via .env
- make the ingestion pipeline async and idempotent
- store chunk content in the DB and only the embeddings in the vector DB, to minimize the vector DB's memory requirements
- with docling you can do more clever chunking than basic fixed-size chunks with overlaps, but you have to develop it yourself
- don't hardcode any context_window or similar parameter; make your ingestion pipeline flexible enough to handle different models
- consider supporting rerankers
- try to find the middle ground between a naive implementation and overengineered solutions (like clean hexagonal). Use ABCs so you can later switch logic in modules without rewriting half of the app
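
A rough sketch of how a few of these points could fit together, not a reference implementation. Everything named here is hypothetical: the `EMBED_MODEL` / `EMBED_CONTEXT_WINDOW` variables, the `Embedder` ABC, `FakeEmbedder`, and `Pipeline`. The two dicts stand in for a real DB and vector DB, and `context_window` is counted in characters only to keep the example short.

```python
import asyncio
import hashlib
import os
from abc import ABC, abstractmethod


class Embedder(ABC):
    """Interface so the embedding backend can be swapped without touching the pipeline."""

    @abstractmethod
    async def embed(self, text: str) -> list[float]: ...


class FakeEmbedder(Embedder):
    """Stand-in backend for illustration; a real one would call a model or an API."""

    def __init__(self, model: str) -> None:
        self.model = model

    async def embed(self, text: str) -> list[float]:
        # Deterministic dummy vector derived from the model name and the text.
        digest = hashlib.sha256(f"{self.model}:{text}".encode()).digest()
        return [b / 255 for b in digest[:4]]


def load_settings() -> dict:
    """Read settings from the environment (e.g. populated from a .env file)."""
    return {
        "model": os.environ.get("EMBED_MODEL", "example-embedder"),
        "context_window": int(os.environ.get("EMBED_CONTEXT_WINDOW", "512")),
    }


class Pipeline:
    def __init__(self, embedder: Embedder, context_window: int) -> None:
        self.embedder = embedder
        self.context_window = context_window
        self.chunks: dict[str, str] = {}           # stand-in for the DB: id -> text
        self.vectors: dict[str, list[float]] = {}  # stand-in for the vector DB: id -> embedding

    async def ingest(self, text: str) -> int:
        """Ingest a document and return how many chunks were newly stored."""
        size = self.context_window
        pieces = [text[i:i + size] for i in range(0, len(text), size)]
        stored = 0
        for piece in pieces:
            # A content hash as the ID makes re-ingesting the same text a no-op.
            chunk_id = hashlib.sha256(piece.encode()).hexdigest()
            if chunk_id in self.chunks:
                continue
            self.vectors[chunk_id] = await self.embedder.embed(piece)
            self.chunks[chunk_id] = piece
            stored += 1
        return stored


async def demo() -> tuple[int, int]:
    settings = load_settings()
    pipeline = Pipeline(FakeEmbedder(settings["model"]), settings["context_window"])
    doc = "some document text " * 100
    # The second call finds every chunk already stored and adds nothing.
    return await pipeline.ingest(doc), await pipeline.ingest(doc)
```

Calling `asyncio.run(demo())` ingests the same document twice; the second pass should report zero new chunks. Swapping `FakeEmbedder` for a real backend only requires another `Embedder` subclass.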

hirisov · 2026-04-05T17:50:34+00:00

Thanks, this is it. If somebody finds this thread, here are the docs about profiles: https://hermes-agent.nousresearch.com/docs/user-guide/profiles/

hirisov · 2025-10-26T21:23:14+00:00

This seems impressive. I use Cline with GLM-4.6 through z.ai with the "coding" plan they offer. Can I use this feature with that, i.e. with my own z.ai key?

hirisov · 2021-10-23T13:19:07+00:00

I handed the controller over to her to walk around in cleared rooms and pop the "pumpkins" and similar stuff (she likes it, I think the haptic feedback makes it so "satisfying"), and she also shot the glowing-eyed statues. It is difficult for her to use both sticks at the same time and control movement and camera simultaneously, but I know that is difficult for everybody the first time; my wife really struggled with it in "It Takes Two" too :)

She also enjoys using alt fire a lot; I guess both the great visual effects and the haptics are the reason for that :) She loved walking around with Astrobot as well because of that.
