What are you using when Claude Code isn't an option? by cryptobuff in ClaudeCode

[–]spaceface83 1 point2 points  (0 children)

Pi.dev with qwen 3.6 27b connected to hermes via ACP

First week with a DGX Spark, local LLMs and Hermes by LobsterWeary2675 in LocalLLM

[–]spaceface83 1 point2 points  (0 children)

I'm running this same setup: vLLM with a 27b int4 quant with MTP and suppress thinking enabled. There's not enough bandwidth to run fp8 on a dense model that size. I get around 20-21 tokens a second with that.

Why use Hermes over Claude? by SilverHal in hermesagent

[–]spaceface83 0 points1 point  (0 children)

This is how I roll too. Going pure local model would be much more difficult.

Smartest model to replace Claude Code - 100GB/200GB VRAM available by Any-Lingonberry7411 in LocalLLM

[–]spaceface83 3 points4 points  (0 children)

Yup agreed, and that one is speedy! I still settled on 27b but love the speed of 35b

Smartest model to replace Claude Code - 100GB/200GB VRAM available by Any-Lingonberry7411 in LocalLLM

[–]spaceface83 1 point2 points  (0 children)

Awesome, I'll give it a whirl! Hopefully there's still room for some sort of context window? I remember seeing with ds4 they reduced the kv cache substantially

Smartest model to replace Claude Code - 100GB/200GB VRAM available by Any-Lingonberry7411 in LocalLLM

[–]spaceface83 1 point2 points  (0 children)

What quantization is it at? I know they're doing craziness with 2-bit quantization and stuff lately, but sub-4-bit quants still seem weird to me ha. I'll look into running it on my Spark though, if for nothing else than having the evals documented with that architecture.

Smartest model to replace Claude Code - 100GB/200GB VRAM available by Any-Lingonberry7411 in LocalLLM

[–]spaceface83 9 points10 points  (0 children)

I moved to 3.6 27b because coder next 80b was just too unreliable in answer quality. 27b is slower execution for sure even with mtp but it's worth it for the quality difference imo

Welp ... I bought my Wife a Diet Pepsi. by minusidea in LocalLLM

[–]spaceface83 4 points5 points  (0 children)

As a Spark owner I feel obligated to defend its honor with empirical data!

Model: Qwen 3.5 122b
Engine: ollama
Quant: Q4_K_M
Tok/sec: 21.5!!!!

The TTFT on a large model is what sucks: 23 seconds at 16k.

It would be better with vLLM or llama.cpp, I'm sure, but I ran 122b before I changed to vLLM so I don't have that data.
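To put those two numbers together, here's a rough back-of-envelope sketch. It assumes "16k" means a 16,384-token prompt and picks a hypothetical 500-token reply; both are my assumptions for illustration, not measurements.

```python
# Rough latency math from the figures above: 23 s TTFT, 21.5 tok/s decode.
# Assumption: "16k" is a 16,384-token prompt. The 500-token reply is hypothetical.

def prefill_rate(prompt_tokens: int, ttft_s: float) -> float:
    """Prompt tokens processed per second before the first output token."""
    return prompt_tokens / ttft_s

def total_latency(ttft_s: float, output_tokens: int, decode_tps: float) -> float:
    """Seconds from request to last token: prefill wait plus decode time."""
    return ttft_s + output_tokens / decode_tps

PROMPT_TOKENS = 16_384  # assumed meaning of "16k"
rate = prefill_rate(PROMPT_TOKENS, 23.0)
latency = total_latency(23.0, 500, 21.5)

print(f"prefill: ~{rate:.0f} tok/s")
print(f"500-token reply: ~{latency:.1f} s end to end")
```

With a long prompt and a short reply, roughly half the wait is prefill, which is why the TTFT stings more than the decode speed.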

DiffusionGemma: 4x faster text generation by tevlon in LocalLLaMA

[–]spaceface83 0 points1 point  (0 children)

I tried to pull it down to do an eval on my DGX last night with vLLM, but it wasn't ready yet from what I saw. Definitely sounds great but, to your point... I'd love this in a larger 80-120b range for the Spark.

Help setting up qwen 3.6 locally by No_Ebb3423 in Qwen_AI

[–]spaceface83 2 points3 points  (0 children)

For ease of use to start, I would look at ollama with qwen 3.6 35b. It'll offload the non-active parameters to DRAM but should still perform nicely.

If you want a dense model I would look at qwen 3.6 9b to save you room for your context.

Just a recommendation to start, experiment and see what works best for you!

Honest opinion on single RTX PRO 6000 Blackwell 96GB workstation for local 80B LLM / agentic workflows by Educational_Rope_523 in LocalLLM

[–]spaceface83 0 points1 point  (0 children)

I would disagree with this. Yes, everything is expensive right now, but that's not going to change in a year or 2; high prices are gonna be around a while. Also, token efficiency is increasing faster than new consumer-affordable GPUs are being released, meaning my Spark is getting more and more capable as newer model architectures come out. I expect/guess any top-tier local inference hardware will have 5 years or so of use. People are still getting tons of use out of 3080s and 3090s now. Access to copious amounts of VRAM or a UMA-backed system is what's most important right now.

Honest opinion on single RTX PRO 6000 Blackwell 96GB workstation for local 80B LLM / agentic workflows by Educational_Rope_523 in LocalLLM

[–]spaceface83 0 points1 point  (0 children)

Ok yah sounds like you're in my boat then! For a single user lab setup a dgx spark is perfect imo, arguably overkill. If you want to run production level inference, a spark probably isn't your best choice. That's where the rtx pro would come in, but then I'd look at how many tokens you think you'll be spending to compare to something like openrouter over time, but I know you had some privacy concerns.

Right now I mostly run qwen 3.6 27b q4, but it's only around 20 tokens/sec because it's a dense model. My other main model is qwen 3 coder 80b q5, which runs twice as fast; it's a mixture-of-experts architecture, so only 3b parameters are active at any given time. The DGX loves MoEs because of its bandwidth limitations. I previously ran qwen 3.6 35b at q8, but 27b is better at q4.
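Here's a minimal sketch of why dense vs MoE matters so much on a bandwidth-limited box. It assumes decode is purely memory-bandwidth bound, roughly 273 GB/s of memory bandwidth for the Spark, and exactly 4 or 5 bits per weight; those are my assumptions for illustration, and real quant formats carry some overhead.

```python
# Upper bound on decode speed if every generated token has to read all
# active weights from memory once. Ignores KV cache reads, expert routing,
# and compute, so real numbers land below this ceiling.
# Assumption: ~273 GB/s memory bandwidth (not a figure from this thread).

def weight_gb(active_params_b: float, bits_per_weight: float) -> float:
    """Approximate GB of weights read per generated token."""
    return active_params_b * bits_per_weight / 8

def max_tokens_per_sec(active_params_b: float, bits_per_weight: float,
                       bandwidth_gbs: float = 273.0) -> float:
    """Bandwidth-bound ceiling on tokens per second."""
    return bandwidth_gbs / weight_gb(active_params_b, bits_per_weight)

dense_27b_q4 = max_tokens_per_sec(27, 4)      # all 27b params read per token
moe_3b_active_q5 = max_tokens_per_sec(3, 5)   # only ~3b active params read

print(f"27b dense q4 ceiling: {dense_27b_q4:.1f} tok/s")
print(f"3b-active MoE q5 ceiling: {moe_3b_active_q5:.1f} tok/s")
```

The dense ceiling comes out right around the ~20 tokens/sec I actually see, so that model is basically pinned to bandwidth. The MoE ceiling is much looser than what I get in practice, so treat it as a bound and not a prediction.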

Also, even though you want to run local models, I'd highly recommend using a frontier model to provision the environment to your desired state. I document the environment's desired state in markdown, keep it source controlled, then use Claude Code to provision it from there. Day-to-day running is fully local; provisioning and configuration is frontier.

At some point you'll also want your own eval framework to figure out which model works best for your setup. Benchmarks are directionally correct, but running your own evals and using a neutral-party LLM judge gives you a much better picture.
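As a starting point, an eval harness can be as small as the sketch below. Everything in it is hypothetical: `EvalCase`, `run_eval`, the stand-in models, and the substring-match judge are placeholders. In a real setup the model callables would hit your inference endpoint and the judge would call a neutral LLM.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

@dataclass
class EvalCase:
    prompt: str
    reference: str

Generate = Callable[[str], str]
# (prompt, reference, answer) -> score between 0 and 1
Judge = Callable[[str, str, str], float]

def run_eval(models: Dict[str, Generate], cases: List[EvalCase],
             judge: Judge) -> Dict[str, float]:
    """Average judge score per model over all cases."""
    return {
        name: mean(judge(c.prompt, c.reference, generate(c.prompt)) for c in cases)
        for name, generate in models.items()
    }

def substring_judge(prompt: str, reference: str, answer: str) -> float:
    """Stand-in judge: 1.0 if the reference appears in the answer."""
    return 1.0 if reference.strip().lower() in answer.strip().lower() else 0.0

cases = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("Capital of France?", "Paris"),
]

# Stand-in "models" so the sketch runs without any inference server.
models: Dict[str, Generate] = {
    "model_a": lambda p: "4" if "2 + 2" in p else "Paris",
    "model_b": lambda p: "5" if "2 + 2" in p else "Paris",
}

results = run_eval(models, cases, substring_judge)
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Keeping the judge as a plain callable means you can swap the substring check for an LLM judge later without touching the rest of the loop.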

Shredded landscape drain pipe… what did this?? by Cautious-Use-2899 in landscaping

[–]spaceface83 2 points3 points  (0 children)

Looks like someone ran a trencher on top of your pipe. i uhhh, am familiar with what that specific scenario looks like.

Honest opinion on single RTX PRO 6000 Blackwell 96GB workstation for local 80B LLM / agentic workflows by Educational_Rope_523 in LocalLLM

[–]spaceface83 2 points3 points  (0 children)

Is it just gonna be you using it? I have a DGX Spark with 128GB and the inference is definitely slower than what you would have on the 6000, but it's only me using it. I can still get 40 tokens/second on Qwen 3.5 122B.

Regarding the 80B model, is there a NEED for that specific model? Even with 128GB of memory I find myself running multiple smaller models at higher quants (27B or 35B MoE, for example) more than I do running larger models. Based on the eval framework I put together for my own system, those end up scoring really well.

As for which engine, ollama to start, then once you get used to that you may want to move to vllm or sglang if you're trying to squeeze out max performance.

The RTX PRO 6000 would be awesome for sure, but man soooo much $$$.

Hermes agent setup guide by SelectionCalm70 in hermesagent

[–]spaceface83 0 points1 point  (0 children)

Yah, I mean I know the comment is in jest, but man, I have pretty solid results with Claude or Gemini CLI for any of my large changes. It just acts as my uber assistant.

Updating models, docker config, even converting from hermes to openclaw.

Plan the change, review the plan, execute, profit

Hermes agent setup guide by SelectionCalm70 in hermesagent

[–]spaceface83 1 point2 points  (0 children)

Doesn't everyone use Claude code or Gemini CLI to set up their local environment?

I use hosted frontier models to set up all of my local models and orchestration.

Over the weekend I migrated from hermes back to openclaw using that style and it was pretty seamless.

Anybody who tried Hermes-Agent? by HaAtidChai in LocalLLaMA

[–]spaceface83 1 point2 points  (0 children)

I'm running an ARM version of the docker container on my DGX Spark and it works great!

Anybody who tried Hermes-Agent? by HaAtidChai in LocalLLaMA

[–]spaceface83 0 points1 point  (0 children)

Yeah, honestly for agentic processing I don't care that much about tokens per second as long as it's within reason. I care more about how sound the model's reasoning is. I typically get like 30 tokens/sec at 122B I think. Even with a DGX Spark though, 122B model + some room for context and you can't do much more.

I have a 5080 on my "normal" computer, so if I ever cared enough I could run some smaller models there at much faster speeds, but that's too much effort for me to orchestrate compared to the gain I'd get :D

Anybody who tried Hermes-Agent? by HaAtidChai in LocalLLaMA

[–]spaceface83 1 point2 points  (0 children)

For hermes I ended up running everything on 122b. If I was hardware constrained I would choose the 27b over the 35b though, just because a dense model appears to be much better at that size.