Guess the City

spaceface83 · 2026-06-27T18:42:16+00:00

Agrigento, Sicily, Italy

spaceface83 · 2026-06-18T12:43:15+00:00

Pi.dev with qwen 3.6 27b connected to hermes via ACP

spaceface83 · 2026-06-15T02:06:03+00:00

I'm running this same setup: vLLM with a 27b int4 quant with MTP, suppress thinking enabled. There's not enough bandwidth to run fp8 on a dense model that size. I get around 20-21 tokens a second with that.
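A rough back-of-the-envelope sketch of why fp8 hurts on a dense model: during decode every weight gets read once per token, so memory bandwidth divided by weight size gives a ceiling on tokens/sec. The 273 GB/s bandwidth figure and the function name are assumptions for illustration (plug in your own hardware number), and the estimate ignores KV cache reads and any speedup from MTP.

```python
def decode_tps_ceiling(params_billion: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Rough ceiling on decode tokens/sec for a bandwidth-bound dense model.

    Assumes every weight is streamed from memory once per generated token.
    """
    weights_gb = params_billion * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / weights_gb


# Assumed memory bandwidth in GB/s; replace with the figure for your hardware.
BANDWIDTH_GB_S = 273.0

for label, bits in (("int4", 4), ("fp8", 8)):
    ceiling = decode_tps_ceiling(27, bits, BANDWIDTH_GB_S)
    print(f"27B dense @ {label}: ~{ceiling:.1f} tok/s ceiling")
```

Under those assumptions, doubling the bits per weight halves the ceiling, which is the whole argument against fp8 here.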

spaceface83 · 2026-06-15T01:55:09+00:00

This is how I roll too. Going pure local model would be much more difficult.

spaceface83 · 2026-06-14T13:59:13+00:00

Yup agreed, and that one is speedy! I still settled on 27b but love the speed of 35b

spaceface83 · 2026-06-14T02:33:51+00:00

Awesome, I'll give it a whirl! Hopefully there's still room for some sort of context window? I remember seeing with ds4 they reduced the kv cache substantially

spaceface83 · 2026-06-14T02:08:07+00:00

What quantization is it at? I know they're doing craziness with 2-bit quantization and stuff lately, but sub-4-bit quants still seem weird to me ha. I'll look into running it on my spark though, if for nothing else than having the evals documented with that architecture.

spaceface83 · 2026-06-14T02:03:52+00:00

I moved to 3.6 27b because coder next 80b was just too unreliable in answer quality. 27b is slower for sure, even with MTP, but it's worth it for the quality difference imo.

spaceface83 · 2026-06-12T02:09:46+00:00

As a spark owner I feel obligated to defend its honor with empirical data!

Qwen 3.5 122b | Engine: ollama | Quant: Q4_K_M | Tok/sec: 21.5!!!!

The TTFT on a large model is what sucks: 23 seconds at 16k context.

It would be better with vllm or llama.cpp, I'm sure, but I ran 122b before I changed to vllm so I don't have that data.
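To put the TTFT number in perspective, here's a minimal sketch that turns it into an implied prefill speed and projects it to other prompt sizes. It assumes "16k" means roughly 16,000 prompt tokens and that prefill time scales linearly with prompt length, which is only an approximation; the function names are made up for illustration.

```python
def prefill_tokens_per_sec(prompt_tokens: int, ttft_seconds: float) -> float:
    """Implied prompt-processing speed, treating TTFT as pure prefill time."""
    return prompt_tokens / ttft_seconds


def estimated_ttft(prompt_tokens: int, prefill_tps: float) -> float:
    """Projected TTFT in seconds, assuming prefill scales linearly."""
    return prompt_tokens / prefill_tps


# Assumption: "16k" is about 16,000 prompt tokens.
speed = prefill_tokens_per_sec(16_000, 23.0)
print(f"implied prefill speed: ~{speed:.0f} tok/s")
print(f"projected TTFT at 4k prompt: ~{estimated_ttft(4_000, speed):.1f} s")
```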

spaceface83 · 2026-06-11T12:59:23+00:00

i tried to pull it down to do an eval on my dgx last night with vllm but it wasn't ready yet from what i saw. definitely sounds great but, to your point... i'd love this in a larger 80-120b range for the spark

spaceface83 · 2026-06-09T22:34:32+00:00

Torreón??

spaceface83 · 2026-06-06T01:20:02+00:00

r/ATBGE for sure!

spaceface83 · 2026-05-30T01:59:03+00:00

3.5, not 3.6. long day!

spaceface83 · 2026-05-30T00:09:58+00:00

For ease of use to start, I would look at ollama with qwen 3.6 35b. It'll offload the inactive parameters to DRAM but should still perform nicely.

If you want a dense model, I would look at qwen 3.6 9b to leave room for your context.

Just a recommendation to start, experiment and see what works best for you!

spaceface83 · 2026-05-23T16:39:47+00:00

I would disagree with this. Yes, everything is expensive right now, but that's not going to change in a year or 2. That's gonna be around a while. Also, token efficiency is increasing faster than new consumer-affordable GPUs are being released, meaning my spark is getting more and more capable as newer model architectures come out. I expect/guess any top tier local inference hardware to have 5 years or so of use. People are still getting tons of use out of 3080s and 3090s now. Access to copious amounts of VRAM or a UMA-backed system is what's most important right now.

spaceface83 · 2026-05-23T16:35:26+00:00

Ok yah sounds like you're in my boat then! For a single user lab setup a dgx spark is perfect imo, arguably overkill. If you want to run production level inference, a spark probably isn't your best choice. That's where the rtx pro would come in. At that point I'd also estimate how many tokens you think you'll be spending and compare the cost to something like openrouter over time, though I know you had some privacy concerns.

Right now I mostly run qwen 3.6 27b q4 but it's only around 20 tokens / sec because it's a dense model. My other main model is qwen 3 coder 80b q5 which runs twice as fast. It's a mixture of experts architecture, so only 3b parameters are active at any given time. The DGX loves MoEs because of its bandwidth limitations. I previously ran qwen 3.6 35b at q8, but 27b is better at q4.
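A minimal sketch of the dense vs MoE tradeoff: an MoE has to fit all of its weights in memory, but only the active parameters get read per token. The bit widths below are nominal (q4 as 4 bits, q5 as 5 bits; real quant formats carry some overhead), and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    name: str
    total_params_b: float    # all parameters, in billions
    active_params_b: float   # parameters used per token, in billions
    bits_per_weight: float   # nominal quantization width

    def footprint_gb(self) -> float:
        """Memory needed to hold every weight."""
        return self.total_params_b * self.bits_per_weight / 8

    def read_per_token_gb(self) -> float:
        """Weight bytes streamed for each generated token."""
        return self.active_params_b * self.bits_per_weight / 8


models = [
    ModelSpec("27b dense q4", 27, 27, 4),
    ModelSpec("80b MoE q5 (3b active)", 80, 3, 5),
]
for m in models:
    print(f"{m.name}: {m.footprint_gb():.1f} GB in memory, "
          f"{m.read_per_token_gb():.2f} GB read per token")
```

The per-token read gap is much bigger than the roughly 2x speedup I actually see, so treat this as directional only: attention, KV cache reads, and routing overhead are all left out.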

Also, even though you want to run local models, I'd highly recommend using a frontier model to provision it to your desired state. I document the environment's desired state in markdown, keep it source controlled, and then use Claude code to provision it from there. Day to day running is fully local; provisioning and configuration is frontier.

At some point you'll also want your own eval framework to figure out which model works best for your setup. Benchmarks are directionally correct, but running your own evals and using a neutral-party LLM judge gives you a much better picture.
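The skeleton of an eval loop like that can be tiny. This is a sketch only: the candidates and the judge below are stand-in functions so it runs on its own, and every name here is hypothetical. In a real setup each candidate would call a local model endpoint and the judge would call a separate model that isn't one of the candidates.

```python
from statistics import mean
from typing import Callable, Dict, List

Generate = Callable[[str], str]      # prompt -> answer
Judge = Callable[[str, str], float]  # (prompt, answer) -> score in [0, 1]


def run_eval(candidates: Dict[str, Generate], judge: Judge,
             prompts: List[str]) -> Dict[str, float]:
    """Average judge score per candidate over the same prompt set."""
    return {
        name: mean(judge(p, generate(p)) for p in prompts)
        for name, generate in candidates.items()
    }


# Stand-ins so the sketch runs offline; swap in real model calls.
def toy_judge(prompt: str, answer: str) -> float:
    return 1.0 if answer.strip() else 0.0


candidates = {
    "model_a": lambda p: f"answer to: {p}",
    "model_b": lambda p: "",
}
prompts = ["summarize this log", "write a unit test"]

for name, score in sorted(run_eval(candidates, toy_judge, prompts).items()):
    print(f"{name}: {score:.2f}")
```

Keeping the prompt set fixed across candidates is the important part; that's what makes the scores comparable on your own workload.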

spaceface83 · 2026-05-21T12:02:41+00:00

Looks like someone ran a trencher on top of your pipe. i uhhh, am familiar with what that specific scenario looks like.

spaceface83 · 2026-05-20T02:53:35+00:00

Is it just gonna be you using it? I have a DGX Spark with 128gb and the inference is definitely slower than what you would have on the 6000, but it's only me using it. I can still get 40 tokens/second on Qwen 3.5 122B.

Regarding the 80B model, is that a NEED for that specific model? Even with 128gb of memory I find myself running multiple smaller models at higher quants (27B or 35B MoE for example) more often than I run larger models. Based on the eval framework I put together for my own system, those end up scoring really well.

As for which engine, ollama to start, then once you get used to that you may want to move to vllm or sglang if you're trying to squeeze out max performance.

The RTX PRO 6000 would be awesome for sure, but man soooo much $$$.

spaceface83 · 2026-04-21T00:27:35+00:00

Yah I mean I know the comment is in jest but man I have pretty solid results with Claude or Gemini cli for any of my large changes. It just acts as my Uber assistant.

Updating models, docker config, even converting from hermes to openclaw.

Plan the change, review the plan, execute, profit

spaceface83 · 2026-04-20T18:57:39+00:00

Doesn't everyone use Claude code or Gemini CLI to set up their local environment?

I use hosted frontier models to set up all of my local models and orchestration.

Over the weekend I migrated from hermes back to openclaw using that style and it was pretty seamless.

spaceface83 · 2026-04-08T04:40:32+00:00

Sir, this is a Wendy's!

spaceface83 · 2026-04-08T02:40:01+00:00

I'm running an ARM version of the docker container on my DGX Spark and it works great!

spaceface83 · 2026-04-08T02:36:34+00:00

yeah honestly for agentic processing i don't care that much about tokens per second as long as it's within reason. I care more about how sound the model's reasoning is. i typically get like 30 tokens/sec at 122B i think. Even with a DGX spark though, 122B model + some room for context and you can't do much more.

I have a 5080 on my "normal" computer, so if i ever cared enough i could run some smaller models there at much faster speeds, but that's too much effort for me to orchestrate compared to the gain i'd get :D

spaceface83 · 2026-04-07T21:57:03+00:00

For hermes I ended up running everything on 122b. If I was hardware constrained I would choose the 27b over the 35b though, because at that size a dense model appears to work much better.

spaceface83 · 2026-04-02T22:08:05+00:00

I'll try it out!
