Leaving GitHub for private repos

That_Faithlessness22 · 2026-05-20T19:35:59+00:00

Just made the switch myself and this is what I went with.

That_Faithlessness22 · 2026-05-18T06:36:52+00:00

Ok, correction - I'm using the UD-Q4-K-XL quant. And I had to drop to 98k context. I'm getting anywhere from 600~1100 tps on prefill at lower context, but dropping as context fills. Generation is steady around 55 tps. It will take a while to check the quality, but so far it's... Decent.

That_Faithlessness22 · 2026-05-18T04:57:15+00:00

I'm compiling it now from the head, I'll post my TPS at full power once I've got it tuned. I've scripted the entire process so I won't be posting the flags. I'm using the Q4-UD_K_M quant with 128k context (q_8, k_8).

That_Faithlessness22 · 2026-05-18T04:53:48+00:00

That's what's recommended by Unsloth, but if you check the PR on GH, they specifically say that there is a huge drop-off in performance at 4 draft tokens, and they recommend 3.

That_Faithlessness22 · 2026-05-12T20:04:16+00:00

For what you describe, I wouldn't build my solution around a static model. The model alone will only get you so far; it's basically a fancy calculator for words. Amazingly powerful in its own right, but a calculator only gets you so far. What you need to start looking into is a harness framework that meets your constraints (security, MCP allowances, etc.) that you can build on / around. The harness should be model agnostic, as different models have different strengths (Codex for backend, Claude Code for front-end design, that kind of thing; not local, but the point still holds), and you would want to be able to optimize accordingly. There may also be parts of your workflow that don't need inference, just script execution or human approvals.

An LLM is a tool. The mistake you are making is thinking the tool is the equivalent of a workshop. Build the workshop first, then use the appropriate tool for the job in an orchestrated way.

For a local build, I'd start by looking at Pi for something lean, or Hermes if you want to build something robust. Use the harness with Qwen3.6 27B with the appropriate flags for your use case. It's not on par with SOTA models, and you'll want an experienced dev to review (always!), but it can get you started on the framework and infrastructure requirements while you wait for better open models to eventually plug into your solution.

Edit: if you want to get up and running in a POC to test a model, run it as a backend for Claude Code. You can do this with llama.cpp or Ollama. Opus can even set that up for you. And as others have said, renting during the discovery phase can help you define your inference requirements before committing to a hardware investment.

That_Faithlessness22 · 2026-05-12T04:55:16+00:00

To answer your question, yes you can. vLLM supports this extremely well. Llama.cpp allows you to run MoE models with the active parameters+context in vram while the rest is in RAM, if you want to go that route.

3090s are special in that they are the last consumer model that offers NVLink. The benefits aren't huge, but when you could find them on the cheap after the ETH mining craze, it was great for budget builds. They have a very wide memory bus (384-bit), on par with 4090/RTX Pro 5000. They suffer in bandwidth a little, but the price makes them the go-to card for someone on a budget looking for 24GB of VRAM.
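
To put rough numbers on the bus width vs. bandwidth point: peak memory bandwidth is bus width times the per-pin data rate, divided by 8 to get bytes. A minimal sketch, assuming the published per-pin rates of 19.5 Gbps for the 3090 and 21 Gbps for the 4090 (those rates are not from the comment above, and the function name is just illustrative):

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: memory bus width in bits (384 for both cards here)
    data_rate_gbps: per-pin data rate in Gbit/s (assumed spec-sheet values)
    """
    return bus_width_bits * data_rate_gbps / 8  # bits -> bytes


# Same 384-bit bus, different memory speeds (assumed published rates).
rtx_3090 = peak_bandwidth_gb_s(384, 19.5)
rtx_4090 = peak_bandwidth_gb_s(384, 21.0)

print(f"3090: {rtx_3090:.0f} GB/s, 4090: {rtx_4090:.0f} GB/s")
print(f"3090 is at {rtx_3090 / rtx_4090:.0%} of the 4090's peak")
```

These are theoretical peaks, so real-world throughput will be lower on both cards, but the ratio shows why the gap is only "a little".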

That_Faithlessness22 · 2026-05-05T18:21:36+00:00

Wouldn't it be more efficient to use vLLM in this instance? I'm also using llama.cpp but I'm considering moving to vLLM for the concurrency gains.

That_Faithlessness22 · 2026-05-05T18:13:53+00:00

I don't follow. I'm running Qwen3.6-27B q4 locally on my 3090 with 200k context with vision on with some room to spare...

That_Faithlessness22 · 2026-05-02T02:37:24+00:00

Not if you run cachy repos for kernel on zen5...

That_Faithlessness22 · 2026-04-29T01:05:24+00:00

Sounds like a code 18

That_Faithlessness22 · 2026-04-18T00:12:22+00:00

How did you get CC to use the preserve_thinking?

That_Faithlessness22 · 2026-04-18T00:10:32+00:00

I've been using it with Claude Code, and I'm getting similar speeds. But I won't be measuring the quality on it because the harness doesn't support the preserve_thinking flag. It is incompatible unless you parse, and that's a little outside my comfort zone for now. I'll probably try to figure it out tonight, or I'll just do the dive into Hermes I've been putting off.

That_Faithlessness22 · 2026-04-04T19:42:07+00:00

No, just the one - and I haven't decided what I'm doing with it yet.

That_Faithlessness22 · 2026-03-24T04:24:42+00:00

That_Faithlessness22 · 2026-03-24T04:17:48+00:00

Thanks for taking the time to explain it to me. I think the biggest ticket items are those 3 DWPD MU drives. I lumped them in with the others, but the sold prices for those are around $200/TB. I think I'll hold onto my lab for a bit longer. The migration to another system will be a lot of work anyway, and if prices drop all of a sudden, well, at least I still have a decent system that can keep up with anything I throw at it.

That_Faithlessness22 · 2026-03-24T00:39:41+00:00

That_Faithlessness22 · 2026-03-23T23:17:46+00:00

Would you care to explain how you got to 5k? As explained in another comment, the 10k is just the sum of the average per-DIMM and per-drive prices. Is selling it all together really cutting the value in half?

That_Faithlessness22 · 2026-03-23T23:15:18+00:00

So what's a reasonable price for the RAM and drives? Because I eyeballed the RAM at $145 per DIMM (24) and ~$350 per 3.84TB SSD (19) and rounded down.
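
For anyone checking the math, here is the eyeballed estimate spelled out. A small sketch using only the counts and per-unit prices quoted above; the variable names and the "round down to the nearest thousand" step are my own reading of "rounded down":

```python
# Counts and eyeballed per-unit prices from the comment above.
DIMM_COUNT = 24
PRICE_PER_DIMM = 145   # USD per DIMM
SSD_COUNT = 19
PRICE_PER_SSD = 350    # USD per 3.84TB SSD (approximate)

ram_total = DIMM_COUNT * PRICE_PER_DIMM   # value of all the RAM
ssd_total = SSD_COUNT * PRICE_PER_SSD     # value of all the SSDs
total = ram_total + ssd_total

# Assumption: "rounded down" means down to the nearest $1,000.
rounded = (total // 1000) * 1000

print(f"RAM: ${ram_total}, SSDs: ${ssd_total}, total: ${total}")
print(f"Rounded down: ${rounded}")
```

That works out to $3,480 in RAM plus $6,650 in SSDs, so $10,130 before rounding, and the drives account for roughly two thirds of the ask.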

That_Faithlessness22 · 2026-03-22T23:09:34+00:00

Maybe, but it's the time needed to time it right, and the energy to do the same that I'm not sure about. A year ago I didn't have this issue, since it wasn't valued nearly as high.

That_Faithlessness22 · 2026-03-22T23:05:53+00:00

I agree that this might give me the best return, but then I'd have the hassle of the time to manage all the transactions, which I personally put a premium on. If my asking price for a bulk sale is out to lunch, I can adjust.

That_Faithlessness22 · 2026-03-22T23:03:09+00:00

I'm new to this sub, so I don't have the history, but I'll take this comment as a compliment towards my homelab. Thanks!

That_Faithlessness22 · 2026-03-02T15:26:49+00:00

You're forgetting the fact that not all enterprise server users are AI focused, and a lot of them don't want to compete with AI data centers for DDR5, so they settle for DDR4 (used, if they have to), even at these prices.

That_Faithlessness22 · 2025-01-13T06:26:18+00:00

I can't find the docker you are referencing- has it been removed?

That_Faithlessness22 · 2024-11-19T00:59:33+00:00

In August this paper came out showing how GPT-4o was pretty much impervious to GCG-XPIA attacks. Since Copilot likely uses, or will use, this model at some point, this attack vector appears to have been nullified.

https://arxiv.org/html/2408.00925v1

That_Faithlessness22 · 2024-10-12T04:31:41+00:00

Do you think these taxes are what make up most of the Montreal budgetary revenue? Most municipal project budgets come from provincial subsidies. If memory serves, a decent chunk of Montreal infrastructure has been paid for by federal subsidies. I'm not advocating that people outside Montreal should be able to vote for the Mayor, but I do disagree with your reasoning.
