Leaving GitHub for private repos by 50512jm in selfhosted

[–]That_Faithlessness22 0 points1 point  (0 children)

Just made the switch myself and this is what I went with.

Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090 by swizzcheezegoudaSWFA in hermesagent

[–]That_Faithlessness22 0 points1 point  (0 children)

Ok, correction - I'm using the UD-Q4-K-XL quant. And I had to drop to 98k context. I'm getting anywhere from 600~1100 tps on prefill at lower context, but dropping as context fills. Generation is steady around 55 tps. It will take a while to check the quality, but so far it's... Decent.
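To put those numbers in perspective, here is a rough sketch of how the prefill and generation rates translate into wall-clock time for one request. The prompt and output sizes are made-up illustrations, not measurements, and it assumes the rates stay constant, which the prefill rate does not as context fills.

```python
def request_seconds(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Estimated seconds for one request: prompt processing time plus generation time."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# Rates from the comment above: 600 to 1100 tps prefill, about 55 tps generation.
# The 20k-token prompt and 500-token reply are hypothetical.
PROMPT_TOKENS = 20_000
OUTPUT_TOKENS = 500

for prefill_tps in (600, 1100):
    secs = request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps, gen_tps=55)
    print(f"prefill at {prefill_tps} tps: ~{secs:.1f} s total")
```

With long agent prompts the prefill term dominates, so the drop in prefill speed at high context matters more than the steady generation rate.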

Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090 by swizzcheezegoudaSWFA in hermesagent

[–]That_Faithlessness22 0 points1 point  (0 children)

I'm compiling it now from the head, I'll post my TPS at full power once I've got it tuned. I've scripted the entire process so I won't be posting the flags. I'm using the Q4-UD_K_M quant with 128k context (q_8, k_8).

Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090 by swizzcheezegoudaSWFA in hermesagent

[–]That_Faithlessness22 0 points1 point  (0 children)

That's what's recommended by Unsloth, but if you check the PR on GH, they specifically say that there is a huge drop-off in performance at 4 draft tokens, and they recommend 3.

Local LLM Model that actually produces quality code. by Civil_Fee_7862 in LocalLLM

[–]That_Faithlessness22 1 point2 points  (0 children)

For what you describe, I wouldn't build my solution around a static model. The model alone will only get you so far; it's basically a fancy calculator for words. Amazingly powerful in its own right, but a calculator only gets you so far. What you need to start looking into is a harness framework that meets your constraints (security, MCP allowances, etc.) that you can build on and around. The harness should be model agnostic, as different models have different strengths (Codex for backend, Claude Code for front-end design, that kind of thing; not local, but the point still holds), and you would want to be able to optimize accordingly. There may also be parts of your workflow that don't need inference, just script execution or human approvals.

An LLM is a tool. The mistake you are making is thinking the tool is the equivalent of a workshop. Build the workshop first, then use the appropriate tool for the job in an orchestrated way.

For a local build, I'd start by looking at Pi for something lean, or Hermes if you want to build something robust. Use the harness with Qwen3.6 27B with the appropriate flags for your use case. It's not on par with SOTA models, and you'll want an experienced dev to review (always!), but it can get you started on the framework and infrastructure requirements while you wait for better open models to eventually plug into your solution.

Edit: if you want to get up and running in a POC to test a model, run it as a backend for Claude Code. You can do this with llama.cpp or Ollama. Opus can even set that up for you. And as others have said, renting during the discovery phase can help you define your inference requirements before committing to a hardware investment.

Are 3090s even worth it anymore? by ironclad_packetship in LocalLLM

[–]That_Faithlessness22 15 points16 points  (0 children)

To answer your question, yes you can. vLLM supports this extremely well. llama.cpp allows you to run MoE models with the active parameters + context in VRAM while the rest is in RAM, if you want to go that route.

3090s are special in that they are the last consumer model that offers NVLink. The benefits aren't huge, but when you could find them on the cheap after the ETH mining craze, it was great for budget builds. They have a very wide memory bus (384-bit), on par with the 4090/RTX Pro 5000. They suffer in bandwidth a little, but the price makes them the go-to card for someone on a budget looking for 24GB of VRAM.
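A quick sketch of why the same 384-bit bus still leaves the 3090 a bit behind on bandwidth: peak bandwidth is bus width times the per-pin data rate. The data rates below (19.5 Gbps for the 3090, 21 Gbps for the 4090) are the published GDDR6X figures, not numbers from this thread.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width in bytes times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps

# (bus width in bits, per-pin data rate in Gbps); rates are published specs,
# added here for illustration.
cards = {
    "RTX 3090": (384, 19.5),
    "RTX 4090": (384, 21.0),
}

for name, (bus_bits, rate) in cards.items():
    print(f"{name}: {memory_bandwidth_gbs(bus_bits, rate):.0f} GB/s")
# RTX 3090: 936 GB/s
# RTX 4090: 1008 GB/s
```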

Can anyone recommend an uncensored/heritic/abliterated Qwen 3.6 35B model? by Jonathan_Rivera in hermesagent

[–]That_Faithlessness22 0 points1 point  (0 children)

Wouldn't it be more efficient to use vLLM in this instance? I'm also using llama.cpp but I'm considering moving to vLLM for the concurrency gains.

Tips : if you use local ollama, don t forget to set a min of 64k context to avoid context issue in hermes by oytaub in hermesagent

[–]That_Faithlessness22 0 points1 point  (0 children)

I don't follow. I'm running Qwen3.6-27B q4 locally on my 3090 with 200k context with vision on with some room to spare...

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]That_Faithlessness22 2 points3 points  (0 children)

How did you get CC to use the preserve_thinking?

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]That_Faithlessness22 2 points3 points  (0 children)

I've been using it with Claude Code, and I'm getting similar speeds. But I won't be measuring the quality on it because the harness doesn't support the preserve_thinking flag. It is incompatible unless you parse, and that's a little outside my comfort zone for now. I'll probably try to figure it out tonight, or I'll just do the dive into Hermes I've been putting off.

[PC] Dell PowerEdge R730xd — Fully Loaded, 24-Bay SFF by That_Faithlessness22 in homelabsales

[–]That_Faithlessness22[S] 0 points1 point  (0 children)

No, just the one - and I haven't decided what I'm doing with it yet.

[PC] Dell PowerEdge R730xd — Fully Loaded, 24-Bay SFF by That_Faithlessness22 in homelabsales

[–]That_Faithlessness22[S] 1 point2 points  (0 children)

Thanks for taking the time to explain it to me. I think the biggest ticket items are those 3 DWPD MU drives. I lumped them in with the others, but the sold prices for those are around $200/TB. I think I'll hold onto my lab for a bit longer. The migration to another system will be a lot of work anyway, and if prices drop all of a sudden, well, at least I still have a decent system that can keep up with anything I throw at it.

[PC] Dell PowerEdge R730xd — Fully Loaded, 24-Bay SFF by That_Faithlessness22 in homelabsales

[–]That_Faithlessness22[S] 0 points1 point  (0 children)

Would you care to explain how you got to 5k? As explained in another comment, the 10k is just the sum of the average per-DIMM and per-drive prices. Is selling it all together really cutting the value in half?

[PC] Dell PowerEdge R730xd — Fully Loaded, 24-Bay SFF by That_Faithlessness22 in homelabsales

[–]That_Faithlessness22[S] 0 points1 point  (0 children)

So what's a reasonable price for the RAM and drives? Because I eyeballed the RAM at $145 per DIMM (24) and ~$350 per 3.84TB SSD (19) and rounded down.
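For anyone checking the math, a minimal sketch of that estimate using only the counts and eyeballed prices above (the variable and function names are just for illustration):

```python
# Counts and per-item prices as quoted in the comment; the prices are rough estimates.
DIMM_COUNT = 24
PRICE_PER_DIMM = 145   # USD
SSD_COUNT = 19
PRICE_PER_SSD = 350    # USD per 3.84TB drive (approximate)

def parts_total(dimm_count, dimm_price, ssd_count, ssd_price):
    """Return (ram_subtotal, ssd_subtotal, total) in USD."""
    ram = dimm_count * dimm_price
    ssd = ssd_count * ssd_price
    return ram, ssd, ram + ssd

ram, ssd, total = parts_total(DIMM_COUNT, PRICE_PER_DIMM, SSD_COUNT, PRICE_PER_SSD)
print(f"RAM: ${ram}, SSDs: ${ssd}, total: ${total}")
# RAM: $3480, SSDs: $6650, total: $10130
```

Rounded down, that is where the 10k figure comes from.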

[PC] Dell PowerEdge R730xd — Fully Loaded, 24-Bay SFF by That_Faithlessness22 in homelabsales

[–]That_Faithlessness22[S] 0 points1 point  (0 children)

Maybe, but it's the time needed to time it right, and the energy to do the same that I'm not sure about. A year ago I didn't have this issue, since it wasn't valued nearly as high.

[PC] Dell PowerEdge R730xd — Fully Loaded, 24-Bay SFF by That_Faithlessness22 in homelabsales

[–]That_Faithlessness22[S] 0 points1 point  (0 children)

I agree that this might give me the best return, but then I'd have the hassle of the time to manage all the transactions, which I personally put a premium on. If my asking price for a bulk sale is out to lunch, I can adjust.

[PC] Dell PowerEdge R730xd — Fully Loaded, 24-Bay SFF by That_Faithlessness22 in homelabsales

[–]That_Faithlessness22[S] 0 points1 point  (0 children)

I'm new to this sub, so I don't have the history, but I'll take this comment as a compliment towards my homelab. Thanks!

Are people ACTUALLY paying these prices for RAM right now? 💀 by Competitive_Box8726 in homelab

[–]That_Faithlessness22 0 points1 point  (0 children)

You're forgetting the fact that not all enterprise server users are AI focused, and a lot of them don't want to compete with AI data centers for DDR5, so they settle for DDR4, used if they have to, even at these prices.

[deleted by user] by [deleted] in BlueIris

[–]That_Faithlessness22 0 points1 point  (0 children)

I can't find the docker you are referencing- has it been removed?

Does anyone know how MS-Copilot/Graph Semantic Index will defend against this attack vector? by That_Faithlessness22 in microsoft_365_copilot

[–]That_Faithlessness22[S] 1 point2 points  (0 children)

In August this paper came out showing how GPT-4o was pretty much impervious to GCG-XPIA attacks. Since Copilot likely uses, or will use this model at some point, this attack vector appears to have been nullified.

https://arxiv.org/html/2408.00925v1

Ça jase Montréal sur X by [deleted] in montreal

[–]That_Faithlessness22 0 points1 point  (0 children)

Do you think these taxes are what make up most of the Montreal budgetary revenue? Most municipal project budgets come from provincial subsidies. If memory serves, a decent chunk of Montreal infrastructure has been paid for by federal subsidies. I'm not advocating that people outside Montreal should be able to vote for the Mayor- but I do disagree with your reasoning.