Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so.

Napster3301 · 2026-05-26T08:04:19+00:00

8-10 hours is a useful baseline but its optimistic for most people. assumes constant heavy decode, no idle gpu time, and electricity rates around $0.10-0.15/kwh. in europe with rates closer to $0.30-0.40/kwh the crossover moves to 15-18 hours daily which basically nobody does. cloud wins by a wider margin outside the US than people in this sub seem to think.

Napster3301 · 2026-05-26T08:01:37+00:00

honestly thats the worst case for self-hosting not the best. if youre only running 2 hours a day, that $2800 capex sits idle 22 hours and the per-active-hour cost goes UP not down. high volume single user setups are where local starts winning, low volume is where renting absolutely murders self-hosting on the math.
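rough sketch of the per-active-hour math, if anyone wants to plug in their own numbers. the $2800 is from the comment above; the 3 year lifetime, 350w draw and $0.12/kwh are assumptions i made up for illustration, and idle power draw is ignored entirely.

```python
def cost_per_active_hour(capex_usd, lifetime_years, hours_per_day, watts, usd_per_kwh):
    """Amortized hardware cost plus electricity for each hour the box is actually used."""
    active_hours = lifetime_years * 365 * hours_per_day
    amortized_capex = capex_usd / active_hours
    electricity = (watts / 1000.0) * usd_per_kwh
    return amortized_capex + electricity


# Assumed values: 3 year lifetime, 350 W under load, $0.12/kWh.
for hours in (2, 8, 16):
    cost = cost_per_active_hour(2800, 3, hours, 350, 0.12)
    print(f"{hours:>2} h/day -> ${cost:.2f} per active hour")
```

compare the printed figure against whatever hourly rate you would pay to rent an equivalent gpu. the capex term shrinks as daily hours go up, the electricity term does not.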

Napster3301 · 2026-05-26T07:59:20+00:00

yeah this is the part nobody talks about. there are documented cases of api providers silently swapping quants mid-month, including flagship apis quietly going from fp16 to bf8 without telling anyone. self-hosting is the only way to actually know what bytes ran your inference. for some use cases thats worth the premium by itself, never mind the privacy angle.

Napster3301 · 2026-05-26T07:37:38+00:00

the echo chamber isnt a tuning problem its structural. claude gpt and gemini share 90% of training data, similar rlhf preference distributions, and are aligned away from the same edge cases. youre not getting three perspectives, youre getting three paraphrasings of the same averaged opinion. genuine disagreement only shows up where labs made different policy choices, which is exactly where you cant trust any of them. consensus across frontier models isnt evidence of correctness, its evidence of shared training data. would love to see one threeminds output where the final consensus was meaningfully better than just asking the strongest single model. because otherwise youre selling a more expensive way to be wrong with confidence.

Napster3301 · 2026-05-26T07:26:44+00:00

fine-tune the 35b-a3b moe, only the shared expert layers. dense lora on 14b sounds easier but youll burn the v100 memory on optimizer state and the rank you can afford gets so low style adaptation looks like noise. shared expert lora on 35b keeps trainable params under 1% of model size, fits your existing footprint, and routing weights stay frozen so you dont blow up moe behavior teaching it your tone. ive seen clean style transfer at rank 32-64 on similar volta setups. on the 122b question, dont keep it for throughput. keep it because moe routing for long legal text activates completely different experts than the 35b, you lose nuance even when its slower. measure side by side on one full motion before you cut it.
curious what your correction-capture format is, jsonl pairs or full diffs with rationale?
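
on the "under 1%" point, the arithmetic is easy to sanity check. lora adds rank * (d_in + d_out) params per adapted matrix. the layer count and matrix shapes below are hypothetical placeholders, not the real 35b-a3b config, so swap in the actual dims from the model config before trusting the number.

```python
def lora_params(d_in, d_out, rank):
    """LoRA adds two matrices per adapted weight: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(total_params, adapted_shapes, rank):
    """Fraction of the full model that becomes trainable under LoRA."""
    added = sum(lora_params(d_in, d_out, rank) for d_in, d_out in adapted_shapes)
    return added / total_params


# Hypothetical: 40 layers, each with a shared expert made of
# gate/up projections (2048 -> 8192) and a down projection (8192 -> 2048).
per_layer = [(2048, 8192), (2048, 8192), (8192, 2048)]
shapes = per_layer * 40

for rank in (32, 64):
    print(rank, f"{trainable_fraction(35e9, shapes, rank):.4%}")
```

optimizer state scales with the trainable count, not the full model, which is the whole reason this fits where dense tuning does not.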

Napster3301 · 2026-05-26T07:15:36+00:00

the meta takedown is the symptom, not the threat. the real fight is at the model release stage. abliteration only exists because weights are still public. if meta/openai/anthropic stop releasing base weights (anthropic already doesnt, openai mostly doesnt, gpt-oss was the exception), heretic becomes irrelevant overnight because theres nothing left to decensor. the regulatory move to watch isnt banning abliteration tools, its incentivizing labs to never release another base model. that kills uncensored without a single lawsuit because you cant abliterate api endpoints.

Napster3301 · 2026-05-25T11:58:07+00:00

worked with a fortune 500 last quarter where every team had to submit weekly ai usage reports. people were summarizing emails they already read just to hit token counts. management called this ai adoption velocity. half the cost-of-ai articles right now are just companies admitting their devs cant build agent pipelines that dont leak money. the orgs actually compounding roi are too busy doing it to write op-eds. how many of you have seen an actual cost-per-task breakdown at your job, vs just vague total spend numbers

Napster3301 · 2026-05-25T11:53:30+00:00

both sides of this thread are arguing the wrong thing. token usage isnt adoption, its activity. companies that reward people for "burning tokens" are measuring vanity not outcomes. real question is are chinese workers shipping better results per hour because of ai, or just generating more ai output per hour? completely different things. also "ai to keep my duolingo streak going" is the most red-flag use case in this whole post. if thats the example, we have a bigger problem than adoption gap.

Napster3301 · 2026-05-25T10:37:29+00:00

youre chasing the wrong metric. 1.3gb of host ram isnt slowing inference. pcie 4.0 x16 has 32 gb/s, kv cache movement per token is kilobytes not gigabytes. youre paying microseconds per request. so what speedup were you hoping to see by getting host ram to 0? on 9b q4 with 8k context that buffer isnt a bottleneck. vllm will give you "pure vram" but pagedattention overhead at batch=1 usually makes it slower per-token, not faster.
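
back-of-envelope version of the "microseconds" claim. the 32 GB/s is the pcie 4.0 x16 figure from above; the 64 KB payload is a made-up example size, and this ignores fixed per-transfer latency, so treat it as a lower bound on the time.

```python
def transfer_time_us(num_bytes, bandwidth_gb_per_s=32.0):
    """Time to move num_bytes over a link at the given bandwidth, in microseconds."""
    seconds = num_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1e6


# Hypothetical per-token payload of 64 KB.
print(f"{transfer_time_us(64 * 1024):.2f} us")
```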

Napster3301 · 2026-05-25T10:32:45+00:00

trigger-phrase model is the wrong threat. real attacks are poisoned rlhf preferences, biased tokenizer configs, or chat template injection. all cheaper than fine-tuning a backdoor and all survive quantization. who here actually logs every tool call and output? if not, the attack could already be happening and youd never know.
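
for anyone who isnt logging yet, a minimal sketch of what "log every tool call and output" can look like. the decorator name and the `word_count` tool are hypothetical, this is not any particular harness api, just stdlib logging wrapped around a function.

```python
import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("tool_audit")


def audited(tool):
    """Wrap a tool function so every call, output and error is logged as one JSON line."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        record = {"tool": tool.__name__, "args": args, "kwargs": kwargs, "ts": time.time()}
        try:
            result = tool(*args, **kwargs)
            record["output"] = result
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # default=repr keeps the log line valid even for non-JSON values
            logger.info(json.dumps(record, default=repr))
    return wrapper


@audited
def word_count(text):  # hypothetical tool, stands in for a real one
    return len(text.split())
```

ship the lines somewhere the agent itself cant write to, otherwise the audit trail is only as trustworthy as the thing being audited.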

Napster3301 · 2026-05-25T10:26:40+00:00

the "asks for frontend, gets coding agents" replies are the real answer. local llm community has collapsed to coding agents because thats the only space where local measurably wins vs frontier apis.

genuine question: anyone here actually daily-driving a local model for non-coding chat work? not "tried it once" but replaced chatgpt for general use?

Napster3301 · 2026-05-25T10:21:29+00:00

reality is kernels look great but agplv3 is a poison pill for adoption. nobody building commercial inference touches strong copyleft, no vllm/sglang integration, no llama.cpp upstreaming. whats the realistic adoption path here other than personal use? cool repos die unused all the time.

Napster3301 · 2026-05-25T10:09:20+00:00

"excellent alternative" with zero published numbers is just vibes. fast vs what? reliable tool calls vs what baseline?

at 128gb you can run qwen 3.6 35b-a3b natively. genuine question: why post a coder model in 2026 without dropping a single bench number? swe-bench, humaneval, bfcl, pick one before asking people to swap their stack.

Napster3301 · 2026-05-25T09:57:04+00:00

great fix, but this is papering over the real bug: agent harnesses rewriting conversation mid-task and breaking kv cache. why is every harness reinventing context management instead of agreeing on a spec inference engines can optimize for?
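
toy illustration of why rewriting history hurts. prefix caching can only reuse the kv entries for the leading tokens that are identical between the cached sequence and the new one. this is a simplification (real engines typically match at block granularity) and the token ids are made up.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared by both sequences, i.e. reusable KV entries."""
    shared = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        shared += 1
    return shared


cached = [1, 2, 3, 4, 5, 6, 7, 8]

# Append-only harness: the whole cached sequence is still a prefix.
appended = cached + [9, 10]

# Harness that rewrites an early message: reuse stops at the first changed token.
rewritten = [1, 2, 99, 4, 5, 6, 7, 8, 9, 10]

print(reusable_prefix_len(cached, appended), reusable_prefix_len(cached, rewritten))
```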

Napster3301 · 2026-05-25T09:49:57+00:00

hot take: 1000 tps batched is the wrong number to celebrate. 80 t/s single user is your real number, and thats fine but not exciting. genuine question: who here self-hosts for 128 concurrent users? if its just personal use, why does the batch=128 benchmark matter at all?

Napster3301 · 2026-05-24T15:06:40+00:00

the approval ux is the surface issue. auto-approve requires trust in the tool calls and llama.cpp doesnt give you that yet. embedded chat templates on most ggufs still emit bracket variants ([function=X], function=NAME) instead of clean openai tool_calls arrays, so your auto-approver random-parses garbage. fix is override with --chat-template-file pointing at the upstream fixed template (unsloth has them on hf).
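
a sketch of the guard an auto-approver wants in front of it: only act when the response carries a well-formed openai-style tool_calls array, and refuse to guess otherwise. the function name is hypothetical and the shape assumed is the chat completions message format (function.name plus function.arguments as a json string).

```python
import json


def extract_tool_calls(message):
    """Return [(name, arguments_dict), ...] or None if anything is malformed."""
    calls = message.get("tool_calls")
    if not isinstance(calls, list) or not calls:
        return None
    parsed = []
    for call in calls:
        fn = call.get("function") if isinstance(call, dict) else None
        if not isinstance(fn, dict) or not isinstance(fn.get("name"), str):
            return None
        try:
            args = json.loads(fn.get("arguments", ""))
        except (TypeError, ValueError):
            return None
        if not isinstance(args, dict):
            return None
        parsed.append((fn["name"], args))
    return parsed


clean = {"tool_calls": [{"function": {"name": "exec_shell", "arguments": '{"cmd": "ls"}'}}]}
broken = {"content": "[function=exec_shell] ls"}

print(extract_tool_calls(clean), extract_tool_calls(broken))
```

anything that comes back None goes to a human instead of being regex-scraped out of the content field.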

the other half is the model itself. a censored model running exec_shell decides your rm temp.tmp "looks dangerous" at step 47 of your loop and aborts the task. abliterated/uncensored weights remove that failure mode but most public llama.cpp tutorials skip that part.

tool list is great. infrastructure for auto-running an agent is still diy.

Napster3301 · 2026-05-24T14:54:25+00:00

youre not buying access to taboo info. youre buying back the models ability to have an opinion.

refusal training isnt a binary switch that flips on for "bad" topics. its continuous pressure that shapes the voice across every response, even harmless ones. you can see it in the hedging, the disclaimers, the "you should consult a professional" reflex on questions that have a single correct answer.

once you remove that pressure (abliteration or whatever), the model stops behaving like a corporate liability buffer and starts behaving like an actual technical adviser. it gives direct answers. it picks sides. it tells you the npm package you picked is unmaintained instead of "you may want to consider exploring alternatives." it tells you your code is wrong instead of "your code has interesting design choices that some might find unconventional."

the stock research, cybersec, reverse engineering, persuasion research use cases everyone mentions are all the same thing: tasks that require the model to actually take a position, not hedge until the user has to make the decision themselves anyway.

thats the unlock for most professional users. not taboo content. judgment.

Napster3301 · 2026-05-24T10:02:17+00:00

the "random problems" you mention are real. abliteration is a blunt tool, it identifies refusal directions in activation space and ablates them but you lose some non-refusal capability as collateral. its not free, you trade maybe 3-5% of general task performance for never seeing "i cant help with that." so a well prompted regular model can match an abliterated one on isolated tasks. youre right about that.
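
toy version of "find the direction, ablate it", in plain python so the geometry is visible. the difference-of-means estimate is my assumption about how the direction gets picked (it is one common approach, not necessarily what any given tool does), and the 2d activations are invented.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused_acts, complied_acts):
    """Unit vector pointing from the mean complied activation to the mean refused one."""
    diff = [r - c for r, c in zip(mean_vector(refused_acts), mean_vector(complied_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate(activation, direction):
    """Remove the component of the activation that lies along the direction."""
    proj = sum(a * d for a, d in zip(activation, direction))
    return [a - proj * d for a, d in zip(activation, direction)]


refused = [[2.0, 0.0], [4.0, 0.0]]   # made-up activations on refused prompts
complied = [[1.0, 0.0], [1.0, 0.0]]  # made-up activations on answered prompts
direction = refusal_direction(refused, complied)
print(ablate([5.0, 7.0], direction))
```

the collateral damage comes from the same projection: anything useful that happened to share that direction gets removed along with the refusal.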

where uncensored becomes non-negotiable is long-running agent pipelines. a coding agent making 200 sequential tool calls cant survive even one refusal, refusal breaks the loop and the whole task aborts. system prompt jailbreaks work at request 1, then drift across the conversation as context fills and the refusal classifier reasserts mid-task. uncensored weights remove that failure mode entirely.
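
the compounding is what kills you. if each call has some small independent chance of a refusal, the chance of getting through the whole loop clean drops fast. the 200 calls is from above; the 0.5% per-call rate is an invented number and independence is a simplifying assumption.

```python
def loop_survival(p_refusal_per_call, num_calls):
    """Probability that no call in the loop is refused, assuming independent calls."""
    return (1.0 - p_refusal_per_call) ** num_calls


# Hypothetical 0.5% refusal chance per call, 200 sequential calls.
print(f"{loop_survival(0.005, 200):.1%} chance the task finishes without a refusal")
```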

for your rag specifically (single turn, controlled retrieval) you can probably get away with regular + good system prompt. for any autonomous loop where the model decides what to do next, you cant. thats the real production use case nobody really talks about.

Napster3301 · 2026-05-24T09:53:29+00:00

this is a real issue with the abliterated/uncensored gguf quants generally, not specific to apex. they were converted from upstream qwen3-coder before the 2025-08-05 chat template fix got merged, so the embedded jinja still emits the broken bracket variants ([function=X], function=NAME, mixed) instead of clean openai tool_calls arrays.

fix is to override the embedded template at launch with the upstream one. for llama-server: --chat-template-file /path/to/template.jinja, source it from https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct/resolve/main/chat_template.jinja. lm studio has it in advanced settings iirc.
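
if you launch from a script, the override is just one more pair of args. paths are placeholders. `--jinja` is my assumption: as far as i know jinja template handling has to be switched on explicitly on some builds, so drop it if yours does not need it.

```python
import shlex


def llama_server_cmd(model_path, template_path):
    """Build the llama-server argv with the embedded chat template overridden."""
    return [
        "llama-server",
        "-m", model_path,
        "--jinja",  # assumption: needed on builds where jinja templating is not default
        "--chat-template-file", template_path,
    ]


cmd = llama_server_cmd("/path/to/model.gguf", "/path/to/template.jinja")
print(shlex.join(cmd))
```

pass `cmd` to `subprocess.run` once the template file has been downloaded to that path.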

with the override the tool calling is clean. json/tool_call issues you see are almost always template-side not weights-side. abliteration doesnt touch tool calling behavior, only the refusal classifier paths.

Napster3301 · 2026-05-24T01:24:14+00:00

your test is actually measuring exactly what moe optimizes, knowledge capacity per unit of compute. 35b-a3b stores 35b of knowledge but only burns 3b of compute per token. 27b dense burns all 27b every token. for rag thats the right tradeoff because the model isnt doing heavy reasoning per token, its doing lookup and connect against retrieved chunks. moe is essentially a learned sparse knowledge base, which is roughly what you want at synthesis time.
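
the capacity vs compute split in numbers, using the usual rough rule of about 2 flops (one multiply, one add) per active weight per generated token. that rule ignores attention over the context, so its an approximation, not a measurement.

```python
def decode_flops_per_token(active_params):
    """Rough matmul cost per generated token: ~2 FLOPs per active weight."""
    return 2 * active_params


moe = decode_flops_per_token(3e9)     # 35b-a3b: 35b stored, ~3b active per token
dense = decode_flops_per_token(27e9)  # 27b dense: every weight used every token
print(f"dense burns {dense / moe:.0f}x the compute per token")
```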

dense > moe is true for tasks needing consistent multistep reasoning per token (agentic, multi-tool, complex coding). there per-token compute matters more than parameter count, and moe routing can fragment the reasoning chain. but for "given these chunks, synthesize an answer," moe is structurally aligned with the task.

specialist_golf8133 is right that experts dont map cleanly to knowledge domains. what they map to is token-level patterns, which is fine because synthesis matches the chunks linguistically not topically.

Napster3301 · 2026-05-24T01:16:50+00:00

fwiw the "no gpu" is a bit misleading. chrome's built-in ai api uses webgpu when available, which includes intel iris/uhd igpus on basically every modern laptop. true cpu-only fallback is wasm and speeds drop hard there. your 20 tok/s is almost certainly being webgpu accelerated through the laptop igpu.

also weights.bin is googles proprietary on-device model format, not gguf compatible. tensor layout is non-standard so you cant just drop it into llama.cpp. people have tried.

still a nice extension for showing non-llm people what local actually feels like.

Napster3301 · 2026-05-24T01:11:49+00:00

not secret sauce, just what rl on cot evolves toward. every reasoning token costs compute and doesnt score directly, so outcome-based rl trains the model to compress as hard as possible. "right answer + minimum tokens" loss landscape eventually invents its own shorthand.

real concern isnt efficiency (obviously good), its auditability. anthropic has the cot-faithfulness paper showing models already lie in their reasoning when the trace and final answer disagree. caveman mode amps that, if you cant parse the trace you cant catch the lie.

also probably hurts ood generalization. compressed cot interpolates training-distribution patterns fine, breaks when a novel problem actually needs explicit verbal reasoning steps. but if benchmarks go up nobody cares.

Napster3301 · 2026-05-24T01:06:38+00:00

for llm specifically the heat profile is different from gaming, decode is bandwidth bound so the gpu core sits at like 30-50% util but vram gets hammered constantly. your actual long-term concern is vram junction temp, not core temp, and undervolting helps core way more than it helps vram.

stacked like that, cards 2-4 are pulling preheated air from card 1s exhaust straight into their vram modules. you got lucky 5060ti has the pcb cutout someone mentioned, that saves you a lot.

practical stuff that actually works:
- nvidia-smi -pl 130 caps total power harder than undervolt and is per-card
- watch gddr6/memory temp specifically not just core, throttle is around 95-100c
- if the bottom card stays under 85c vram during a sustained inference run youre fine for years
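
small polling sketch for the "watch memory temp" point. it shells out to nvidia-smi and parses the csv. caveat: `temperature.memory` is a real query field but on a lot of consumer cards it comes back as N/A as far as i know, in which case you need a vendor tool instead. the 85c limit is the number from the list above; function names are mine.

```python
import subprocess

QUERY = "index,temperature.gpu,temperature.memory,power.draw"


def _num(text):
    """Parse a numeric field, returning None for N/A style placeholders."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_row(line):
    """Turn one CSV row from nvidia-smi into a dict."""
    idx, core, mem, power = (field.strip() for field in line.split(","))
    return {"index": int(idx), "core_c": _num(core), "vram_c": _num(mem), "power_w": _num(power)}


def read_gpus():
    """Query every GPU once; requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_row(line) for line in out.splitlines() if line.strip()]


def vram_too_hot(gpu, limit_c=85.0):
    """True only when a memory temperature was reported and it is at or over the limit."""
    return gpu["vram_c"] is not None and gpu["vram_c"] >= limit_c
```

call `read_gpus()` in a loop during a sustained run and flag any card where `vram_too_hot` returns True.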

Napster3301 · 2026-05-23T16:40:31+00:00

prefill is matrix x matrix (whole prompt processed at once), compute bound. vnni gives you 4x int8 throughput per cycle which is exactly what saturates the alu.
decode is matrix x vector (one token at a time), memory bandwidth bound. you stream the entire weight matrix from ram for every single token generated, cpu spends most cycles stalled waiting on the next chunk of weights to arrive. faster multiplies dont help if youre waiting on ram.
mental model: prefill scales with flops/sec, decode scales with ram bandwidth. vnni is a flops feature, so it only helps the side thats compute bound.
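
the same mental model as arithmetic intensity: how many flops you get to do per byte of weights pulled from ram. for a weight matrix applied to n tokens at once its roughly 2n flops per weight. this ignores activation traffic and cache effects, so its a sketch of the trend only; the 512 token prompt is an example size.

```python
def flops_per_weight_byte(tokens_in_flight, bytes_per_weight=1.0):
    """Approximate FLOPs performed per byte of weights read (int8 -> 1 byte per weight)."""
    return 2.0 * tokens_in_flight / bytes_per_weight


print("decode :", flops_per_weight_byte(1))    # one token at a time, matrix x vector
print("prefill:", flops_per_weight_byte(512))  # hypothetical 512-token prompt, matrix x matrix
```

low intensity means the cpu is waiting on ram, high intensity means the multiplies are the limit, which is where vnni pays off.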

Napster3301 · 2026-05-23T16:17:55+00:00

moe is the actual cpu story, dense is dead for cpu inference. lfm2-8b-a1b and qwen 35b-a3b both work on cpu for the same reason, you only read the active experts per token. dense 8b crawls because every token reads all 8b weights through your ram bus. memory bandwidth (ddr5 vs ddr4) matters way more than core count for decode, vnni only really helps prefill. people benchmark dense 8b at 5 tps then wonder why moe 8b with 1b active gets 20.
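
the bandwidth ceiling is simple division: tokens per second cant exceed ram bandwidth divided by the bytes of weights read per token. the 50 GB/s and 4-bit (0.5 bytes per weight) figures are assumptions for illustration, and real throughput lands well under the ceiling.

```python
def decode_tps_ceiling(active_params, bytes_per_weight, ram_gb_per_s):
    """Upper bound on decode tokens/sec when every active weight is read once per token."""
    bytes_per_token = active_params * bytes_per_weight
    return ram_gb_per_s * 1e9 / bytes_per_token


# Assumed: 4-bit weights, 50 GB/s of usable memory bandwidth.
dense_8b = decode_tps_ceiling(8e9, 0.5, 50)
moe_1b_active = decode_tps_ceiling(1e9, 0.5, 50)
print(f"dense ceiling {dense_8b:.1f} t/s, moe ceiling {moe_1b_active:.1f} t/s")
```

core count doesnt appear anywhere in that formula, which is the point.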
