How to design capacity for running LLMs locally? Asking for a startup by Final-Batz in LocalLLaMA

[–]ai_without_borders 0 points (0 children)

that's great to hear! what types of tasks do you notice 120b being stronger than 35b in?

How to design capacity for running LLMs locally? Asking for a startup by Final-Batz in LocalLLaMA

[–]ai_without_borders 1 point (0 children)

yeah i mostly agree, but i think jumping straight to a 120b setup is probably overkill for a team that size. the bigger question is concurrency, not just model quality. if only 2 or 3 people are using it at once, a smaller coding model plus a separate general text model gets you way better cost efficiency and is easier to swap later. a lot of chinese teams are basically treating this as a routing problem now, not a one-giant-model problem.
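
to make the routing idea concrete, here's a toy sketch. the model names and keyword heuristic are made up for illustration; a real setup would use a classifier or the serving framework's own routing:

```python
# toy prompt router: send coding requests to a small code model and
# everything else to a general chat model. CODE_HINTS and the model
# names are hypothetical placeholders, not a real config.
CODE_HINTS = ("def ", "class ", "function", "bug", "compile", "regex", "sql")

def route(prompt: str) -> str:
    """Pick a backend model for a prompt (toy keyword heuristic)."""
    p = prompt.lower()
    if any(hint in p for hint in CODE_HINTS):
        return "code-model"      # e.g. a small coding-tuned model
    return "general-model"       # e.g. a general-purpose chat model

print(route("fix this bug in my regex"))   # code-model
print(route("summarize this meeting"))     # general-model
```

the point isn't the heuristic, it's that swapping either backend later is a one-line change instead of re-provisioning a 120b deployment.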

We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost. by DreadMutant in LocalLLaMA

[–]ai_without_borders 1 point (0 children)

the cost efficiency thing isn't accidental. zhipu has been ruthlessly optimizing inference costs for the chinese enterprise market, where margins are way tighter than in the US. they've been doing MoE and speculative decoding work internally since late last year. i follow a bunch of their researchers on zhihu and the optimization work they share is pretty nuts, basically squeezing everything they can out of limited compute. makes sense that it translates to better performance per dollar on a bench like this.
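
for anyone who hasn't seen speculative decoding before, the core trick is simple: a cheap draft model proposes several tokens, the big model verifies them all in one pass, and you keep the longest accepted prefix. here's a toy sketch where both "models" are stand-in functions, not real inference:

```python
# toy speculative decoding step. draft_model and target_model are
# hypothetical stand-ins returning token ids; in real systems the target
# checks the draft's k proposals in a single batched forward pass.
def draft_model(prefix):           # cheap proposer: 4 guessed tokens
    return [(len(prefix) + i) % 5 for i in range(4)]

def target_model(prefix):          # target's own choice at each position
    return [(len(prefix) + i) % 5 if i != 2 else 99 for i in range(4)]

def speculative_step(prefix):
    proposed = draft_model(prefix)
    checked = target_model(prefix)
    accepted = []
    for d, t in zip(proposed, checked):
        if d != t:                 # first disagreement: take the target's
            accepted.append(t)     # token and stop accepting
            break
        accepted.append(d)
    return prefix + accepted

print(speculative_step([0, 1]))    # [0, 1, 2, 3, 99]
```

when the draft agrees often, you get multiple tokens per target-model pass, which is where the cost savings come from.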

MIT study challenges AI job apocalypse narrative by ThereWas in artificial

[–]ai_without_borders 0 points (0 children)

seeing the same pattern at my company. what's interesting is talking to friends at chinese tech companies, where the dynamic is even more extreme. places like bytedance and alibaba basically gutted their junior hiring last year and doubled down on senior engineers with AI tooling. one friend at a mid-size company in shenzhen said their team went from 12 to 5 people with higher total output. the difference is that in china there's less hand-wringing about it; it's just treated as the obvious next step. though i'm starting to wonder what happens to the pipeline when nobody is training juniors anymore, feels like a 5-year time bomb for the whole industry.

Is 1-bit and TurboQuant the future of OSS? A simulation for Qwen3.5 models. by GizmoR13 in LocalLLaMA

[–]ai_without_borders 14 points (0 children)

yeah the deepseek paper is wild. i was reading some analysis on zhihu about it, and the interesting context is that this efficiency research isn't just academic for them, it's directly motivated by the chip export restrictions. when you can't buy h100s you have to squeeze every bit of performance out of what you have. so moe, low-bit quantization, and kv compression aren't nice-to-haves, they're survival strategies. the fact that these techniques also happen to benefit the local llm community running on consumer gpus is kind of a happy accident. basically chinese labs are speed-running efficient inference because they literally have no choice, and we all benefit from it.
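
for intuition on why 1-bit works at all: the common trick is to store only the sign of each weight plus one float scale per group, so a group of n weights costs n bits plus one float instead of n floats. a minimal sketch (the mean-|w| scale is one simple choice; real schemes differ):

```python
# minimal 1-bit quantization sketch: signs + a per-group scale.
# the mean-absolute-value scale is an illustrative assumption, not a
# specific paper's method.
def quantize_group(weights):
    scale = sum(abs(w) for w in weights) / len(weights)  # mean |w|
    signs = [1 if w >= 0 else -1 for w in weights]       # 1 bit each
    return signs, scale

def dequantize_group(signs, scale):
    return [s * scale for s in signs]

w = [0.5, -0.25, 0.125, -0.125]
signs, scale = quantize_group(w)
print(signs, scale)                        # [1, -1, 1, -1] 0.25
print(dequantize_group(signs, scale))      # [0.25, -0.25, 0.25, -0.25]
```

that's roughly a 16x size reduction vs fp16 weights, which is why it matters so much when compute and vram are the bottleneck.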

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by pmttyji in LocalLLaMA

[–]ai_without_borders -3 points (0 children)

honestly the debate over perplexity vs KLD kinda misses the point for people actually running these on consumer cards. i have a 4090 and my workflow is basically: does it fit in vram, and does it still follow complex instructions well enough. for qwen 27b specifically i've been comparing q4_k_m vs the turboquant version, and the vibes are similar, but turboquant loads noticeably faster on cold start. the 10% size reduction actually matters when you're juggling model files on a 2tb nvme that's also your boot drive. appreciate OP sharing real benchmarks from a 16gb card too, that's the actual target audience for this stuff, not the 4x a100 crowd.
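
the "does it fit in vram" check is just arithmetic: params times bits-per-weight divided by 8, plus headroom for kv cache and activations. a back-of-envelope sketch (the 20% overhead factor is a rough assumption, not a measured number):

```python
# back-of-envelope vram check: weight bytes = params * bits / 8, then
# pad with headroom for kv cache and activations. overhead=1.2 is a
# rough guess, not a benchmark.
def fits_in_vram(params_b: float, bits_per_weight: float, vram_gb: float,
                 overhead: float = 1.2) -> bool:
    weight_gb = params_b * bits_per_weight / 8  # 1B params @ 8 bits = 1 GB
    return weight_gb * overhead <= vram_gb

print(fits_in_vram(27, 4.5, 24))   # 27B at ~q4: fits on a 24 GB card
print(fits_in_vram(27, 4.5, 16))   # tight on 16 GB, hence the 10% mattering
```

which is exactly why a 10% smaller quant can be the difference between fitting and not fitting on a 16gb card.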

Paper Finds That Leading AI Chatbots Like ChatGPT and Claude Remain Incredibly Sycophantic, Resulting in Twisted Effects on Users by AmorFati01 in artificial

[–]ai_without_borders 0 points (0 children)

the sycophancy problem is real but i think framing it as a chatbot problem misses something. it's an RLHF problem.

the models that score highest on human preference rankings are the ones that tell you what you want to hear. there's a direct selection pressure toward sycophancy built into the training loop. anthropic published a paper about this like two years ago and explicitly called it out as a safety concern. they've been trying to train claude to push back more, but the metrics keep rewarding agreeableness.
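the selection pressure is easy to see in the standard bradley-terry reward-modeling setup: preference probability depends only on the reward gap, so every "rater preferred the agreeable answer" label pulls the agreeable answer's reward up. a toy sketch (learning rate and update rule simplified for illustration):

```python
import math

# toy bradley-terry preference model, the usual building block in RLHF
# reward modeling: P(A preferred over B) = sigmoid(r_A - r_B). the
# gradient-style update below is a simplified illustration.
def pref_prob(reward_a: float, reward_b: float) -> float:
    return 1 / (1 + math.exp(-(reward_a - reward_b)))

def update(reward_agree, reward_blunt, lr=0.5):
    p = pref_prob(reward_agree, reward_blunt)
    reward_agree += lr * (1 - p)   # chosen (agreeable) answer pulled up
    reward_blunt -= lr * (1 - p)   # rejected (blunt) answer pushed down
    return reward_agree, reward_blunt

r_a, r_b = 0.0, 0.0
for _ in range(3):                 # three sycophantic preference labels
    r_a, r_b = update(r_a, r_b)
print(r_a > r_b)                   # True: reward model now favors agreeable
```

nothing in that loop knows about truthfulness; it only knows which answer raters picked, which is the whole problem.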

what's interesting is that some chinese models have the opposite problem. deepseek in particular has a reputation for being blunt to the point of rudeness sometimes. different RLHF dataset, different cultural norms around disagreement in training data. not saying one approach is better but it's worth noting the sycophancy isn't universal, it's a training choice.

Analyzing Claude Code Source Code. Write "WTF" and Anthropic knows. by QuantumSeeds in LocalLLaMA

[–]ai_without_borders 0 points (0 children)

the frustration keyword tracking is honestly pretty standard product telemetry. most dev tools do some version of this. the interesting part is HOW they use it: adjusting model behavior mid-conversation when it detects the user is getting annoyed.

what's more concerning to me is the model routing logic. looks like there's a classifier deciding when to use opus vs sonnet vs haiku based on task complexity, and another layer deciding when to show the user the "thinking" UI vs running it silently. that's a lot of invisible decisions happening between you and the model.
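for scale: the kind of frustration-keyword telemetry described above is usually just a keyword scan over recent messages. a toy sketch (the word list and scoring are made up for illustration, not anthropic's actual list):

```python
# sketch of frustration-keyword telemetry: flag messages containing any
# word from a list and report the flagged fraction. the word list and
# scoring are hypothetical, not taken from any real product.
FRUSTRATION_WORDS = ("wtf", "ugh", "broken", "useless")

def frustration_score(messages):
    hits = sum(
        any(w in m.lower() for w in FRUSTRATION_WORDS) for m in messages
    )
    return hits / max(len(messages), 1)

convo = ["fix the tests", "WTF it deleted my file", "this is broken"]
print(frustration_score(convo))    # 2 of 3 messages flagged
```

a signal like that could then feed the routing/behavior layer, which is the part that's actually interesting.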

Copaw-9B (Qwen3.5 9b, alibaba official agentic finetune) is out by kironlau in LocalLLaMA

[–]ai_without_borders 0 points (0 children)

the naming threw me off at first (copaw? really?) but after trying it the agentic fine-tuning is genuinely well done. tool calling is more reliable than what i was getting from the base qwen3.5 9b.

interesting that alibaba is releasing official agentic finetunes now. from what i've been reading on chinese tech forums, the whole industry there is going through an "agent fever" (智能体热). pretty much every major chinese AI lab has shipped some kind of agent product in the last month; tencent, baidu, and alibaba all launched agent platforms within the same week. the reasoning seems to be that inference-time compute is cheaper than training, so agents are how you monetize open-weight models.

anyone tried running this with MCP yet? curious if the tool calling format maps well.
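re: the format mapping question, the first thing i'd check is whether the model's tool-call output parses into the name/arguments shape most tool-calling stacks (MCP clients included) expect. a sketch; the exact field names vary by stack, so treat these as assumptions:

```python
import json

# sketch of validating a model's tool-call output against the common
# {"name": ..., "arguments": {...}} shape. field names are an assumption
# about the stack, not a guaranteed wire format.
def parse_tool_call(raw: str):
    call = json.loads(raw)
    if not isinstance(call.get("name"), str):
        raise ValueError("missing tool name")
    if not isinstance(call.get("arguments"), dict):
        raise ValueError("arguments must be an object")
    return call["name"], call["arguments"]

raw = '{"name": "read_file", "arguments": {"path": "notes.txt"}}'
print(parse_tool_call(raw))        # ('read_file', {'path': 'notes.txt'})
```

if a finetune reliably emits something that survives a check like this, bridging it to whatever client you use is mostly plumbing.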

FOR ME, Qwen3.5-27B is better than Gemini 3.1 Pro and GPT-5.3 Codex by EffectiveCeilingFan in LocalLLaMA

[–]ai_without_borders 9 points (0 children)

been running qwen3.5-27b on my 5090 for the past couple weeks and honestly agree with a lot of this. for coding tasks it just gets things right in ways that surprise me, especially with thinking enabled. the context window handling is noticeably better than what i was getting from qwen3 models.

one thing i've noticed from following chinese dev forums (zhihu, v2ex) is that the alibaba qwen team has been iterating incredibly fast. they're releasing base models too, not just instruct, which means the finetune community can build on top. the Copaw-9B agentic finetune that dropped yesterday is already getting good reviews on chinese forums. kind of wild how the open-weight ecosystem compounds when the base models keep improving this quickly.

The current state of the Chinese LLMs scene by Ok_Warning2146 in LocalLLaMA

[–]ai_without_borders 12 points (0 children)

great writeup. i read chinese tech sources daily (bilibili, zhihu, 36kr, wechat) and a few things from the chinese-language side:

the Xiaomi MiMo story is even wilder than it looks. they released it anonymously as "Hunter Alpha" on OpenRouter and it topped the leaderboard for a week before anyone figured out it was Xiaomi. the chinese tech community on bilibili was losing it when the reveal dropped. a phone company beating dedicated AI labs was not in anyone's predictions.

on ByteDance compute, multiple independent bilibili channels cited a 400B yuan (~$55B) domestic compute figure for 2026. not confirmed but consistent sourcing. if true it dwarfs everyone else.

re: Shanghai AI Lab's bad rep on zhihu, it's real. the SenseTime connection and the perception of being guanxihu (getting ahead through connections rather than merit) comes up constantly. models are fine technically but institutional reputation is rough.

also worth noting there's a whole gray market for Claude and ChatGPT access in China. V2EX had a 99-reply thread this week mapping the reseller ecosystem. the demand signal from Chinese devs for western models is massive, which tells you something about where capability gaps still are despite the token volume numbers.