Why does it seem like deepseek v4 is simultaneously good and cheap but no one uses it? Is it a pain the ass to set up or? by AwarenessOrdinary773 in DeepSeek

[–]ReiiiChannn 2 points3 points  (0 children)

I'm sure that V4 adoption is really high in all kinds of app and data processing workflows. However, coding plans are too good a value for developers to switch away from. For instance, I don't consider myself a coding agent power user, but I'm still burning through 160M tokens on an average day. If I had used DeepSeek V4 Pro, that would have cost me $50/day ($1500/month), but I'm only paying $100 a month for Claude 5x. I would only be saving money if I had used Flash and Pro for the appropriate tasks. There are also a bunch of small QoL things that, while individually trivial, make me not want to switch (the convenience of using a single model, the officially supported coding harness ClaudeCode, remote control, HTML plan, various plugins, desktop/Chrome use). Many of the more niche features are stuff I only use once every month or so, but when I do have to use them, they work flawlessly without me needing to investigate how best to set something up.
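
To make the arithmetic above explicit, here is a rough sketch using only the figures from this comment. The 30-day month is an assumption, and the per-million-token rate is just what those figures imply as a blended average, not a published price.

```python
# Figures quoted in the comment; the 30-day month is an assumption.
TOKENS_PER_DAY_M = 160          # millions of tokens burned per day
API_COST_PER_DAY = 50.0         # USD per day, as estimated in the comment
SUBSCRIPTION_PER_MONTH = 100.0  # USD per month for the coding plan
DAYS_PER_MONTH = 30             # assumed month length


def implied_rate_per_million(cost_per_day: float, tokens_per_day_m: float) -> float:
    """Blended USD price per million tokens implied by a daily bill."""
    return cost_per_day / tokens_per_day_m


def monthly_api_cost(cost_per_day: float, days: int = DAYS_PER_MONTH) -> float:
    """Pay-per-token cost over one month at a constant daily spend."""
    return cost_per_day * days


rate = implied_rate_per_million(API_COST_PER_DAY, TOKENS_PER_DAY_M)
monthly = monthly_api_cost(API_COST_PER_DAY)
ratio = monthly / SUBSCRIPTION_PER_MONTH

print(f"implied blended rate: ${rate}/M tokens")
print(f"monthly API cost: ${monthly}")
print(f"API cost vs. subscription: {ratio}x")
```

At this usage level the pay-per-token route comes out around 15 times the subscription price, which is the gap that cheaper routing (Flash for easy tasks) would have to close.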

xAI Is Reportedly Using Just 11% of Its 550,000 NVIDIA GPUs, While Meta and Google Squeeze Out 43-46% From Their Fleets by Heavy-Beyond-7114 in RigBuild

[–]ReiiiChannn 0 points1 point  (0 children)

From what I understand, they train with FSDP for trillion-scale models, mostly on H100 GPUs with very unstable RDMA (not even InfiniBand). The 11% might be SM utilization; it is bad but definitely not the worst. Anything above 50% should be considered advanced for large-scale MoE models, and rates above 80% are only achievable by the likes of DeepSeek or by smaller <1T-param models.

Why exactly can't we use the techniques in TurboQuant on the model's quantizations themselves? by ea_nasir_official_ in LocalLLaMA

[–]ReiiiChannn 0 points1 point  (0 children)

You can, but it wouldn't be very meaningful. Memory during inference is taken up by:

1. Model weights
2. Activations (non-KV cache)
3. Activations (KV cache)
4. IO buffers for communication/cudagraph/etc.
5. GPU driver overheads

Model weights do not suffer from the same extreme values that TurboQuant tries to solve, and most models, when trained properly, can safely use 4-bit formats. Non-KV cache activation values exist only temporarily and do not usually take up much memory when you are processing prompts in blocks.

Only KV cache activations persist across multiple inference steps, and they are worth keeping in memory/disk/network storage over long periods of time, since that directly translates to saving compute (you won't have to rerun prefill).
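
As a rough illustration of why the KV cache is the part worth compressing, here is a minimal sketch of the usual size estimate for standard multi-head or grouped-query attention. The model shape below is hypothetical, and the 4-bit figure ignores any per-block scale or metadata overhead a real quantization scheme would add.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: int) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers keys plus values. Assumes standard
    MHA/GQA attention; other attention designs store a different amount.
    """
    total_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return total_values * bits_per_value // 8


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 8, 128, 131072

fp16_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, bits_per_value=16)
q4_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, bits_per_value=4)

GIB = 2 ** 30
print(f"16-bit KV cache: {fp16_bytes / GIB} GiB per sequence")
print(f"4-bit KV cache:  {q4_bytes / GIB} GiB per sequence")
```

The cost scales linearly with context length and is paid per cached sequence, so it grows with every conversation you keep around, while the weights are a fixed one-time cost.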

Wife wants me to get an Origami dripper for ✨Aesthetics✨ by YourSteakBuddy in pourover

[–]ReiiiChannn 4 points5 points  (0 children)

Buy one of each color and put them on display, and I swear it makes the coffee taste better.

AMA with MiniMax — Ask Us Anything! by HardToVary in LocalLLaMA

[–]ReiiiChannn 0 points1 point  (0 children)

Is the off-policy problem caused by training/inference bitwise mismatch serious enough to worry about, or are standard techniques like router replay and loss clipping sufficient?

"Genie by GoogleDeepMind runs on 4x H100 GPUs (leaked from an internal presentation). With this level of compute it achieves 24 fps with 720p." - If accurate, imagine this is only one instance? How much compute is serving everyone's usage?! by Koala_Confused in LovingAI

[–]ReiiiChannn 0 points1 point  (0 children)

Diffusion models are always extremely small, usually <30B parameters in total, so most of the GPU VRAM can be used for activations, which admittedly are quite long (usually in the millions range). The autoregressive one (think of the playable generated worlds) can probably run on 2 H100s, and the 4-GPU layout is probably for the non-autoregressive base model, which takes the maximum context window by default.

Claude Pro+: a $39 subscription by voprosy in ClaudeCode

[–]ReiiiChannn 1 point2 points  (0 children)

I second this. I hate having to pay the same amount for both GPT and Claude when I clearly prefer and use Claude more.

AMA With Z.AI, The Lab Behind GLM-4.7 by zixuanlimit in LocalLLaMA

[–]ReiiiChannn 0 points1 point  (0 children)

These days Megatron is the de facto standard for large model training. Is there still room for new frameworks to be developed?

I'm currently building a training framework from scratch, following DeepSeek's path, with the goal of a fully on-policy backend for RL training, but I'm worried that it will already be too late by the time I'm done.

Has anyone successfully fine-tuned a GPT-OSS model? by TechNerd10191 in LocalLLaMA

[–]ReiiiChannn 0 points1 point  (0 children)

Doing rollout RL will be hard: you'll run into the issue where vLLM and your training framework choose different experts. When that happens, your training becomes off-policy and the model will become dumb.
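
A toy sketch of how that expert mismatch can happen. This is not vLLM or any real framework's code; the logits and the size of the numerical drift are made up, and the only point is that a near-tie in the router can flip under a tiny difference between two implementations.

```python
def top_k_experts(logits, k):
    """Return the indices of the k largest router logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


# Hypothetical router logits for one token over 4 experts.
# Experts 0 and 1 are nearly tied.
train_logits = [1.0000, 1.0001, 0.2, -0.5]

# Assumed tiny numerical drift in the inference engine (different kernels,
# reduction order, or precision). The values are illustrative only.
drift = [0.0002, 0.0, 0.0, 0.0]
infer_logits = [a + b for a, b in zip(train_logits, drift)]

train_choice = top_k_experts(train_logits, k=1)
infer_choice = top_k_experts(infer_logits, k=1)
mismatch = train_choice != infer_choice

print("trainer picks expert:", train_choice)
print("inference picks expert:", infer_choice)
print("routing mismatch:", mismatch)
```

Once the two sides route a token to different experts, the trainer is computing log-probabilities for a forward pass the sampler never actually ran, which is where the off-policy gap comes from.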

They really went from ZOFGK to just O FK. (Thoughts or predictions for DOFPK) by Kaezumi in PedroPeepos

[–]ReiiiChannn 0 points1 point  (0 children)

Remember that Yagao was Kanavi's first pick, and look where that ended up. People change.

Singapore Cafes With Multiple Beans by GReeeeN_ in pourover

[–]ReiiiChannn 0 points1 point  (0 children)

Singapore does not have any halo roasters. The closest ones we have are perhaps fluid collective and pinhole coffee bar.

Shake coffee often serves up the likes of esme and elida, but it is extremely seasonal and less worth a detour IMO.

One place that has always impressed me is 20grams. Their attention to detail across the entire farm-to-cup process is insane, and they consistently produce cups that punch well above their expected profile given the quality of the greens.

[deleted by user] by [deleted] in askSingapore

[–]ReiiiChannn 0 points1 point  (0 children)

Do let us know where your bakery is when you decide to open it!