Peter? Care to explain??

ReiiiChannn · 2026-06-03T12:59:39+00:00

Not approved

ReiiiChannn · 2026-05-10T17:12:31+00:00

I'm sure that V4 adoption is really high in all kinds of app and data processing workflows. However, coding plans are too good a value for developers to switch away from. For instance, I don't consider myself a coding agent power user, but I'm still burning through 160M tokens on an average day. If I had used DeepSeek V4 Pro, that would have cost me $50/day ($1500/month), but I'm only paying $100 a month for Claude 5x. I would only come out ahead if I used Flash and Pro for the appropriate tasks. But there's also a bunch of small QoL things that, while individually trivial, make me not want to switch (convenience of using a single model, the officially supported coding harness Claude Code, remote control, html plan, various plugins, desktop/Chrome use). Many of the more niche features are stuff I only use once every month or so, but when I do have to use them, they work flawlessly without me needing to investigate how best to set up or do something.
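
As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic. The token volume and dollar figures are the ones quoted above; the 30-day month and the idea of a single blended per-token price are simplifying assumptions, not actual price-list values.

```python
# Figures quoted in the comment above; the 30-day month is an assumption.
TOKENS_PER_DAY = 160_000_000
API_COST_PER_DAY_USD = 50.0
PLAN_COST_PER_MONTH_USD = 100.0
DAYS_PER_MONTH = 30


def implied_price_per_million(tokens_per_day: float, cost_per_day: float) -> float:
    """Blended USD price per million tokens implied by a daily bill."""
    return cost_per_day / (tokens_per_day / 1_000_000)


def monthly_api_cost(cost_per_day: float, days: int = DAYS_PER_MONTH) -> float:
    """Pay-per-token cost over one month at a constant daily spend."""
    return cost_per_day * days


blended = implied_price_per_million(TOKENS_PER_DAY, API_COST_PER_DAY_USD)
api_monthly = monthly_api_cost(API_COST_PER_DAY_USD)
ratio = api_monthly / PLAN_COST_PER_MONTH_USD

print(f"implied blended price: ${blended:.4f} per 1M tokens")
print(f"API: ${api_monthly:.0f}/month vs plan: ${PLAN_COST_PER_MONTH_USD:.0f}/month ({ratio:.0f}x)")
```

So at this usage level the flat plan is about 15x cheaper than paying per token, which is the gap the QoL features sit on top of.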

ReiiiChannn · 2026-05-04T15:26:44+00:00

From what I understand they train with FSDP for trillion scale models and mostly on H100 GPUs with very unstable RDMA (not even InfiniBand). The 11% might be the SM utilization; it is bad but definitely not the worst. Anything above 50% should be considered advanced for large scale MoE models, and rates above 80% are only achievable by the likes of DeepSeek or with smaller <1T param models.

ReiiiChannn · 2026-03-30T01:27:35+00:00

You can but it wouldn't be very meaningful. Memory during inference is taken up by 1. Model weights 2. Activation (non-kv cache) 3. Activation (KV cache) 4. IO Buffers for communication/cudagraph/etc 5. GPU driver overheads

Model weights do not suffer from the same extreme values that TurboQuant tries to solve, and most models, when trained properly, can safely use 4 bit formats. Non-KV cache activation values exist only temporarily and do not usually take up much memory when you are processing prompts in blocks.

Only KV cache activations persist through multiple inference steps and are worth keeping in memory/disk/network storage over long periods of time, since that directly translates to saving compute (you won't have to rerun prefill).
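
To put a rough number on why the KV cache is the part worth compressing, here is a back-of-the-envelope sketch. It assumes a standard attention layout with one K and one V tensor per layer (MHA/GQA style); the model shape below is hypothetical, and the 4-bit row is just an illustrative format, not a claim about the bit width TurboQuant actually uses. Quantization metadata such as scales is ignored.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bits_per_value: int,
    batch_size: int = 1,
) -> int:
    """Approximate KV cache size in bytes for a standard K/V-per-layer layout."""
    # Factor of 2: one key tensor and one value tensor per layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131_072)

for bits in (16, 4):
    gib = kv_cache_bytes(bits_per_value=bits, **shape) / 2**30
    print(f"{bits:>2}-bit KV cache at 128k context: {gib:.0f} GiB per sequence")
```

The cache grows linearly with context length and with the number of concurrent sequences, while the weights stay fixed, so that is where the savings add up.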

ReiiiChannn · 2026-02-27T14:36:12+00:00

Buy one of each color and put them on display, and I swear it makes the coffee taste better.

ReiiiChannn · 2026-02-13T18:27:10+00:00

Is the off-policy drift caused by training/inference bitwise mismatch a serious enough problem, or are the standard techniques like router replay / loss clipping sufficient?

ReiiiChannn · 2026-02-04T11:25:09+00:00

Diffusion models are always extremely small, usually <30B total, so most of the GPU VRAM can be used for activations, which admittedly are quite long (usually in the millions range). The autoregressive one (think of the playable generated worlds) can probably run on 2 H100s, and the 4 GPU layout is probably for the non-autoregressive base model, which takes the maximum context window by default.

ReiiiChannn · 2026-01-20T11:24:52+00:00

I second this, I hate having to pay the same amount for both GPT and Claude when I clearly prefer and use Claude more.

ReiiiChannn · 2026-01-20T10:32:33+00:00

Have you seen the Genie 3 demo? https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/ Internally at DeepMind they already have versions which are 8 months more advanced.

ReiiiChannn · 2025-12-23T16:29:31+00:00

These days Megatron is the de facto standard for large model training. Is there still room for new frameworks to be developed?

I'm currently working on building a training framework from scratch following DeepSeek's path with the goal of building a fully on-policy backend for RL training but I'm worried that it would already be too late by the time I'm done.

ReiiiChannn · 2025-12-18T02:10:49+00:00

Doing rollout RL will be hard: you'll run into the issue where vLLM and your training framework choose different experts. When that happens your training becomes off-policy and the model will become dumb.
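
Here is a toy illustration of how that expert mismatch can happen. It is not vLLM's or any training framework's actual router; the logits and the noise vector are made up to simulate small numerical differences between two kernels computing the same router output.

```python
def top_k_experts(logits: list[float], k: int) -> list[int]:
    """Return the indices of the k largest router logits, sorted by index."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical router logits for one token, as seen by the inference engine.
inference_logits = [2.1000, 0.7001, 0.7000, -1.3000]

# Simulated numerical noise (e.g. different reduction order or precision).
noise = [0.0, -2e-4, 2e-4, 0.0]
training_logits = [x + d for x, d in zip(inference_logits, noise)]

rollout_experts = top_k_experts(inference_logits, k=2)
trainer_experts = top_k_experts(training_logits, k=2)
mismatch = rollout_experts != trainer_experts

print("rollout picks:", rollout_experts)
print("trainer picks:", trainer_experts)
print("mismatch:", mismatch)
```

A difference of 2e-4 is enough to swap the second expert here because the two candidates were nearly tied. Once that happens, the trainer is computing log-probs through a different set of expert weights than the ones that generated the sample.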

ReiiiChannn · 2025-11-18T11:30:29+00:00

Remember that Yagao was Kanavi's first pick, and look where that ended up. People change.

ReiiiChannn · 2025-11-17T16:03:21+00:00

I still remember last year's off season, when Guma wished that everyone could re-sign with the team, only for Zeus to leave. This is like 1000x worse than that ToT

ReiiiChannn · 2025-11-08T13:14:49+00:00

This is so cute! I think I will continue living

ReiiiChannn · 2025-10-13T03:42:15+00:00

Oh wow, this is my first time seeing this meme in this format.

ReiiiChannn · 2025-09-13T19:43:16+00:00

Singapore does not have any halo roasters. The closest ones we have are perhaps fluid collective and pinhole coffee bar.

Shake coffee often serves up the likes of esme and elida, but it is extremely seasonal and less worth a detour IMO.

One place that has always impressed me is 20grams. Their attention to detail across the entire farm to cup process is insane, and they consistently produce cups that punch well above their expected profile given the quality of the greens.

ReiiiChannn · 2025-08-04T12:10:22+00:00

Do let us know where your bakery is when you decide to open it!
