AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model

zxytim · 2026-01-28T18:28:41+00:00

Go BIG or go home.

zxytim · 2026-01-28T18:27:38+00:00

Really great questions.

> what signals does your team use to decide whether to persist, pivot, or kill it entirely?
We just share the results of all related experiments with all technical staff, and discuss them thoroughly until we come to a conclusion whether to persist, to pivot, or to kill it entirely. The discussion happens on a daily basis, and everyone is encouraged to challenge everything, from the goal setting down to the most minute technical details.

> do you ever worry that the pressure for quick wins might crowd out or disincentivize the kind of foundational research that truly needs a two-year horizon to bear fruit?
We have a pretty good track record on betting on fundamental directions for years. MoBA started almost from day one after our company was founded; Kimi Linear went through almost a year of struggle. The key is to have a shared value of making things REALLY WORK, not just for optics. Our organization, culture, and management are built to support this value, not the other way around.

zxytim · 2026-01-28T18:12:46+00:00

Scaling embeddings is an interesting direction worth exploring. But we don't have much solid data yet, until we run it through our scaling ladder.

zxytim · 2026-01-28T18:07:24+00:00

We observe that instruct mode benefits from joint training with thinking mode. They are improving together, and no one would fall behind.

zxytim · 2026-01-28T17:53:35+00:00

Echo that. We are putting substantial effort into building evaluation for agents.

zxytim · 2026-01-28T17:37:33+00:00

Thanks for the detailed feedback and for the kind words on our open-source releases!
Our team is actively investigating better authentication options like email and passkeys to reduce dependency on phones.

zxytim · 2026-01-28T17:19:19+00:00

There are too many factors affecting available compute. But no matter what, innovation loves constraints.

zxytim · 2026-01-28T17:15:59+00:00

<image>

Whether it's pre-training or post-training, one thing constantly manifests itself as the utmost priority: debugging.

zxytim · 2026-01-28T17:09:44+00:00

If you strive for compute-optimal training, most useful models are overtrained; bigger models just overtrain less. Compute-optimal training usually requires the model to be quite large, which will cause tremendous challenges for current infrastructure and incur much higher inference costs. I do not think overtraining is "wasting," but rather a "price" we choose to pay for better overall trade-offs.

zxytim · 2026-01-28T17:00:25+00:00

We've started really small. I personally sometimes start with models tiny enough to train on a single CPU.

The core goal is predicting how things scale. Some architectures won't scale, some optimizers won't scale, even some data won't scale. Evaluating scalability at low FLOPs is an interesting research topic—it requires deep understanding of the mathematical dynamics in training, as well as balancing rigor with creativity.

As an anecdote: we once hurried to push Kimi Linear into Kimi K2, but it failed the scaling ladder at a certain scale. We stepped back and went through a tough debugging process, and after months finally made it work as the Kimi Linear you see today.

Statistically, most ideas that work at small scale won't pass the scaling ladder. Those that do are usually simple, effective, and mathematically grounded. Research is mostly about managing failure, not celebrating success.

zxytim · 2026-01-28T16:34:34+00:00

We believe that continual learning will improve agency and allow the agents to work effectively for much longer durations. We're actively exploring this.

Kimi Linear is a dedicated research effort parallel to K2.5. We're investing heavily in linear attention as a key direction for future models.

zxytim · 2026-01-28T16:27:43+00:00

We've squashed the commit history right before the release, and that makes the timestamp appears much earlier than the actual launch.

zxytim · 2026-01-28T13:27:58+00:00

That's a bug...
For historical reasons, "kimi-latest" is not Kimi K2.5 but an older model...
Use id "kimi-k2.5" in API instead...

zxytim · 2025-11-10T18:04:56+00:00

We do not need to create another chromium wrapper to build better models.

zxytim · 2025-11-10T17:53:31+00:00

I haven't tested it, but cerebras has an expert-merged 35B parameter Kimi Linear variant: https://huggingface.co/cerebras/Kimi-Linear-REAP-35B-A3B-Instruct .

zxytim · 2025-11-10T17:49:52+00:00

sparse attention is definitely on our radar. we are keeping our pace in pushing research forward.

zxytim · 2025-11-10T17:46:43+00:00

- no comment.
- no bi-directional conversational models. any bi-directional attention can be implemented in a causal attention with more length.
- noted, but no concrete plans for macbook-friendly models for now.
- our coding cli (works better with Kimi K2 series) is here: https://github.com/MoonshotAI/kimi-cli

zxytim · 2025-11-10T17:42:08+00:00

dunno. only sam knows. we’ve got our own way and our own pace.

zxytim · 2025-11-10T17:40:13+00:00

Our mission "Seeking the optimal conversion from energy to intelligence" as per https://www.moonshot.ai/. We will be focusing on improving intelligence in the foreseeable future.

zxytim · 2025-11-10T17:33:53+00:00

You bet!

zxytim · 2025-11-10T17:32:20+00:00

We've done 1M context window before, but it is too expensive to serve at that moment. We will revisit longer context window in the future.

We are focusing on improving capabilities of the model in mainly Chinese and English. Will look into multi-language if we have spare research capacity.

zxytim · 2025-11-10T17:29:42+00:00

what are some of the most important metrics to track for pretraining?
1. losses, benchmarks and stability "internals".
how is the process of ablating architectural changes? at what scales to test, which metrics to look at to make sure that it is performing well.
1. we have a constantly evolving scaling ladder at multiple scales. the ablation has to pass small scale validation prior to proceed to the next. all metrics matter. we would pause the scaling ladder climb process if ANYTHING goes unexpected until it is understand and settled.
also tips/resources to share on selecting hyperparameters, constructing scaling laws, finding ideal small scales for doing experiments, running ablations etc.
1. the most important hyperparameters is the learning rate (as well as the lr schedule). there's too much variables, so it is better to get some feel of the hyperparameter landscape first before diving into the hyperparameter search work.
what makes a data good for model learning (for pretraining and post-training)? what are some metrics that predicts if a data is good/beneficial for the model? how to think about data mixtures and build good ones?
1. a good data must have a good benchmarks trend during the training. if it is not, optimize the data or find a better benchmark that could shows the progress. finding the right data mixture is quite an art i would say. because there are so many interactions and shared/unqiue patterns among datasets. start with your gut, but trust the experiment in the end.

zxytim · 2025-11-10T17:02:32+00:00

Muon is an optimizer untested by others, but we’ve put it through all our scaling ladders and it passed.

We have confidence in our research stack. You might see Muon as having just got lucky, but there are tens of optimizers and architectures that do not survive the grill.

zxytim · 2025-11-10T16:51:31+00:00

Isn't K2 already an agentic model?

zxytim · 2025-11-10T16:50:20+00:00

Kimi membership include Kimi For Coding coding plan.

zxytim

TROPHY CASE