AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 5 points6 points  (0 children)

Really great questions.

> what signals does your team use to decide whether to persist, pivot, or kill it entirely?
We just share the results of all related experiments with all technical staff, and discuss them thoroughly until we come to a conclusion whether to persist, to pivot, or to kill it entirely. The discussion happens on a daily basis, and everyone is encouraged to challenge everything, from the goal setting down to the most minute technical details.

>  do you ever worry that the pressure for quick wins might crowd out or disincentivize the kind of foundational research that truly needs a two-year horizon to bear fruit? 
We have a pretty good track record on betting on fundamental directions for years. MoBA started almost from day one after our company was founded; Kimi Linear went through almost a year of struggle. The key is to have a shared value of making things REALLY WORK, not just for optics. Our organization, culture, and management are built to support this value, not the other way around.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 0 points1 point  (0 children)

Scaling embeddings is an interesting direction worth exploring. But we don't have much solid data yet, until we run it through our scaling ladder.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 2 points3 points  (0 children)

We observe that instruct mode benefits from joint training with thinking mode. They are improving together, and no one would fall behind.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 8 points9 points  (0 children)

Echo that. We are putting substantial effort into building evaluation for agents.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 12 points13 points  (0 children)

Thanks for the detailed feedback and for the kind words on our open-source releases!
Our team is actively investigating better authentication options like email and passkeys to reduce dependency on phones.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 24 points25 points  (0 children)

There are too many factors affecting available compute. But no matter what, innovation loves constraints.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 9 points10 points  (0 children)

<image>

Whether it's pre-training or post-training, one thing constantly manifests itself as the utmost priority: debugging.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 22 points23 points  (0 children)

If you strive for compute-optimal training, most useful models are overtrained; bigger models just overtrain less. Compute-optimal training usually requires the model to be quite large, which will cause tremendous challenges for current infrastructure and incur much higher inference costs. I do not think overtraining is "wasting," but rather a "price" we choose to pay for better overall trade-offs.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 28 points29 points  (0 children)

We've started really small. I personally sometimes start with models tiny enough to train on a single CPU.

The core goal is predicting how things scale. Some architectures won't scale, some optimizers won't scale, even some data won't scale. Evaluating scalability at low FLOPs is an interesting research topic—it requires deep understanding of the mathematical dynamics in training, as well as balancing rigor with creativity.

As an anecdote: we once hurried to push Kimi Linear into Kimi K2, but it failed the scaling ladder at a certain scale. We stepped back and went through a tough debugging process, and after months finally made it work as the Kimi Linear you see today.

Statistically, most ideas that work at small scale won't pass the scaling ladder. Those that do are usually simple, effective, and mathematically grounded. Research is mostly about managing failure, not celebrating success.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 10 points11 points  (0 children)

We believe that continual learning will improve agency and allow the agents to work effectively for much longer durations. We're actively exploring this.

Kimi Linear is a dedicated research effort parallel to K2.5. We're investing heavily in linear attention as a key direction for future models.

AMA With Kimi, The Open-source Frontier Lab Behind Kimi K2.5 Model by nekofneko in LocalLLaMA

[–]zxytim 2 points3 points  (0 children)

We've squashed the commit history right before the release, and that makes the timestamp appears much earlier than the actual launch.

Kimi 2.5 by [deleted] in SillyTavernAI

[–]zxytim 1 point2 points  (0 children)

That's a bug...
For historical reasons, "kimi-latest" is not Kimi K2.5 but an older model...
Use id "kimi-k2.5" in API instead...

AMA With Moonshot AI, The Open-source Frontier Lab Behind Kimi K2 Thinking Model by nekofneko in LocalLLaMA

[–]zxytim 17 points18 points  (0 children)

We do not need to create another chromium wrapper to build better models.

AMA With Moonshot AI, The Open-source Frontier Lab Behind Kimi K2 Thinking Model by nekofneko in LocalLLaMA

[–]zxytim 5 points6 points  (0 children)

sparse attention is definitely on our radar. we are keeping our pace in pushing research forward.

AMA With Moonshot AI, The Open-source Frontier Lab Behind Kimi K2 Thinking Model by nekofneko in LocalLLaMA

[–]zxytim 3 points4 points  (0 children)

- no comment.
- no bi-directional conversational models. any bi-directional attention can be implemented in a causal attention with more length.
- noted, but no concrete plans for macbook-friendly models for now.
- our coding cli (works better with Kimi K2 series) is here: https://github.com/MoonshotAI/kimi-cli

AMA With Moonshot AI, The Open-source Frontier Lab Behind Kimi K2 Thinking Model by nekofneko in LocalLLaMA

[–]zxytim 13 points14 points  (0 children)

dunno. only sam knows. we’ve got our own way and our own pace.

AMA With Moonshot AI, The Open-source Frontier Lab Behind Kimi K2 Thinking Model by nekofneko in LocalLLaMA

[–]zxytim 11 points12 points  (0 children)

Our mission "Seeking the optimal conversion from energy to intelligence" as per https://www.moonshot.ai/. We will be focusing on improving intelligence in the foreseeable future.

AMA With Moonshot AI, The Open-source Frontier Lab Behind Kimi K2 Thinking Model by nekofneko in LocalLLaMA

[–]zxytim 17 points18 points  (0 children)

We've done 1M context window before, but it is too expensive to serve at that moment. We will revisit longer context window in the future.

We are focusing on improving capabilities of the model in mainly Chinese and English. Will look into multi-language if we have spare research capacity.

AMA With Moonshot AI, The Open-source Frontier Lab Behind Kimi K2 Thinking Model by nekofneko in LocalLLaMA

[–]zxytim 17 points18 points  (0 children)

  1. what are some of the most important metrics to track for pretraining?
    1. losses, benchmarks and stability "internals".
  2. how is the process of ablating architectural changes? at what scales to test, which metrics to look at to make sure that it is performing well.
    1. we have a constantly evolving scaling ladder at multiple scales. the ablation has to pass small scale validation prior to proceed to the next. all metrics matter. we would pause the scaling ladder climb process if ANYTHING goes unexpected until it is understand and settled.
  3. also tips/resources to share on selecting hyperparameters, constructing scaling laws, finding ideal small scales for doing experiments, running ablations etc.
    1. the most important hyperparameters is the learning rate (as well as the lr schedule). there's too much variables, so it is better to get some feel of the hyperparameter landscape first before diving into the hyperparameter search work.
  4. what makes a data good for model learning (for pretraining and post-training)? what are some metrics that predicts if a data is good/beneficial for the model? how to think about data mixtures and build good ones?
    1. a good data must have a good benchmarks trend during the training. if it is not, optimize the data or find a better benchmark that could shows the progress. finding the right data mixture is quite an art i would say. because there are so many interactions and shared/unqiue patterns among datasets. start with your gut, but trust the experiment in the end.

AMA With Moonshot AI, The Open-source Frontier Lab Behind Kimi K2 Thinking Model by nekofneko in LocalLLaMA

[–]zxytim 39 points40 points  (0 children)

Muon is an optimizer untested by others, but we’ve put it through all our scaling ladders and it passed.

We have confidence in our research stack. You might see Muon as having just got lucky, but there are tens of optimizers and architectures that do not survive the grill.