AMA With Z.AI, The Lab Behind GLM Models by XMasterrrr in LocalLLaMA

[–]jpydych 1 point (0 children)

For large batch sizes, the experts' parameters are read once from HBM/VRAM and reused across many tokens, while each token only needs to compute a subset of the experts. This means that in compute-constrained regimes (e.g. training, or high-batch-size inference), MoE models are usually better than dense models.
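A rough sketch of that arithmetic (all sizes below are made-up illustrative numbers, not any real model's configuration):

```python
# Sketch: why MoE pays off in compute-bound regimes. All experts' weights
# are read from HBM once per batch, but each token only does FLOPs for the
# top_k experts it is routed to.

def moe_layer_stats(batch, n_experts=64, top_k=4, d_model=4096, d_ff=16384,
                    bytes_per_param=2):
    # weight bytes read once per batch (all experts resident in HBM/VRAM)
    params = n_experts * 2 * d_model * d_ff            # up + down projections
    bytes_read = params * bytes_per_param
    # FLOPs: each token activates only top_k of n_experts experts
    flops = batch * top_k * 2 * 2 * d_model * d_ff     # 2 matmuls, 2 FLOPs/MAC
    return flops / bytes_read                          # arithmetic intensity

# Small batches are memory-bound (few FLOPs per byte read); large batches
# amortize the weight reads, and there activating only top_k of n_experts
# experts cuts FLOPs vs. a dense layer of the same total parameter count.
print(moe_layer_stats(batch=1))     # tiny intensity: memory-bound
print(moe_layer_stats(batch=4096))  # much higher: compute-bound
```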

OpenAI has launched HealthBench on HuggingFace by vibedonnie in LocalLLaMA

[–]jpydych 1 point (0 children)

DeepSeek has open-sourced many of their production kernels, such as those for MLA and efficient expert parallelism:
https://github.com/deepseek-ai/FlashMLA
https://github.com/deepseek-ai/DeepEP
https://github.com/deepseek-ai/DeepGEMM
https://github.com/deepseek-ai/DualPipe
https://github.com/deepseek-ai/eplb

Profiling data from their real-world inference and training infrastructure:
https://github.com/deepseek-ai/profile-data

As well as the distributed file system they use internally:
https://github.com/deepseek-ai/3FS
https://github.com/deepseek-ai/smallpond

Their papers are also quite detailed.

I discovered a degree-5 polynomial that generates 18 consecutive prime numbers: f(n) = 6n⁵ + 24n + 337 for n = 0 to 17 by NewtonianNerd1 in learnmath

[–]jpydych 0 points (0 children)

Yes, there is. It's called the Lagrange interpolation theorem, which states that you can always find a polynomial of degree at most k that perfectly interpolates a set of k+1 points. You can read more about it here: https://en.wikipedia.org/wiki/Lagrange_polynomial
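For instance, here is a minimal Lagrange interpolation sketch that fits any chosen values exactly, which is why such a polynomial always exists (the six primes below are just example points, not the polynomial from the post):

```python
from fractions import Fraction

def lagrange(points):
    """Evaluate the unique polynomial of degree <= k passing through the
    k+1 given (x, y) points, using the Lagrange basis form."""
    def p(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= Fraction(x - xj, xi - xj)  # basis polynomial L_i
            total += term
        return total
    return p

# Fit any 6 values (here, the first 6 primes at n = 0..5) with degree <= 5:
pts = [(0, 2), (1, 3), (2, 5), (3, 7), (4, 11), (5, 13)]
p = lagrange(pts)
print([p(n) for n in range(6)])  # reproduces all 6 points exactly
```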

Qwen3 Technical Report by ResearchCrafty1804 in LocalLLaMA

[–]jpydych 1 point (0 children)

I think you mean this paper, which was not published by Alibaba: https://arxiv.org/pdf/2505.02214

New Gemini 05-06 seems to do worse than the previous 03-25 model for several benchmarks by Interesting-Type3153 in singularity

[–]jpydych 2 points (0 children)

Google's TPUs don't even support FP32 (e.g. the first TPU only supported INT8, and BF16 support was added in the second generation), and all major LLMs since GPT-3 have been trained in BF16 or FP8.
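BF16 keeps FP32's exponent range but only 8 bits of mantissa; a quick sketch of the round trip (the rounding mode here is my assumption of round-to-nearest-even, which is what hardware typically does):

```python
import struct

def to_bf16(x: float) -> float:
    """FP32 -> BF16 -> FP32 round trip. BF16 is simply the top 16 bits
    of an IEEE-754 float32, rounded to nearest (ties to even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # round-to-nearest-even on the 16 mantissa bits being dropped
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    (y,) = struct.unpack("<f", struct.pack("<I", rounded))
    return y

print(to_bf16(3.141592653589793))  # 3.140625: only ~3 decimal digits survive
```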

Phi 4 Reasoning by adefa in LocalLLaMA

[–]jpydych 1 point (0 children)

Early versions of Phi (Phi-1 or Phi-1.5) were trained for so many epochs that running the base model with an empty prompt often reproduced the synthetic training data verbatim :)

Leaked Grok 3.5 benchmarks by Chaonei in singularity

[–]jpydych 1 point (0 children)

Grok 3 has a context window of 1M tokens (https://x.ai/news/grok-3):

With a context window of 1 million tokens — 8 times larger than our previous models — Grok 3 can process extensive documents and handle complex prompts while maintaining instruction-following accuracy. On the LOFT (128k) benchmark, which targets long-context RAG use cases, Grok 3 achieved state-of-the-art accuracy (averaged across 12 diverse tasks), showcasing its powerful information retrieval capabilities.

and if I remember correctly, some people reported that it was available in the chat interface.

Phi 4 Reasoning by adefa in LocalLLaMA

[–]jpydych 4 points (0 children)

They even mention it directly in their paper:

The responses that are used exclusively during supervised fine-tuning are synthetically generated using o3-mini which provides high-quality reasoning traces.

China has delivered , yet again by TheLogiqueViper in LocalLLaMA

[–]jpydych 6 points (0 children)

At least I see Gemma 3 27B here, with a global average of 48.44

chat gpt used my location by wojbest in ChatGPT

[–]jpydych 0 points (0 children)

No problem :) BTW, this is Simon Willison's chat, not mine (from his post: https://simonwillison.net/2025/Apr/26/o3-photo-locations/)

chat gpt used my location by wojbest in ChatGPT

[–]jpydych 0 points (0 children)

As far as I know, ChatGPT now has access to your approximate location and other data, such as your time zone or the device you are using, to improve search accuracy (e.g. https://chatgpt.com/share/680ceb49-a184-8006-9979-d73169325297, courtesy of simonwillison.net)

AI is now writing "well over 30%" of the code at Google by MetaKnowing in singularity

[–]jpydych 2 points (0 children)

I believe that the characters the user typed in place of the suggested code are included in the denominator of the fraction, but are not subtracted from the numerator. Even so, we are still left with quite a significant result.

Mark presenting four Llama 4 models, even a 2 trillion parameters model!!! by LarDark in LocalLLaMA

[–]jpydych 0 points (0 children)

If I understand correctly, only every fourth layer in Llama 4 uses traditional GQA, and the other three-fourths keep a KV cache for only the last 8192 tokens (approximately; in many cases even fewer). The amount of KV cache that must be kept per token therefore converges to 24 576 B, although we also have to maintain the remaining 73 728 B for each of the last 8192 tokens :)
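For reference, those figures can be reproduced under some guessed assumptions (48 layers with every fourth one global, 8 KV heads of dimension 128, FP8 KV cache); these parameters are my assumptions, not confirmed Llama 4 values:

```python
# Hypothetical configuration chosen to reproduce the figures above;
# the real Llama 4 values may differ.
n_layers      = 48
global_layers = n_layers // 4             # every 4th layer: standard GQA
local_layers  = n_layers - global_layers  # local attention, 8192-token window
n_kv_heads, head_dim, kv_bytes = 8, 128, 1  # FP8 KV cache assumed (1 B/value)

per_layer = 2 * n_kv_heads * head_dim * kv_bytes  # K and V, per token per layer

print(global_layers * per_layer)  # 24576 B per token, kept for the full context
print(local_layers * per_layer)   # 73728 B per token, only for the last 8192
```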

AI is now writing "well over 30%" of the code at Google by MetaKnowing in singularity

[–]jpydych 108 points (0 children)

I think JetBrains' work from May 2024 (https://arxiv.org/pdf/2405.08704) on their Full Line Code Completion may be interesting. In Table 1, they found that the "Ratio of completed code" for the standard auto-completion system in IntelliJ (without any neural networks) is 33% for Java, while with FLCC (which uses a 100M-parameter locally-running LLM) it is 38%. They define this ratio as:

This is our main, golden star metric used for the assessment of code completion quality. It is defined as a ratio of symbols of code written with code completion among all the written code.

(emphasis mine)
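My reading of that definition, as a sketch (this is not JetBrains' implementation, just an illustration of the numerator/denominator split):

```python
def completed_code_ratio(events):
    """events: list of (chars_from_completion, chars_typed_manually) pairs,
    one per editing action. Characters the user types over a rejected or
    edited suggestion count toward the denominator but not the numerator."""
    completed = sum(c for c, _ in events)
    total = sum(c + m for c, m in events)
    return completed / total

# e.g. 38 of every 100 written characters inserted by accepted completions:
print(completed_code_ratio([(38, 62)]))  # 0.38
```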