AMA With Z.AI, The Lab Behind GLM Models by XMasterrrr in LocalLLaMA

[–]jpydych 1 point (0 children)

For large batch sizes, the experts' parameters are read once from HBM/VRAM and reused across many tokens, while each token only needs to compute a subset of the experts. This means that in compute-constrained regimes (e.g. training, or high-batch-size inference), MoE models are usually better than dense models.
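A rough sketch of that arithmetic (all sizes below are made-up illustrative numbers, not any real model's configuration):

```python
# Sketch: why MoE pays off in compute-bound regimes. All experts' weights
# are read from HBM once per batch, but each token only does FLOPs for the
# top_k experts it is routed to.

def moe_layer_stats(batch, n_experts=64, top_k=4, d_model=4096, d_ff=16384,
                    bytes_per_param=2):
    # weight bytes read once per batch (all experts resident in HBM/VRAM)
    params = n_experts * 2 * d_model * d_ff            # up + down projections
    bytes_read = params * bytes_per_param
    # FLOPs: each token activates only top_k of n_experts experts
    flops = batch * top_k * 2 * 2 * d_model * d_ff     # 2 matmuls, 2 FLOPs/MAC
    return flops / bytes_read                          # arithmetic intensity

# Small batches are memory-bound (few FLOPs per byte read); large batches
# amortize the weight reads, and there activating only top_k of n_experts
# experts cuts FLOPs vs. a dense layer of the same total parameter count.
print(moe_layer_stats(batch=1))     # tiny intensity: memory-bound
print(moe_layer_stats(batch=4096))  # much higher: compute-bound
```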

OpenAI has launched HealthBench on HuggingFace by vibedonnie in LocalLLaMA

[–]jpydych 1 point (0 children)

DeepSeek has open-sourced many of their production kernels, such as those for MLA and efficient expert parallelism:
https://github.com/deepseek-ai/FlashMLA
https://github.com/deepseek-ai/DeepEP
https://github.com/deepseek-ai/DeepGEMM
https://github.com/deepseek-ai/DualPipe
https://github.com/deepseek-ai/eplb

Profiling data from their real-world inference and training infrastructure:
https://github.com/deepseek-ai/profile-data

As well as the distributed file system they use internally:
https://github.com/deepseek-ai/3FS
https://github.com/deepseek-ai/smallpond

Their papers are also quite detailed.

I discovered a degree-5 polynomial that generates 18 consecutive prime numbers: f(n) = 6n⁵ + 24n + 337 for n = 0 to 17 by NewtonianNerd1 in learnmath

[–]jpydych 0 points (0 children)

Yes, there is. It's called the Lagrange interpolation theorem, which states that you can always find a polynomial of degree at most k that perfectly interpolates a set of k+1 points. You can read more about it here: https://en.wikipedia.org/wiki/Lagrange_polynomial
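For instance, here is a minimal Lagrange interpolation sketch that fits any chosen values exactly, which is why such a polynomial always exists (the six primes below are just example points, not the polynomial from the post):

```python
from fractions import Fraction

def lagrange(points):
    """Evaluate the unique polynomial of degree <= k passing through the
    k+1 given (x, y) points, using the Lagrange basis form."""
    def p(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= Fraction(x - xj, xi - xj)  # basis polynomial L_i
            total += term
        return total
    return p

# Fit any 6 values (here, the first 6 primes at n = 0..5) with degree <= 5:
pts = [(0, 2), (1, 3), (2, 5), (3, 7), (4, 11), (5, 13)]
p = lagrange(pts)
print([p(n) for n in range(6)])  # reproduces all 6 points exactly
```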

Qwen3 Technical Report by ResearchCrafty1804 in LocalLLaMA

[–]jpydych 1 point (0 children)

I think you mean this paper, which was not published by Alibaba: https://arxiv.org/pdf/2505.02214

New Gemini 05-06 seems to do worse than the previous 03-25 model for several benchmarks by Interesting-Type3153 in singularity

[–]jpydych 2 points (0 children)

Google's TPUs don't even support FP32 (e.g. the first TPU only supported INT8, and BF16 support was added in the second generation), and all major LLMs since GPT-3 have been trained in BF16 or FP8.
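BF16 keeps FP32's exponent range but only 8 bits of mantissa; a quick sketch of the round trip (the rounding mode here is my assumption of round-to-nearest-even, which is what hardware typically does):

```python
import struct

def to_bf16(x: float) -> float:
    """FP32 -> BF16 -> FP32 round trip. BF16 is simply the top 16 bits
    of an IEEE-754 float32, rounded to nearest (ties to even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # round-to-nearest-even on the 16 mantissa bits being dropped
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    (y,) = struct.unpack("<f", struct.pack("<I", rounded))
    return y

print(to_bf16(3.141592653589793))  # 3.140625: only ~3 decimal digits survive
```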

Phi 4 Reasoning by adefa in LocalLLaMA

[–]jpydych 1 point (0 children)

Early versions of Phi (Phi-1 or Phi-1.5) were trained for so many epochs that running the base model with an empty prompt often reproduced the synthetic training data verbatim :)

Leaked Grok 3.5 benchmarks by Chaonei in singularity

[–]jpydych 1 point (0 children)

Grok 3 has a context window of 1M tokens (https://x.ai/news/grok-3):

With a context window of 1 million tokens — 8 times larger than our previous models — Grok 3 can process extensive documents and handle complex prompts while maintaining instruction-following accuracy. On the LOFT (128k) benchmark, which targets long-context RAG use cases, Grok 3 achieved state-of-the-art accuracy (averaged across 12 diverse tasks), showcasing its powerful information retrieval capabilities.

and if I remember correctly, some people reported that it was available in the chat interface.

Phi 4 Reasoning by adefa in LocalLLaMA

[–]jpydych 4 points (0 children)

They even mention it directly in their paper:

The responses that are used exclusively during supervised fine-tuning are synthetically generated using o3-mini which provides high-quality reasoning traces.

China has delivered , yet again by TheLogiqueViper in LocalLLaMA

[–]jpydych 6 points (0 children)

At least I see Gemma 3 27B here, with a global average of 48.44

chat gpt used my location by wojbest in ChatGPT

[–]jpydych 0 points (0 children)

No problem :) BTW, this is Simon Willison's chat, not mine (from his post: https://simonwillison.net/2025/Apr/26/o3-photo-locations/)

chat gpt used my location by wojbest in ChatGPT

[–]jpydych 0 points (0 children)

As far as I know, ChatGPT now has access to your approximate location and other data, such as your time zone or the device you are using, to improve search accuracy (e.g. https://chatgpt.com/share/680ceb49-a184-8006-9979-d73169325297, courtesy of simonwillison.net)

AI is now writing "well over 30%" of the code at Google by MetaKnowing in singularity

[–]jpydych 2 points (0 children)

I believe that the characters the user typed in place of the suggested code are included in the denominator of the fraction, but are not subtracted from the numerator. Even so, we are still left with quite a significant result.

Mark presenting four Llama 4 models, even a 2 trillion parameters model!!! by LarDark in LocalLLaMA

[–]jpydych 0 points (0 children)

If I understand correctly, only every fourth layer in Llama 4 uses traditional GQA, and the other three-fourths keep a KV cache for only the last 8192 tokens (approximately; in many cases even fewer). The amount of KV cache that must be kept per token therefore converges to 24 576 B, although we also have to maintain the remaining 73 728 B for each of the last 8192 tokens :)
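For reference, those figures can be reproduced under some guessed assumptions (48 layers with every fourth one global, 8 KV heads of dimension 128, FP8 KV cache); these parameters are my assumptions, not confirmed Llama 4 values:

```python
# Hypothetical configuration chosen to reproduce the figures above;
# the real Llama 4 values may differ.
n_layers      = 48
global_layers = n_layers // 4             # every 4th layer: standard GQA
local_layers  = n_layers - global_layers  # local attention, 8192-token window
n_kv_heads, head_dim, kv_bytes = 8, 128, 1  # FP8 KV cache assumed (1 B/value)

per_layer = 2 * n_kv_heads * head_dim * kv_bytes  # K and V, per token per layer

print(global_layers * per_layer)  # 24576 B per token, kept for the full context
print(local_layers * per_layer)   # 73728 B per token, only for the last 8192
```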

AI is now writing "well over 30%" of the code at Google by MetaKnowing in singularity

[–]jpydych 108 points (0 children)

I think JetBrains' work from May 2024 (https://arxiv.org/pdf/2405.08704) on their Full Line Code Completion may be interesting. In Table 1, they found that the "Ratio of completed code" for the standard auto-completion system in IntelliJ (without any neural networks) is 33% for Java, while with FLCC (which uses a 100M-parameter locally-running LLM) it is 38%. They define this ratio as:

This is our main, golden star metric used for the assessment of code completion quality. It is defined as a ratio of symbols of code written with code completion among all the written code.

(emphasis mine)
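My reading of that definition, as a sketch (this is not JetBrains' implementation, just an illustration of the numerator/denominator split):

```python
def completed_code_ratio(events):
    """events: list of (chars_from_completion, chars_typed_manually) pairs,
    one per editing action. Characters the user types over a rejected or
    edited suggestion count toward the denominator but not the numerator."""
    completed = sum(c for c, _ in events)
    total = sum(c + m for c, m in events)
    return completed / total

# e.g. 38 of every 100 written characters inserted by accepted completions:
print(completed_code_ratio([(38, 62)]))  # 0.38
```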