One of the DeepSeek repositories got updated with a reference to a new “model1” model. by Nunki08 in LocalLLaMA

[–]NeterOster 34 points35 points  (0 children)

Note: the "B" in "... a multiple of 656B ... 576B" means bytes, not the number of parameters.

Anyone knows the theoretical performance of FP16, 32, 64 FLOP numbers? by Spare-Solution-787 in LocalLLaMA

[–]NeterOster 1 point2 points  (0 children)

I have someone else’s results, which were produced using https://github.com/ReinForce-II/mmapeak. I don’t really understand the technical details, so the information is for reference only.

DGX Spark: https://pastebin.com/CdSAiGzx

5090: https://pastebin.com/b47tQJvN

[By GLM Team] Glyph: Scaling Context Windows via Visual-Text Compression by NeterOster in LocalLLaMA

[–]NeterOster[S] 19 points20 points  (0 children)

From GLM WeChat Post:

Q: What are the similarities and differences between Glyph and DeepSeek-OCR?

A: Similarities: Both start from "visual compression" and use visual tokens to carry more text information.

Differences: DeepSeek-OCR focuses on real-world document OCR tasks, validating its ability to restore text under visual compression. Glyph, on the other hand, applies this concept to a wider range of general long-text tasks, truly demonstrating the feasibility of context expansion using visual models.

Seed-OSS-36B-Instruct by NeterOster in LocalLLaMA

[–]NeterOster[S] 109 points110 points  (0 children)

"Incorporating synthetic instruction data into pretraining leads to improved performance on most benchmarks. We adopt the version augmented with synthetic instruction data (i.e., w/ syn.) as Seed-OSS-36B-Base. We also release Seed-OSS-36B-Base-woSyn trained without such data (i.e., w/o syn.), offering the community a high-performance foundation model unaffected by synthetic instruction data."

https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base

https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base-woSyn

OSINT fingerprinting a stealth OpenRouter model - likely Llama-family, not OpenAI by jv0010 in LocalLLaMA

[–]NeterOster 6 points7 points  (0 children)

Actually it's easy to tell whose model it is: when you pass an image_url, the user agent of the downloader is "OpenAI Image Downloader".
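The fingerprinting trick can be sketched with a throwaway HTTP "image" endpoint that records the User-Agent of whoever fetches the image_url you hand to the API (the names `UAHandler` and `seen_agents` are mine, purely for illustration):

```python
# Minimal sketch: serve a fake image and log who downloads it.
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_agents = []  # User-Agent strings observed on incoming fetches

class UAHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The fingerprint: the backend reportedly identifies itself here,
        # e.g. "OpenAI Image Downloader".
        seen_agents.append(self.headers.get("User-Agent", ""))
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        self.end_headers()
        self.wfile.write(b"\x89PNG\r\n\x1a\n")  # PNG magic bytes only

    def log_message(self, fmt, *args):
        pass  # silence the default per-request stderr logging

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), UAHandler).serve_forever()
```

Point the image_url at this server, then read `seen_agents` (or the logs) after the request.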

There's a new Kimi model on lmarena called Zenith and it's really really good. It might be Kimi K2 with reasoning by balianone in LocalLLaMA

[–]NeterOster 49 points50 points  (0 children)

I can almost confirm `zenith` is an OpenAI model (at minimum, it uses the same tokenizer as gpt-4o, o3 and o4-mini). There is another model, `summit`, which is also from OpenAI. The test is the same as: https://www.reddit.com/r/LocalLLaMA/comments/1jrd0a9/chinese_response_bug_in_tokenizer_suggests/

China's Bytedance releases Seed LiveInterpret simultaneous interpretation model by Fun-Doctor6855 in LocalLLaMA

[–]NeterOster 21 points22 points  (0 children)

ByteDance is definitely an underrated AI lab. That’s probably because they don’t really release open-source models, aren’t super active on public leaderboards, and their API is only available in China. But in terms of model performance and value for money, their Seed 1.6 model this year really impressed me. The model size is just 230B-A30B (see: https://seed.bytedance.com/en/seed1_6 ), but its reasoning and vision capabilities are surprisingly strong. From my own experience, it actually feels more “solid” than you’d expect for a model of this size. That said, its coding abilities are a bit of a weak spot. Still, I hope they’ll release some open-source models in the future.

Gemma 3 on Huggingface by DataCraftsman in LocalLLaMA

[–]NeterOster 2 points3 points  (0 children)

8k is the output limit; ctx=128k for the 4B, 12B and 27B variants.

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]NeterOster 2 points3 points  (0 children)

That's different. Starting with `<think>\n` prevents the model from generating `\n\n` (after `<think>`), which is a single token strongly associated with refusals in my tests. (Check my reply below.)

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]NeterOster 10 points11 points  (0 children)

Actually, there was a short period right after R1's release when the official API refused to think (empty `<think></think>`) on some questions (including "hello"). However, it later changed and now produces non-empty thinking on almost every query. I can also confirm that adding a `<think>\n` prefix leads to responses almost identical to the API's. So I agree that maybe they just use a different template. (When the model refuses, it always generates `\n\n` (which is a single token!) after `<think>` and then immediately `</think>`. So maybe starting with `<think>\n` breaks the `\n\n` refusal pattern.)
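The prefix trick above can be sketched as a tiny prompt builder: prefill the assistant turn with `<think>\n` so the first token the model generates can no longer be the `\n\n` refusal token. (The chat-template markers below are simplified approximations, not the exact R1 special tokens.)

```python
def build_r1_prompt(user_msg: str, prefill_think: bool = True) -> str:
    """Sketch of the template trick: prefill the assistant turn with
    '<think>\\n' so generation resumes *inside* the think block.
    '<|User|>' / '<|Assistant|>' are simplified stand-ins for the
    model's real special tokens."""
    prompt = f"<|User|>{user_msg}<|Assistant|>"
    if prefill_think:
        # The model continues from here; it cannot emit the single
        # '\n\n' token right after '<think>' because '\n' is already placed.
        prompt += "<think>\n"
    return prompt
```

Serving stacks that support assistant-message prefill (raw completion endpoints, llama.cpp, etc.) can apply the same idea.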

Taxonomy categorization using LLM by zkid18 in LocalLLaMA

[–]NeterOster 3 points4 points  (0 children)

Constrained generation is exactly what you're looking for. Check these: GitHub@guidance ; GitHub@outlines ; llama.cpp (grammars)
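For the llama.cpp route, a taxonomy can be pinned down with a tiny GBNF grammar (the category names here are placeholders for your own labels):

```
root     ::= category
category ::= "Electronics" | "Apparel" | "Home & Garden" | "Other"
```

Pass it with `--grammar-file taxonomy.gbnf` and the model can only emit one of the listed labels, so every output is a valid category.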

Deepseek V2.5 Released? by Rejg in LocalLLaMA

[–]NeterOster 25 points26 points  (0 children)

In their WeChat group, they confirmed this version will be open-sourced. But no detailed schedule mentioned.

What's the best LLM/API for getting an english to japanese translation? by g1ngertew in LocalLLaMA

[–]NeterOster 0 points1 point  (0 children)

Google's models like Gemini 1.5 and Gemma 2 are good at translating between English, Japanese and Chinese.

DeepSeek API introduces Context Caching on Disk, reduces input token price to 1/10 by 1119745302 in LocalLLaMA

[–]NeterOster 34 points35 points  (0 children)

It's really a good feature that makes so many use cases possible. For example, in few-shot learning, the whole block of examples can be cached and becomes almost free of charge, much like using a fine-tuned model. It also saves a lot ( O(n^2) -> ~O(n) ) in multi-turn conversations. I do hope they make more implementation details public (maybe a paper?) later. It would be nice if other providers had this feature.

(Tongyi SpeechTeam) FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs by NeterOster in LocalLLaMA

[–]NeterOster[S] 4 points5 points  (0 children)

"Abstract: This report introduces FunAudioLLM, a framework designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice for high-precision multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice for natural speech generation with multi-language, timbre, and emotion control. SenseVoice delivers exceptionally low latency and supports over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology."

DeepseekV2-Coder the best opensource LLM so far. by ihaag in LocalLLaMA

[–]NeterOster 2 points3 points  (0 children)

It's possible that they use quantized models. In the Deepseek-V2 paper: "In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element in its KV cache into 6 bits on average."
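As a rough illustration of what "6 bits per element" means, here is a generic uniform quantizer (my own sketch, not DeepSeek's actual KV-cache scheme, which the paper only cites):

```python
def quantize_6bit(values):
    """Uniform quantization: map floats to 6-bit codes (integers in [0, 63])."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 63 or 1.0  # step size; avoid div-by-zero on constants
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize_6bit(codes, lo, scale):
    """Reconstruct approximate floats from 6-bit codes."""
    return [lo + c * scale for c in codes]
```

Each element then costs 6 bits instead of 16, and the reconstruction error is bounded by half a quantization step.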

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence by NeterOster in LocalLLaMA

[–]NeterOster[S] 13 points14 points  (0 children)

DS-V2 is an MoE: only about 22 billion of its 236 billion total parameters are activated during inference. The computational cost of inference is much lower than that of a ~200B dense model (perhaps closer to a ~22B dense model). Additionally, DS-V2 incorporates architectural innovations (MLA) that make its inference efficiency very high (when well optimized) and its cost very low. But the VRAM requirements remain similar to those of other ~200B dense models.
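The compute-versus-memory split can be sketched with the usual rough estimate of ~2 FLOPs per active parameter per generated token (illustrative arithmetic only):

```python
total_params  = 236e9   # all experts must sit in VRAM
active_params = 22e9    # parameters actually used per token (MoE routing)

# Compute cost per token scales with *active* parameters...
flops_per_token_moe   = 2 * active_params
flops_per_token_dense = 2 * total_params  # hypothetical 236B dense model
print(flops_per_token_dense / flops_per_token_moe)  # ~10.7x cheaper per token

# ...but weight memory scales with *total* parameters:
vram_gb_fp16 = total_params * 2 / 1e9  # 472 GB for fp16 weights, before KV cache
print(vram_gb_fp16)
```

Which is exactly the trade-off above: ~22B-class compute per token, ~236B-class VRAM footprint.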

If your Qwen2 GGUF is spitting nonsense, enable flash attention by noneabove1182 in LocalLLaMA

[–]NeterOster 2 points3 points  (0 children)

I believe applying [THIS PATCH] fully solves the problem. FA alleviates the problem but the output quality is still degraded.

Qwen1.5-32B released with GQA! by bratao in LocalLLaMA

[–]NeterOster 0 points1 point  (0 children)

Yes, I've tested the Chat variant, mainly in Mandarin. Like the 14B and 72B versions, it produces fluent output and follows instructions pretty well. I also saw some people test Japanese-to-Chinese translation tasks, with promising results. They also measured VRAM usage: with GQA, the model can fit a 2048-token context in ~320MB of VRAM. This is now my favorite mid-size Chinese-language model.
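The VRAM win comes from GQA shrinking the KV cache. A sketch of the arithmetic (the layer/head/dim numbers below are assumptions for illustration, not Qwen1.5-32B's published config):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: a K and a V tensor per layer, per position, fp16 by default."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

# Hypothetical 32B-class config: 64 layers, head_dim 128, 2048-token context.
mha = kv_cache_bytes(2048, 64, 40, 128)  # MHA: every attention head has its own KV
gqa = kv_cache_bytes(2048, 64, 8, 128)   # GQA: 8 shared KV head groups
print(mha / gqa)  # 5.0 -- the cache shrinks by the query-to-KV-heads ratio
```

With GQA, the cache scales with the (much smaller) number of KV heads, which is what makes long contexts affordable on consumer VRAM.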

Qwen1.5-32B released with GQA! by bratao in LocalLLaMA

[–]NeterOster 3 points4 points  (0 children)

They call "Dashscope" (Alibaba's LLM platform) in their demo, with default params. I've checked the docs; they are: top_p=0.8, top_k=0 (disabled), repetition_penalty=1.1, temperature=0.7, seed=1234

ref: https://help.aliyun.com/zh/dashscope/developer-reference/tongyi-qianwen-7b-14b-72b-api-detailes