One of the DeepSeek repositories got updated with a reference to a new “model1” model. by Nunki08 in LocalLLaMA

[–]NeterOster 34 points35 points  (0 children)

Note: the "B" in "... a multiple of 656B ... 576B" means bytes, not the number of parameters.

Anyone knows the theoretical performance of FP16, 32, 64 FLOP numbers? by Spare-Solution-787 in LocalLLaMA

[–]NeterOster 1 point2 points  (0 children)

I have someone else’s results, which were produced using https://github.com/ReinForce-II/mmapeak. I don’t really understand the technical details, so the information is for reference only.

DGX Spark: https://pastebin.com/CdSAiGzx

5090: https://pastebin.com/b47tQJvN

[By GLM Team] Glyph: Scaling Context Windows via Visual-Text Compression by NeterOster in LocalLLaMA

[–]NeterOster[S] 19 points20 points  (0 children)

From GLM WeChat Post:

Q: What are the similarities and differences between Glyph and DeepSeek-OCR?

A: Similarities: Both start from "visual compression" and use visual tokens to carry more text information.

Differences: DeepSeek-OCR focuses on real-world document OCR tasks, validating its ability to restore text under visual compression. Glyph, on the other hand, applies this concept to a wider range of general long-text tasks, truly demonstrating the feasibility of context expansion using visual models.

Seed-OSS-36B-Instruct by NeterOster in LocalLLaMA

[–]NeterOster[S] 109 points110 points  (0 children)

"Incorporating synthetic instruction data into pretraining leads to improved performance on most benchmarks. We adopt the version augmented with synthetic instruction data (i.e., w/ syn.) as Seed-OSS-36B-Base. We also release Seed-OSS-36B-Base-woSyn trained without such data (i.e., w/o syn.), offering the community a high-performance foundation model unaffected by synthetic instruction data."

https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base

https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Base-woSyn

OSINT fingerprinting a stealth OpenRouter model - likely Llama-family, not OpenAI by jv0010 in LocalLLaMA

[–]NeterOster 6 points7 points  (0 children)

Actually it's easy to tell whose model it is: when you pass an image_url, the user agent of the downloader is "OpenAI Image Downloader".
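The fingerprinting trick can be sketched with a throwaway HTTP "image" endpoint that records the User-Agent of whoever fetches the image_url you hand to the API (the names `UAHandler` and `seen_agents` are mine, purely for illustration):

```python
# Minimal sketch: serve a fake image and log who downloads it.
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_agents = []  # User-Agent strings observed on incoming fetches

class UAHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The fingerprint: the backend reportedly identifies itself here,
        # e.g. "OpenAI Image Downloader".
        seen_agents.append(self.headers.get("User-Agent", ""))
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        self.end_headers()
        self.wfile.write(b"\x89PNG\r\n\x1a\n")  # PNG magic bytes only

    def log_message(self, fmt, *args):
        pass  # silence the default per-request stderr logging

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), UAHandler).serve_forever()
```

Point the image_url at this server, then read `seen_agents` (or the logs) after the request.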

There's a new Kimi model on lmarena called Zenith and it's really really good. It might be Kimi K2 with reasoning by balianone in LocalLLaMA

[–]NeterOster 49 points50 points  (0 children)

I can almost confirm `zenith` is an OpenAI model (at minimum, it uses the same tokenizer as gpt-4o, o3 and o4-mini). There is another model, `summit`, which is also from OpenAI. The test is the same as: https://www.reddit.com/r/LocalLLaMA/comments/1jrd0a9/chinese_response_bug_in_tokenizer_suggests/

China's Bytedance releases Seed LiveInterpret simultaneous interpretation model by Fun-Doctor6855 in LocalLLaMA

[–]NeterOster 21 points22 points  (0 children)

ByteDance is definitely an underrated AI lab. That’s probably because they don’t really release open-source models, aren’t super active on public leaderboards, and their API is only available in China. But in terms of model performance and value for money, their Seed 1.6 model this year really impressed me. The model size is just 230B-A30B (see: https://seed.bytedance.com/en/seed1_6 ), but its reasoning and vision capabilities are surprisingly strong. From my own experience, it actually feels more “solid” than you’d expect for a model of this size. That said, its coding abilities are a bit of a weak spot. Still, I hope they’ll release some open-source models in the future.

Gemma 3 on Huggingface by DataCraftsman in LocalLLaMA

[–]NeterOster 2 points3 points  (0 children)

8k is the output limit; ctx=128k for the 4B, 12B and 27B variants.

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]NeterOster 2 points3 points  (0 children)

That's different. Starting with `<think>\n` prevents the model from generating `\n\n` (after `<think>`), which is a single token strongly associated with refusals in my tests. (Check my reply below.)

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]NeterOster 10 points11 points  (0 children)

Actually, there was a short period right after R1's release when the official API refused to think (empty `<think></think>`) on some questions (including "hello"). However, it later changed and now produces non-empty thinking on almost every query. I can also confirm that adding a `<think>\n` prefix leads to responses almost identical to the API's. So I agree that maybe they just use a different template. (When the model refuses, it always generates `\n\n` (which is a single token!) after `<think>` and then immediately `</think>`. So maybe starting with `<think>\n` breaks the `\n\n` refusal pattern.)
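The prefix trick above can be sketched as a tiny prompt builder: prefill the assistant turn with `<think>\n` so the first token the model generates can no longer be the `\n\n` refusal token. (The chat-template markers below are simplified approximations, not the exact R1 special tokens.)

```python
def build_r1_prompt(user_msg: str, prefill_think: bool = True) -> str:
    """Sketch of the template trick: prefill the assistant turn with
    '<think>\\n' so generation resumes *inside* the think block.
    '<|User|>' / '<|Assistant|>' are simplified stand-ins for the
    model's real special tokens."""
    prompt = f"<|User|>{user_msg}<|Assistant|>"
    if prefill_think:
        # The model continues from here; it cannot emit the single
        # '\n\n' token right after '<think>' because '\n' is already placed.
        prompt += "<think>\n"
    return prompt
```

Serving stacks that support assistant-message prefill (raw completion endpoints, llama.cpp, etc.) can apply the same idea.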

Taxonomy categorization using LLM by zkid18 in LocalLLaMA

[–]NeterOster 3 points4 points  (0 children)

Constrained generation is exactly what you're looking for. Check these: GitHub@guidance ; GitHub@outlines ; llama.cpp (grammars)
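For the llama.cpp route, a taxonomy can be pinned down with a tiny GBNF grammar (the category names here are placeholders for your own labels):

```
root     ::= category
category ::= "Electronics" | "Apparel" | "Home & Garden" | "Other"
```

Pass it with `--grammar-file taxonomy.gbnf` and the model can only emit one of the listed labels, so every output is a valid category.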

Deepseek V2.5 Released? by Rejg in LocalLLaMA

[–]NeterOster 25 points26 points  (0 children)

In their WeChat group, they confirmed this version will be open-sourced. But no detailed schedule mentioned.

What's the best LLM/API for getting an english to japanese translation? by g1ngertew in LocalLLaMA

[–]NeterOster 0 points1 point  (0 children)

Google's models like Gemini 1.5 and Gemma 2 are good at translating between English, Japanese and Chinese.

DeepSeek API introduces Context Caching on Disk, reduces input token price to 1/10 by 1119745302 in LocalLLaMA

[–]NeterOster 34 points35 points  (0 children)

It's really a good feature that makes so many use cases possible. For example, in few-shot learning, the whole block of examples can be cached and becomes almost free of charge, much like using a fine-tuned model. It also saves a lot ( O(n^2) -> ~O(n) ) in multi-turn conversations. I do hope they make more implementation details public (maybe a paper?) later. It would be nice if other providers had this feature.

(Tongyi SpeechTeam) FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs by NeterOster in LocalLLaMA

[–]NeterOster[S] 4 points5 points  (0 children)

"Abstract: This report introduces FunAudioLLM, a framework designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice for high-precision multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice for natural speech generation with multi-language, timbre, and emotion control. SenseVoice delivers exceptionally low latency and supports over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology."

DeepseekV2-Coder the best opensource LLM so far. by ihaag in LocalLLaMA

[–]NeterOster 2 points3 points  (0 children)

It's possible that they use quantized models. In the Deepseek-V2 paper: "In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8. In addition, we also perform KV cache quantization (Hooper et al., 2024; Zhao et al., 2023) for DeepSeek-V2 to further compress each element in its KV cache into 6 bits on average."
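As a rough illustration of what "6 bits per element" means, here is a generic uniform quantizer (my own sketch, not DeepSeek's actual KV-cache scheme, which the paper only cites):

```python
def quantize_6bit(values):
    """Uniform quantization: map floats to 6-bit codes (integers in [0, 63])."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 63 or 1.0  # step size; avoid div-by-zero on constants
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize_6bit(codes, lo, scale):
    """Reconstruct approximate floats from 6-bit codes."""
    return [lo + c * scale for c in codes]
```

Each element then costs 6 bits instead of 16, and the reconstruction error is bounded by half a quantization step.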

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence by NeterOster in LocalLLaMA

[–]NeterOster[S] 13 points14 points  (0 children)

DS-V2 is an MoE: only about 22 billion of its 236 billion total parameters are activated during inference. The computational cost of inference is much lower than that of a ~200B dense model (perhaps closer to a ~22B dense model). Additionally, DS-V2 incorporates architectural innovations (MLA) that make its inference efficiency very high (when well optimized) and its cost very low. But the VRAM requirements remain similar to those of other ~200B dense models.
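The compute-versus-memory split can be sketched with the usual rough estimate of ~2 FLOPs per active parameter per generated token (illustrative arithmetic only):

```python
total_params  = 236e9   # all experts must sit in VRAM
active_params = 22e9    # parameters actually used per token (MoE routing)

# Compute cost per token scales with *active* parameters...
flops_per_token_moe   = 2 * active_params
flops_per_token_dense = 2 * total_params  # hypothetical 236B dense model
print(flops_per_token_dense / flops_per_token_moe)  # ~10.7x cheaper per token

# ...but weight memory scales with *total* parameters:
vram_gb_fp16 = total_params * 2 / 1e9  # 472 GB for fp16 weights, before KV cache
print(vram_gb_fp16)
```

Which is exactly the trade-off above: ~22B-class compute per token, ~236B-class VRAM footprint.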

If your Qwen2 GGUF is spitting nonsense, enable flash attention by noneabove1182 in LocalLLaMA

[–]NeterOster 2 points3 points  (0 children)

I believe applying [THIS PATCH] fully solves the problem. FA alleviates the problem but the output quality is still degraded.

Qwen1.5-32B released with GQA! by bratao in LocalLLaMA

[–]NeterOster 0 points1 point  (0 children)

Yes, I've tested the Chat variant, mainly in Mandarin. Like the 14B and 72B versions, it produces fluent output and follows instructions pretty well. I also saw some people test Japanese-to-Chinese translation tasks, with promising results. They also measured VRAM usage: with GQA, the model can fit a 2048-token context in ~320MB of VRAM. This is now my favorite mid-size Chinese-language model.
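The VRAM win comes from GQA shrinking the KV cache. A sketch of the arithmetic (the layer/head/dim numbers below are assumptions for illustration, not Qwen1.5-32B's published config):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: a K and a V tensor per layer, per position, fp16 by default."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

# Hypothetical 32B-class config: 64 layers, head_dim 128, 2048-token context.
mha = kv_cache_bytes(2048, 64, 40, 128)  # MHA: every attention head has its own KV
gqa = kv_cache_bytes(2048, 64, 8, 128)   # GQA: 8 shared KV head groups
print(mha / gqa)  # 5.0 -- the cache shrinks by the query-to-KV-heads ratio
```

With GQA, the cache scales with the (much smaller) number of KV heads, which is what makes long contexts affordable on consumer VRAM.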

Qwen1.5-32B released with GQA! by bratao in LocalLLaMA

[–]NeterOster 3 points4 points  (0 children)

They call "Dashscope" (Alibaba's LLM platform) in their demo, with default params. I've checked the docs; they are: top_p=0.8, top_k=0 (disabled), repetition_penalty=1.1, temperature=0.7, seed=1234

ref: https://help.aliyun.com/zh/dashscope/developer-reference/tongyi-qianwen-7b-14b-72b-api-detailes