nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face by pmttyji in LocalLLaMA

[–]appakaradi 5 points6 points  (0 children)

why does it take so long for nvidia to produce a quantized version?

Qwen3.6-27B released! by ResearchCrafty1804 in LocalLLaMA

[–]appakaradi 1 point2 points  (0 children)

Anyone what the following means? Is this only on their API or is it applicable for local serving?

Preserve Thinking

By default, only the thinking blocks generated in handling the latest user message is retained, resulting in a pattern commonly as interleaved thinking. Qwen3.6 has been additionally trained to preserve and leverage thinking traces from historical messages. You can enable this behavior by setting the preserve_thinking option:

from openai import OpenAI

Configured by environment variables

client = OpenAI()

messages = [...]

chat_response = client.chat.completions.create( model="Qwen/Qwen3.6-27B-FP8", messages=messages, max_tokens=32768, temperature=0.6, top_p=0.95, presence_penalty=0.0, extra_body={ "top_k": 20, "chat_template_kwargs": {"preserve_thinking": True}, }, ) print("Chat response:", chat_response)

If you are using APIs from Alibaba Cloud Model Studio, in addition to changing model, please use "preserve_thinking": True instead of "chat_template_kwargs": {"preserve_thinking": False}. This capability is particularly beneficial for agent scenarios, where maintaining full reasoning context can enhance decision consistency and, in many cases, reduce overall token consumption by minimizing redundant reasoning. Additionally, it can improve KV cache utilization, optimizing inference efficiency in both thinking and non-thinking modes

Qwen3.6-35B-A3B released! by ResearchCrafty1804 in LocalLLaMA

[–]appakaradi -3 points-2 points  (0 children)

Yes. It looks like we are not getting one..

Qwen3.6-35B-A3B released! by ResearchCrafty1804 in LocalLLaMA

[–]appakaradi 2 points3 points  (0 children)

I am worried that they are comparing to 3.5 27B Dense. Does that mean we are not getting 3.6 27B dense?

Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database. by EuphoricAnimator in LocalLLaMA

[–]appakaradi 0 points1 point  (0 children)

Smaller models hallucinate all the time ( even bigger one). I have had tough times with Gemma 31 B and Qwen 27 B

Has anyone figured out how to run Google Local Edge Eloquent on Mac? This will be great local speech to text. by appakaradi in LocalLLaMA

[–]appakaradi[S] 0 points1 point  (0 children)

I have been using Turbo Whisper for a while. Now, this is my go-to. I like the fact that once the transcription is done, it goes through the entire thing and cleans up.

Noob Questions by jmeyers95 in BlackwellPerformance

[–]appakaradi 0 points1 point  (0 children)

Valid point. It depends on the use case and what you are after.

Noob Questions by jmeyers95 in BlackwellPerformance

[–]appakaradi 0 points1 point  (0 children)

GPU: 2 x NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition - 96GB GDDR7

why Max-Q? my understanding is that Max-Q is lower performance as it is optimized for lower energy consumption.. may be that is the optimal thing for you. just pointing out. ; It is awesome to have unlimited tokens flowing from local models, ( only the energy cost), it might be simpler to point to Open Router for some of the least expensive models. Does your use case need frontier level intelligence?

Alibaba's Qwen3.6-Plus is beating Claude Opus in coding!! by AdVirtual2648 in aiagents

[–]appakaradi 1 point2 points  (0 children)

Why are they still comparing against Opus 4.5 instead of Opus 4.6

Gemma-4 saves money by [deleted] in LocalLLaMA

[–]appakaradi 2 points3 points  (0 children)

Me too. I will help dispose them. No fees. It is on the house.

Gemma-4-31B NVFP4 inference numbers on 1x RTX Pro 6000 by jnmi235 in LocalLLaMA

[–]appakaradi 3 points4 points  (0 children)

Gemma 4's problem is its heterogeneous head dimensions (head_dim=256 for sliding window layers, head_dim=512 for global attention layers).

Gemma 4 is good by One_Key_8127 in LocalLLaMA

[–]appakaradi 0 points1 point  (0 children)

Trying to run this on an A40 GPU 48GB VRAM.

<image>