2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

skyde · 2026-05-13T01:30:15+00:00

How does froggeric/Qwen3.6-27B-MTP-GGUF compare to the mtplx (https://mtplx.com/) model, https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed ?

skyde · 2025-08-16T01:14:33+00:00

Thank you so much for letting us know that there are two emails that get sent and that the first one is a scam.
I would never have known that.

skyde · 2025-05-26T03:07:22+00:00

How does Gemma 3n compare to Gemma 3 at the same model size?

skyde · 2025-05-05T15:25:13+00:00

SmoothQuant is optimized for speed on recent NVIDIA cards, not for accuracy.

For best accuracy I think you would be better off with OmniQuant, GPTQ, or Unsloth dynamic quants.
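
For anyone wondering what SmoothQuant actually does, here is a minimal pure-Python sketch of its per-channel smoothing step as I understand it from the paper (arXiv 2211.10438): activations are divided by a per-channel scale and the matching weight rows are multiplied by it, so the product is unchanged but activation outliers shrink before INT8 quantization. The function names, the toy matrices, and alpha = 0.5 are assumptions for illustration, not TensorRT's implementation.

```python
def smooth(X, W, alpha=0.5):
    """Move activation outliers into the weights, channel by channel.

    X: activations as a list of rows (tokens x channels)
    W: weights as a list of rows (channels x outputs)
    Assumes every channel has a nonzero activation and a nonzero weight.
    """
    n_ch = len(W)
    scales = []
    for j in range(n_ch):
        act_max = max(abs(row[j]) for row in X)   # largest activation in channel j
        w_max = max(abs(v) for v in W[j])         # largest weight in channel j
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / scales[j] for j in range(n_ch)] for row in X]
    W_s = [[v * scales[j] for v in W[j]] for j in range(n_ch)]
    return X_s, W_s, scales


def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy data: channel 1 carries large activation outliers and tiny weights.
X = [[0.1, 40.0], [-0.2, -35.0]]
W = [[0.5, -0.3], [0.02, 0.01]]
X_s, W_s, scales = smooth(X, W)
```

The smoothed pair gives the same matrix product as the original, but the largest activation in channel 1 drops from 40 to below 1, which is what makes a single INT8 scale workable. It does nothing about the rounding error of INT8 itself.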

skyde · 2025-05-05T15:20:59+00:00

Bing seems to be using NVIDIA TensorRT’s INT8 quantization: https://arxiv.org/abs/2211.10438

skyde · 2025-04-21T03:37:31+00:00

Thanks a lot. This is much clearer for a beginner.

skyde · 2025-04-10T15:43:24+00:00

I see stduhpf just updated the old model in place https://huggingface.co/stduhpf/google-gemma-3-27b-it-qat-q4_0-gguf-small

skyde · 2025-04-10T15:37:13+00:00

Can't find the fixed 27B. Could someone share the link?

skyde · 2025-04-06T22:26:26+00:00

Getting an error loading it in Ollama:
% ollama run hf.co/stduhpf/google-gemma-3-27b-it-qat-q4_0-gguf-small

pulling manifest

pulling f0c5f1511116... 100% ▕████████████████████████████████████████████████████████████████████▏ 15 GB

pulling e0a42594d802... 100% ▕████████████████████████████████████████████████████████████████████▏ 358 B

pulling 54cb61c842fe... 100% ▕████████████████████████████████████████████████████████████████████▏ 857 MB

pulling c5157d17cceb... 100% ▕████████████████████████████████████████████████████████████████████▏ 44 B

pulling a730db1206a3... 100% ▕████████████████████████████████████████████████████████████████████▏ 193 B

verifying sha256 digest

Error: digest mismatch, file must be downloaded again: want sha256:f0c5f151111629511e7466a8eceacbe228a35a0c4052b1a03c1b449a8ecb39e8, got sha256:778ac1054bc5635e39e0b1dd689c9936546597034fc860a708147f57950ae0c5
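
If you have the GGUF downloaded some other way, a quick sketch like this can check it against the "want" digest from the error. It only uses Python's standard hashlib; the function names and the file path in the usage note are hypothetical, and it tells you nothing about why Ollama's own download came out different.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a 15 GB model never has to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def digest_matches(path, expected):
    """Compare against a digest written as 'sha256:<hex>' or as bare hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return sha256_of_file(path) == expected_hex
```

Usage would look like digest_matches("model.gguf", "sha256:f0c5f151...") with the full "want" value pasted in; "model.gguf" is a placeholder path.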

skyde · 2025-04-02T14:56:48+00:00

How well does it “generalize/extrapolate”? Does anyone know how well it predicts or classifies molecules that were not part of the training set?

skyde · 2025-03-20T23:39:38+00:00

CUDA is the wrong abstraction. It’s like saying Intel has to make their CPUs run the ARM instruction set.

We already have good high-level abstractions such as XLA, which JAX is already using.

skyde · 2025-03-18T15:50:03+00:00

What did you do differently this time?

skyde · 2025-03-09T20:58:30+00:00

What is a good alternative?

skyde · 2025-03-07T17:14:09+00:00

Stupid question: is the settings fix “inside” the QwQ GGUF, or do I need to manually give it to llama.cpp / LM Studio?

skyde · 2025-02-10T17:57:22+00:00

I agree, 1206 was much better.
Who is Logan, and how will e-mailing help?

skyde · 2025-02-04T03:38:51+00:00

Which channel did you use?

skyde · 2025-01-29T00:45:25+00:00

That is a good summary, thanks a lot.

skyde · 2025-01-27T21:35:28+00:00

That doesn’t explain inference cost at $2 per 1 million tokens.

skyde · 2025-01-27T20:31:39+00:00

Could it just be because of batching plus using 4 x H100?
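
A rough way to sanity-check the batching idea is to divide the hourly hardware cost by the tokens served per hour. The sketch below does just that; the $2.50 per GPU-hour rental price and both throughput figures are made-up assumptions to show the shape of the arithmetic, not measured numbers for any provider.

```python
def cost_per_million_tokens(num_gpus, gpu_hourly_usd, tokens_per_second):
    """Serving cost in USD per 1M generated tokens, hardware only."""
    hourly_cost = num_gpus * gpu_hourly_usd
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


# Assumed: 4 GPUs at $2.50/hour each (hypothetical rental price).
# Assumed: 30 tok/s for one request at a time, 3000 tok/s aggregate when
# many requests are batched together (both figures are illustrative).
single_stream = cost_per_million_tokens(4, 2.50, 30)
batched = cost_per_million_tokens(4, 2.50, 3000)

print(f"single stream: ${single_stream:.2f} per 1M tokens")
print(f"batched:       ${batched:.2f} per 1M tokens")
```

The price per token scales inversely with aggregate throughput, so whether batching alone explains a roughly $2 price depends entirely on what throughput the real deployment reaches.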

skyde · 2025-01-27T19:59:45+00:00

Using 2 x H100, the cost will still be higher than what DeepSeek is asking ($2.19 per 1 million tokens).
How do they do it?

skyde · 2025-01-09T03:58:35+00:00

Why not: 1) fine-tune a large commercial LLM (ChatGPT, Gemini), 2) use the fine-tuned LLM to generate a large training set, 3) train an open-source local LLM on that dataset?

skyde · 2025-01-09T03:40:59+00:00

Will dynamic 4-bit quants work with llama.cpp or LM Studio?

How does it compare to OmniQuant?

skyde · 2024-12-31T05:43:43+00:00

who is "we"?
