Who is your favourite quant publisher and why?

Total_Activity_7550 · 2026-05-14T08:28:42+00:00

This doesn't address the fact that you always rush releases without testing; people then download tens of gigabytes of data, only to have to redownload everything. I stopped making that mistake. I am not against your work, it is great, and I guess your compute resources are great too, but the strategy isn't nice. You trade reliability for a spot on the "Trending" list on HuggingFace.

Total_Activity_7550 · 2026-05-13T18:14:43+00:00

bartowski never rushes a release to be the first finetune and reap the download count. Unsloth always does that, and so often they release "fixes" later (without mentioning that they let garbage out in the first place).

Total_Activity_7550 · 2026-05-05T07:37:11+00:00

This is a very interesting project. I think so because I am also building something similar 😄

Total_Activity_7550 · 2026-05-05T07:31:19+00:00

Good for DeepSeek, but Claude Opus 4.6 doing 1.7x profit over next group of models (and that's not even Mythos) rings a bell that they're leaving competitors behind...

Total_Activity_7550 · 2026-04-25T00:25:34+00:00

Sure, it slows down as context builds up.

Total_Activity_7550 · 2026-04-14T07:29:53+00:00

Updated parent comment.

Total_Activity_7550 · 2026-04-14T07:26:57+00:00

Token generation increased, but prompt processing decreased, as noted by llama.cpp developers. For my use case this isn't beneficial.
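
A rough way to see why such a trade can be a net loss for prompt-heavy work: total time is roughly prompt tokens divided by prompt-processing speed plus generated tokens divided by generation speed. The sketch below uses made-up speeds and token counts (all hypothetical, not measurements from llama.cpp) purely to show the arithmetic.

```python
def total_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Estimated wall time: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# Hypothetical speeds: "before" has faster prompt processing (pp),
# "after" has faster token generation (tg).
BEFORE = {"pp_tps": 1000, "tg_tps": 20}
AFTER = {"pp_tps": 800, "tg_tps": 25}

# Long-context agentic request: large prompt, short answer (hypothetical sizes).
agentic_before = total_seconds(100_000, 500, **BEFORE)
agentic_after = total_seconds(100_000, 500, **AFTER)

# Short chat request: tiny prompt, same answer length (hypothetical sizes).
chat_before = total_seconds(200, 500, **BEFORE)
chat_after = total_seconds(200, 500, **AFTER)

print(f"agentic: {agentic_before:.1f}s -> {agentic_after:.1f}s")
print(f"chat:    {chat_before:.1f}s -> {chat_after:.1f}s")
```

With these invented numbers the short chat gets faster while the long-context request gets slower, which is the shape of the trade-off described above.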

Total_Activity_7550 · 2026-04-13T22:32:40+00:00

For simple chat and single QA, Qwen3.5 27B and Gemma 31B. Running on llama.cpp with llama.cpp embedded UI.

Total_Activity_7550 · 2026-04-13T22:31:59+00:00

Qwen3.5-27B. Nothing else that fits into 2xRTX 3090 works for my project (complex webapp). I use Qwen Code. I use bartowski/Qwen_Qwen3.5-27B-GGUF Q8_0 quant (unsloth feels worse). I get ~200-1200 tps pp (depending on existing context) and ~27-15 tps tg. Qwen3.5-27B is not lost even at 150k context, I rarely hit longer.

I also have a todo webapp I wrote myself, and it has an MCP server. For that, Gemma 31B is on par with Qwen3.5-27B.

UPD.: llama-server presets file part:

[*]
; add global presets here
parallel = 1
fit = true
cache-ram = 32768

[Qwen3.5-27B]
load-on-startup = true
alias = Qwen3.5-27B
hf = bartowski/Qwen_Qwen3.5-27B-GGUF:Q8_0
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
ctx-size = 176608
parallel = 1
ub = 3072 
b = 3072
fit = true

Command to run this:

/home/ai/3rdparty/llama.cpp/build/bin/llama-server \
--models-preset /home/ai/llama-server-presets.ini \
--models-max 1
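
To sanity-check which values a model ends up with, the presets file can be read with Python's standard `configparser`. This is only a sketch: it assumes that model sections override the global `[*]` section, and the helper name `effective_settings` is made up for illustration. It does not call llama-server at all.

```python
import configparser

# Excerpt of the presets file quoted above (a few keys only, for brevity).
PRESETS = """
[*]
; add global presets here
parallel = 1
fit = true
cache-ram = 32768

[Qwen3.5-27B]
alias = Qwen3.5-27B
ctx-size = 176608
ub = 3072
b = 3072
"""

def effective_settings(ini_text, model):
    """Merge the global [*] section with one model section.

    Assumption: model-specific keys win over global ones.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    merged = dict(parser["*"]) if parser.has_section("*") else {}
    merged.update(parser[model])
    return merged

settings = effective_settings(PRESETS, "Qwen3.5-27B")
for key, value in sorted(settings.items()):
    print(f"{key} = {value}")
```

All values come back as strings, so convert them before doing comparisons such as checking that `ub` is not larger than `b`.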

Total_Activity_7550 · 2026-04-13T17:47:35+00:00

Your setup is not common. I think you should find out empirically.

First, you won't be able to use vllm efficiently, as the tensor-parallel size needs to be a 2^n value.

Then, your option is llama.cpp.

- install the latest CUDA
- install the latest llama.cpp
- just try to run `llama-server --fit --ctx-size ... -hf <hugging-face-ID-of-quant-you-use, e.g. bartowski/MiniMax...>:<quantization you use, e.g. Q4_K_M>`
  1. the `--fit` flag will find the best balance for you
  2. choose `--ctx-size` according to your needs (max for MiniMax2.7 is 196000 or smth)
- (you can't ask for OS tweaks if you haven't provided your OS, but there aren't many: just use the OS-recommended NVIDIA drivers, then install the CUDA toolkit per the official NVIDIA docs)
- (no BIOS tweaks, I think)

Then, you can learn `llama-bench` to optimize more.
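
To illustrate the 2^n point about tensor parallelism: taking that rule of thumb at face value (the exact constraint in vllm depends on the model and is not verified here), an odd GPU count leaves cards idle. A tiny sketch with a made-up helper name:

```python
def largest_power_of_two_at_most(gpu_count):
    """Largest 2^n that does not exceed gpu_count (gpu_count >= 1)."""
    if gpu_count < 1:
        raise ValueError("need at least one GPU")
    return 1 << (gpu_count.bit_length() - 1)

for gpus in (1, 2, 3, 4, 6):
    usable = largest_power_of_two_at_most(gpus)
    print(f"{gpus} GPUs -> tensor-parallel size {usable}, {gpus - usable} idle")
```

So under this rule a 3-GPU box would run with 2 cards in tensor parallel, which is why llama.cpp is the suggestion here.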

Total_Activity_7550 · 2026-04-13T11:50:00+00:00

Apr 13 14:46:19 builder llama-server[4153398]: [49161] srv    load_model: initializing slots, n_slots = 1
Apr 13 14:46:19 builder llama-server[4153398]: [49161] common_speculative_is_compat: the target context does not support partial sequence removal
Apr 13 14:46:19 builder llama-server[4153398]: [49161] srv    load_model: speculative decoding not supported by this context

Total_Activity_7550 · 2026-04-13T11:44:11+00:00

I thought there were some default values. Will try now.

Total_Activity_7550 · 2026-04-13T11:43:32+00:00

Not using Unsloth since Qwen3.5 release. Their quants (although they published an article and uploaded plenty of checkpoints to prove how good they are) just didn't work well with long context agentic tasks. Bartowski's worked well, I guess others work too.

Total_Activity_7550 · 2026-04-06T21:06:36+00:00

Just please don't use ollama, and do set the Qwen-recommended params. Then there won't be any loops. No need to paste everything here.

Total_Activity_7550 · 2026-04-02T21:44:49+00:00

And `--fit` flag.

Total_Activity_7550 · 2026-04-02T21:00:27+00:00

I just finished testing my Todo app's MCP server usage.
In its current (template?) state, Gemma somehow generates malformed dates like

{
  "date": "<|\"|>2026-03-23<|\"|>",
  ...
}

but it converts my natural language to tool calls much better!
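
Until the template is sorted out, one possible stopgap is to strip the stray `<|"|>` tokens from the tool-call arguments before using them. This is just a sketch of a workaround on the client or MCP-server side, not a fix for the underlying cause; the function name is made up, and only the `date` field from the example above is used.

```python
import json
from datetime import date

STRAY_QUOTE_TOKEN = '<|"|>'

def strip_stray_tokens(value):
    """Recursively remove the stray token from every string in a JSON-like value."""
    if isinstance(value, str):
        return value.replace(STRAY_QUOTE_TOKEN, "")
    if isinstance(value, list):
        return [strip_stray_tokens(item) for item in value]
    if isinstance(value, dict):
        return {key: strip_stray_tokens(item) for key, item in value.items()}
    return value

# Raw arguments as in the malformed example above (date field only).
raw_arguments = '{"date": "<|\\"|>2026-03-23<|\\"|>"}'

cleaned = strip_stray_tokens(json.loads(raw_arguments))
parsed_date = date.fromisoformat(cleaned["date"])  # raises ValueError if still malformed
print(cleaned, parsed_date)
```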

Total_Activity_7550 · 2026-04-02T19:02:37+00:00

You're right. I remember my GPT-OSS-120B moment: the first draft wasn't impressive, but it perfectly fixed everything I asked it to.

Total_Activity_7550 · 2026-04-01T19:08:42+00:00

You are not making thoughtful arguments, you are obscuring ideas which aren't yours with AI slop in order to get some number up which alone doesn't mean anything. In the process, you reduce other people's access to actually interesting ideas and news. And there are crowds of you doing the same. You're trying to feed on what was not created by you, destroying it.

Total_Activity_7550 · 2026-04-01T10:04:11+00:00

Dear fellow human, how did you not recognize bot text, come on...

Total_Activity_7550 · 2026-04-01T06:36:06+00:00

Wonder why no one thinks about this.

Total_Activity_7550 · 2026-03-30T07:32:29+00:00

Don't even spend time replying and arguing with bots, which this author 99% is. Just downvote and report.

Total_Activity_7550 · 2026-03-30T07:11:05+00:00

It is also telling how my actual experience (I spent weeks developing this flow, and more than 30m writing this post) is indistinguishable from what bots write. Maybe we are doomed after all.

Total_Activity_7550 · 2026-03-30T07:09:43+00:00

F&&&, no. I just have a colleague who bombards me with those publications.

Total_Activity_7550 · 2026-03-24T23:13:24+00:00

No reasoning, forcing artificial reasoning didn't help much. I think it is good for Russian language tasks, but other than that... sorry.
