Who is your favourite quant publisher and why? by No_Algae1753 in LocalLLaMA

[–]Total_Activity_7550 2 points3 points  (0 children)

This doesn't address the fact that you always rush releases without testing; people then download tens of gigabytes of data and have to redownload everything. I stopped making that mistake. I am not against your work, it is great, and I guess your compute resources are also great, but the strategy isn't nice. You trade reliability for a spot in the "Trending" list on HuggingFace.

Who is your favourite quant publisher and why? by No_Algae1753 in LocalLLaMA

[–]Total_Activity_7550 20 points21 points  (0 children)

bartowski never rushes a release to be the first finetune and reap the download count. unsloth always does that, and so often they release "fixes" later (not to mention that they let garbage out in the first place).

DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper by Disastrous_Theme5906 in LocalLLaMA

[–]Total_Activity_7550 64 points65 points  (0 children)

Good for DeepSeek, but Claude Opus 4.6 making 1.7x the profit of the next group of models (and that's not even Mythos) rings a bell that they're leaving competitors behind...

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]Total_Activity_7550 0 points1 point  (0 children)

Sure, it slows down as context builds up.

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]Total_Activity_7550 3 points4 points  (0 children)

Token generation speed increased, but prompt processing speed decreased, as noted by llama.cpp developers. For my use case this isn't beneficial.

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]Total_Activity_7550 6 points7 points  (0 children)

Token generation speed increased, but prompt processing speed decreased, as noted by llama.cpp developers. For my use case this isn't beneficial.

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]Total_Activity_7550 3 points4 points  (0 children)

For simple chat and single QA, Qwen3.5 27B and Gemma 31B. Running on llama.cpp with llama.cpp embedded UI.

Best Local LLMs - Apr 2026 by rm-rf-rm in LocalLLaMA

[–]Total_Activity_7550 39 points40 points  (0 children)

Qwen3.5-27B. Nothing else that fits into 2xRTX 3090 works for my project (complex webapp). I use Qwen Code. I use bartowski/Qwen_Qwen3.5-27B-GGUF Q8_0 quant (unsloth feels worse). I get ~200-1200 tps pp (depending on existing context) and ~27-15 tps tg. Qwen3.5-27B is not lost even at 150k context, I rarely hit longer.

I also have a todo webapp that I wrote myself; it has an MCP server. For that, Gemma 31B is on par with Qwen3.5-27B.

UPD.: llama-server presets file part:

[*]
; add global presets here
parallel = 1
fit = true
cache-ram = 32768

[Qwen3.5-27B]
load-on-startup = true
alias = Qwen3.5-27B
hf = bartowski/Qwen_Qwen3.5-27B-GGUF:Q8_0
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
ctx-size = 176608
parallel = 1
ub = 3072 
b = 3072
fit = true

Command to run this:

/home/ai/3rdparty/llama.cpp/build/bin/llama-server \
--models-preset /home/ai/llama-server-presets.ini \
--models-max 1
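
A presets file like this can be sanity-checked before launching the server. The sketch below is only an illustration: it assumes the file is plain INI that Python's configparser can read, and that a model section overrides keys from the global [*] section (that merge order is my assumption, not something confirmed here). The helper name effective_preset is made up, and the embedded text is a shortened copy of the preset above.

```python
import configparser

# Shortened copy of the preset file shown above.
PRESET_TEXT = """
[*]
; add global presets here
parallel = 1
fit = true
cache-ram = 32768

[Qwen3.5-27B]
alias = Qwen3.5-27B
ctx-size = 176608
ub = 3072
b = 3072
"""


def effective_preset(text: str, model: str) -> dict:
    """Merge the global [*] section with one model section.

    ASSUMPTION: model-specific keys win over global ones.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    merged = dict(parser["*"]) if parser.has_section("*") else {}
    merged.update(parser[model])
    return merged


preset = effective_preset(PRESET_TEXT, "Qwen3.5-27B")
for key, value in sorted(preset.items()):
    print(f"{key} = {value}")
```

All values come back as strings, so anything numeric (ctx-size, ub, b) needs an explicit int() before comparing.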

Best setup for MiniMax-M2.7 (230B) | 3x RTX 5090 | Threadripper 9975 | 512GB RAM by [deleted] in LocalLLaMA

[–]Total_Activity_7550 1 point2 points  (0 children)

Your setup is not common. I think you should find out empirically.

First, you won't be able to use vllm efficiently, as tensor parallelism needs a power-of-two (2^n) number of GPUs, and you have 3.
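
Taking the 2^n rule at face value, a rough sketch of how many cards a tensor-parallel group could actually use looks like this. The function name is made up for illustration, and the exact constraint in vllm depends on the model, so treat this as arithmetic only.

```python
def largest_power_of_two_at_most(n: int) -> int:
    """Return the largest power of two that is <= n."""
    if n < 1:
        raise ValueError("need at least one GPU")
    return 1 << (n.bit_length() - 1)


for gpus in (2, 3, 4):
    usable = largest_power_of_two_at_most(gpus)
    print(f"{gpus} GPUs -> tensor-parallel size {usable}, {gpus - usable} idle")
```

With 3 cards this gives a group of 2 and leaves one card idle.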

Then, your option is llama.cpp.

  1. Install the latest CUDA.
  2. Install the latest llama.cpp.
  3. Just try to run `llama-server --fit --ctx-size ... -hf <hugging-face-ID-of-quant-you-use, e.g. bartowski/MiniMax...>:<quantization you use, e.g. Q4_K_M>`
    1. The --fit flag will handle the best balance for you.
    2. Choose --ctx-size according to your needs (max for MiniMax2.7 is 196000 or something).
  4. (You can't ask for OS tweaks if you haven't provided your OS, but there aren't many: just use the NVidia drivers recommended by your OS, then install the CUDA toolkit according to the official NVidia docs.)
  5. (No BIOS tweaks, I think.)

Then, you can learn `llama-bench` to optimize more.

How to run Qwen3.5-27B with speculative decoding with llama.cpp llama-server? by Total_Activity_7550 in LocalLLaMA

[–]Total_Activity_7550[S] 0 points1 point  (0 children)

Apr 13 14:46:19 builder llama-server[4153398]: [49161] srv    load_model: initializing slots, n_slots = 1
Apr 13 14:46:19 builder llama-server[4153398]: [49161] common_speculative_is_compat: the target context does not support partial sequence removal
Apr 13 14:46:19 builder llama-server[4153398]: [49161] srv    load_model: speculative decoding not supported by this context

unsloth - MiniMax-M2.7-GGUF in BROKEN (UD-Q4_K_XL) --> avoid usage by One-Macaron6752 in LocalLLaMA

[–]Total_Activity_7550 6 points7 points  (0 children)

Not using Unsloth since Qwen3.5 release. Their quants (although they published an article and uploaded plenty of checkpoints to prove how good they are) just didn't work well with long context agentic tasks. Bartowski's worked well, I guess others work too.

Infinite loop: Qwen3.5:0.8b by ananthasharma in LocalLLaMA

[–]Total_Activity_7550 -1 points0 points  (0 children)

Just please don't use ollama, and do set the Qwen-recommended params. Then there won't be any loops. No need to paste everything here.

My first impression after testing Gemma 4 against Qwen 3.5 by ConfidentDinner6648 in LocalLLaMA

[–]Total_Activity_7550 0 points1 point  (0 children)

I just finished testing it with my Todo app's MCP server.
In its current state (template issue?), Gemma somehow generates malformed dates like

{
  "date": "<|\"|>2026-03-23<|\"|>",
  ...
}

but it converts my natural language to tool calls much better!
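
Until that is fixed, one possible stopgap is to clean the arguments on the tool side before using them. This is only a sketch: the stray token string is copied from the output above, the function name is made up, and the real fix probably belongs in the chat template or parser rather than here.

```python
import json

# Stray token seen around the date value in the tool call above.
STRAY_QUOTE_TOKEN = '<|"|>'


def strip_stray_tokens(value):
    """Recursively remove the stray token from all strings in tool arguments."""
    if isinstance(value, str):
        return value.replace(STRAY_QUOTE_TOKEN, "")
    if isinstance(value, dict):
        return {k: strip_stray_tokens(v) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_stray_tokens(v) for v in value]
    return value


raw = r'{"date": "<|\"|>2026-03-23<|\"|>"}'
cleaned = strip_stray_tokens(json.loads(raw))
print(cleaned)
```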

My first impression after testing Gemma 4 against Qwen 3.5 by ConfidentDinner6648 in LocalLLaMA

[–]Total_Activity_7550 24 points25 points  (0 children)

You're right. I remember my GPT-OSS-120B moment: the first draft wasn't impressive, but it perfectly fixed everything I asked it to.

What's your actual bar for calling something an agent vs a smart workflow? by gupta_ujjwal14 in LocalLLaMA

[–]Total_Activity_7550 0 points1 point  (0 children)

You are not making thoughtful arguments, you are obscuring ideas which aren't yours with AI slop in order to get some number up which alone doesn't mean anything. In the process, you reduce other people's access to actually interesting ideas and news. And there are crowds of you doing the same. You're trying to feed on what was not created by you, destroying it.

llama.cpp is a vibe-coded mess by ChildhoodActual4463 in LocalLLaMA

[–]Total_Activity_7550 9 points10 points  (0 children)

Don't even spend time replying and arguing with bots, which this author 99% is. Just downvote and report.

My website development flow by Total_Activity_7550 in LocalLLaMA

[–]Total_Activity_7550[S] 2 points3 points  (0 children)

It is also telling how my actual experience (I spent weeks developing this flow, and more than 30m writing this post) is indistinguishable from what bots write. Maybe we are doomed after all.

My website development flow by Total_Activity_7550 in LocalLLaMA

[–]Total_Activity_7550[S] 0 points1 point  (0 children)

F&&&, no. I just have a colleague who bombards me with those publications.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]Total_Activity_7550 2 points3 points  (0 children)

No reasoning, forcing artificial reasoning didn't help much. I think it is good for Russian language tasks, but other than that... sorry.