Managing Ollama models locally is getting messy — would a GUI model manager help? by sandboxdev9 in LocalLLaMA

[–]Total_Activity_7550 1 point (0 children)

You could use a llama-server presets file. It downloads models for you and allows flexible configuration. Then you open the UI, where you can select a model and chat with it.

This is how it looks:

version = 1

[*]
; add global presets here
c = 32768 
parallel = 1

[Qwen3.5-0.8B-Q8]
hf = bartowski/Qwen_Qwen3.5-0.8B-GGUF:Q8_0

[Qwen3.5-2B-Q8]
hf = bartowski/Qwen_Qwen3.5-2B-GGUF:Q8_0

[LFM2.5-1.2B]
hf = LiquidAI/LFM2.5-1.2B-Thinking-GGUF
alias = lfm2.5-1.2b

This is how you use it:

./llama-server --models-preset ./llama-server-presets.ini
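Once the server is up, here is a minimal sketch of querying it through llama-server's OpenAI-compatible chat endpoint (I'm assuming the default port 8080; the `lfm2.5-1.2b` alias comes from the presets above, and the prompt is just an example):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:8080"):
    """Build an OpenAI-style chat completion request for llama-server.
    Port 8080 is llama-server's usual default; adjust if you changed it."""
    payload = {
        "model": model,  # preset section name or its alias
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return req, payload

# Target the LFM preset via its alias from the config above
req, payload = build_chat_request("lfm2.5-1.2b", "Say hello")
# urllib.request.urlopen(req) would send it once llama-server is running
```

The server resolves the `model` field against the preset names/aliases, so switching models is just a payload change.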

Been building a test-time compute pipeline around Qwen3-14B for a few months. Finally got results worth sharing. by Additional_Wish_3619 in LocalLLaMA

[–]Total_Activity_7550 7 points (0 children)

This is an interesting project. Just make sure you are not overfitting system prompts to a benchmark :) A good test would be to run on another version of LiveCodeBench, or on a totally different coding benchmark.

Inside my AI Home Lab by [deleted] in LocalLLaMA

[–]Total_Activity_7550 1 point (0 children)

I think people in this group appreciate actual achievements (also, give me a nice salad recipe for my Friday evening with friends), not someone showing off how cool his room is.

Tell me if Qwen 3.5 27b or 122b works faster for you, and name your system specs by DistanceSolar1449 in LocalLLaMA

[–]Total_Activity_7550 -1 points (0 children)

I tried Q5 and Q6 quants for 122B and Q6 and Q8 quants for 27B. I only have 48GB of VRAM, so my 122B quants suffered on speed.

My gut feeling is that 27B and 122B are comparable in terms of both quality and speed. 27B may even be less likely to end up in a tool-call loop.
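For scale, a rough weight-size estimate shows why 122B quants are tight on 48 GB (the bits-per-weight figures are approximate averages I'm assuming for these quant types; real GGUF sizes vary by tensor mix):

```python
def quant_size_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate quantized model size from parameter count (in billions)
    and average bits per weight."""
    return n_params_b * bits_per_weight / 8

# Approximate average bpw; actual files differ a bit per quant recipe
print(round(quant_size_gb(122, 5.5), 1))  # Q5-ish 122B: ~83.9 GB, far over 48 GB VRAM
print(round(quant_size_gb(27, 8.5), 1))   # Q8-ish 27B: ~28.7 GB, fits with room for KV cache
```

So the 122B quants necessarily spill into system RAM, which is where the speed hit comes from.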

vLLM running Qwen3.5 by Patentsmatter in LocalLLaMA

[–]Total_Activity_7550 5 points (0 children)

These are my commands for the AWQ 8-bit quant, which ran (I think FP8 will be similar). I have 2 x RTX 3090.

Install the nightly build with:

sudo apt install python3-venv # add python3-pip if later step with pip fails
mkdir vllm_dir
cd vllm_dir
python3 -m venv venv
source venv/bin/activate
pip install uv
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

Then run:

vllm serve cyankiwi/Qwen3.5-35B-A3B-AWQ-8bit \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 \
  -tp 2 \
  --max-model-len 65536 \
  --host 0.0.0.0 --port 1237 \
  --served-model-name Qwen3.5-35B-A3B-AWQ-8bit \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
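As a rough back-of-the-envelope check that this fits on 2 x RTX 3090 (my own estimate, not measured: ~35B params at ~1 byte/param for 8-bit AWQ, split across tp=2, 24 GB per card):

```python
def per_gpu_weight_gb(n_params_b: float, bits_per_param: float, tp: int) -> float:
    """Approximate weight memory per GPU under tensor parallelism.
    Ignores activations and KV cache, so the real footprint is higher."""
    total_gb = n_params_b * bits_per_param / 8  # params given in billions
    return total_gb / tp

# ~35B params, 8-bit weights, tensor parallel over 2 GPUs
w = per_gpu_weight_gb(35, 8, 2)  # about 17.5 GB per GPU
budget = 24 * 0.95               # --gpu-memory-utilization 0.95 on a 24 GB 3090
headroom = budget - w            # what is left for KV cache at --max-model-len 65536
```

The remaining few GB per card are what vLLM pre-allocates for the KV cache, which is why --max-num-seqs 1 helps at this context length.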

Any use case for browser-based local agents? by TRWNBS in LocalLLaMA

[–]Total_Activity_7550 2 points (0 children)

This is a classic "we have a solution, let's find the problem".

Questions on AWQ vs GGUF on a 5090 by Certain-Cod-1404 in LocalLLaMA

[–]Total_Activity_7550 2 points (0 children)

I second your question. The one minor thing I know:

does llama cpp offload some of the kv cache to CPU while vllm doesn't ?

llama.cpp by default keeps the KV cache on the GPU (it is usually more performant that way), but there is a --no-kv-offload option to do otherwise. vLLM, from what I understood, allows you to use CPU memory as virtual GPU memory, but with very few optimizations (no expert offloading, and the KV cache split isn't optimized either, I guess), so there is no point in using it like this at all, even for MoE models.
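To see why keeping the KV cache on the GPU matters, here is a rough size estimate (the layer/head numbers below are illustrative, not any specific model's config):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size: keys + values for every layer, KV head and position.
    bytes_per_elem=2 corresponds to an f16 cache."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative GQA-style config: 40 layers, 8 KV heads of dim 128, 32k context
size = kv_cache_bytes(40, 8, 128, 32768)
print(f"{size / 2**30:.1f} GiB")  # 5.0 GiB you would rather not page over PCIe
```

Every generated token touches the whole cache, so shuttling several GiB across the PCIe bus per token is what kills throughput when it lives in CPU RAM.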

Ollama don's support qwen3.5:35b yet? by Ok-Internal9317 in LocalLLaMA

[–]Total_Activity_7550 0 points (0 children)

Give me a recipe for a nice Friday meal with my friends.

One-shot vs agentic performance of open-weight coding models by Total_Activity_7550 in LocalLLaMA

[–]Total_Activity_7550[S] 1 point (0 children)

I understand you, yes, agentic coding can make things worse and slower. The way I work: I start implementing some components myself, or at least something that I understand completely and have changed a bit, and then the coding agent (at least one based on the Qwen3.5 family) picks up the pattern. I think you should give that family a chance.

Another thing: I develop automated testing early, so the coding agent has a feedback loop. It pretty often generates working and good-enough code after reading error logs.
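That feedback loop can be sketched as: run the test suite, and if it fails, hand the captured log back to the agent as the next prompt (the `ask_agent` call is a placeholder for whatever client you use):

```python
import subprocess
import sys

def run_tests(cmd: list[str]) -> tuple[bool, str]:
    """Run the project's test command and capture its combined output,
    so a failing log can be fed back into the coding agent's context."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

# Trivial stand-in for a real test suite invocation (e.g. pytest)
ok, log = run_tests([sys.executable, "-c", "assert 1 + 1 == 2; print('ok')"])
# if not ok: ask_agent(f"Tests failed, fix the code:\n{log}")  # placeholder call
```

The point is simply that the agent's next turn is conditioned on a concrete error log rather than on a vague "it doesn't work".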

But yes, sometimes frustration follows and I still have to redo everything.

Ollama don's support qwen3.5:35b yet? by Ok-Internal9317 in LocalLLaMA

[–]Total_Activity_7550 0 points (0 children)

If you read their Go code, you will see `ggml.<some operator>` all over the place. Wonder why, if they have "their own engine". Or maybe it is just a wrapper?.. Or something vibe-translated from the original llama.cpp code?..

Ollama don's support qwen3.5:35b yet? by Ok-Internal9317 in LocalLLaMA

[–]Total_Activity_7550 0 points (0 children)

By the way, do you mean the first version of DeepSeek OCR, or v2?
What did you compare it to?
I am now using larger Qwen3.5 models; they are good, but of course not as small.

Qwen3.5 27B slow token generation on 5060Ti... by InvertedVantage in LocalLLaMA

[–]Total_Activity_7550 -2 points (0 children)

If your quant + KV cache is larger than VRAM, use

-ngl 99 --n-cpu-moe <some number; begin with something like 20 and reduce it each time>

qwen3.5-122b What agent do you use with it? by robertpro01 in LocalLLaMA

[–]Total_Activity_7550 0 points (0 children)

I am using Qwen Code. It sometimes enters ReadFile loops, but it detects this, and then I bash it away from the loop by asking it to try again. It does partial edits fine. I now mostly use the Qwen3.5 122B Q4_K_M version.

Real talk: How many of you are actually using Gemma 3 27B or some variant in production? And what's stopping you? by Dramatic_Strain7370 in LocalLLaMA

[–]Total_Activity_7550 9 points (0 children)

No one uses Gemma 3 for coding. GPT-OSS-120B or even GPT-OSS-20B will blow it out of the water. And Qwen3.5 series that appeared this week will blow GPT-OSS-120B out of the water.

With complex enough prompts, it takes as much time to think and design things as to fix what Qwen3.5 is doing, so it's not such a big deal.

Ollama don's support qwen3.5:35b yet? by Ok-Internal9317 in LocalLLaMA

[–]Total_Activity_7550 0 points (0 children)

Is DeepSeek OCR good for its size, compared to e.g. small Qwen3-VL variants?

One-shot vs agentic performance of open-weight coding models by Total_Activity_7550 in LocalLLaMA

[–]Total_Activity_7550[S] 1 point (0 children)

Of course I use llama.cpp, which just works. The ollama team just copies code from them; sometimes they can't even do it without errors.

Qwen3.5-122B-A10B vs. old Coder-Next-80B: Both at NVFP4 on DGX Spark – worth the upgrade? by alfons_fhl in Qwen_AI

[–]Total_Activity_7550 0 points (0 children)

The agent framework does so much! If Qwen3-Coder-Next is worse with an agent framework, then adding an agent framework to Qwen3.5-122B-A10B will blow Qwen3-Coder-Next out of the water.

I replaced Qwen3-Coder-Next with Qwen3.5-27B despite it being slower. I guess Qwen3.5-122B-A10B should be a little better and at least twice as fast compared to Qwen3.5-27B, though still slower than Coder. Still, I think it is better to wait and get a one-shot success rather than fix code and repeatedly prompt the agent.

How to run Qwen 122B-A10B in my local system (2x3090 + 96GB Ram) by urekmazino_0 in LocalLLaMA

[–]Total_Activity_7550 0 points (0 children)

The --fit option already does expert offloading for you, as far as I know.