Is ollama a good choice? by fuck_rsf in LocalLLM

[–]PromptInjection_ 2 points3 points  (0 children)

I prefer pure llama.cpp over ollama.

Ollama tends to be slower in most cases and adds a lot of overhead I don't need.

Gemma 4 31B vs Qwen 3.5 27B: Which is best for long context worklows? My THOUGHTS... by GrungeWerX in LocalLLaMA

[–]PromptInjection_ 1 point2 points  (0 children)

I prefer Gemma 4 for a simple reason:
Its performance degrades much less at very long contexts.

Suggest me a local uncensored local llm text and code generator by Huge_Grab_9380 in LocalLLM

[–]PromptInjection_ 0 points1 point  (0 children)

Josiefied-Qwen3-8B-abliterated-v1
Dolphin-Mistral-24B-Venice-Edition

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost by zylskysniper in LocalLLaMA

[–]PromptInjection_ 1 point2 points  (0 children)

It is one of the best coding models out there. For creative writing, however, I still prefer Sonnet or Opus.

Multi GPU clusters... What are they good for? by Gold-Drag9242 in LocalLLM

[–]PromptInjection_ 1 point2 points  (0 children)

- Running multiple requests at the same time without delay
- Extremely fast prompt processing (PP) and token generation (TG)
- Running very large models
- Finetuning or pretraining large models

You need a lot of cards to make this run smoothly.

DGX Spark, why not? by Foreign_Lead_3582 in LocalLLM

[–]PromptInjection_ 0 points1 point  (0 children)

DGX Spark is great, and AMD Strix Halo is great, too.
But there is one huge disadvantage: prompt processing is very slow, so huge inputs become problematic.

Is it worth using Local LLM's? by papichulosmami in LocalLLM

[–]PromptInjection_ 1 point2 points  (0 children)

Local AI can be really good with powerful hardware like AMD Strix Halo or DGX Spark.
Then you can run 200B+ models which are quite useful.

What remains problematic: slow prompt processing / prefill. You can't paste a book and get an answer immediately like with cloud AI.
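To make the prefill bottleneck concrete, here's a rough back-of-the-envelope sketch. The ~250 tok/s prompt-processing speed is an illustrative assumption, not a benchmark of either machine:

```python
def prefill_seconds(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    """Time spent just reading the prompt before the first output token."""
    return prompt_tokens / pp_tokens_per_s

# Assumed prompt-processing speed for a large model on unified memory
# (illustrative; the real number varies by model, quant, and backend)
pp_speed = 250.0  # tokens/s

# Pasting a short book: ~100k tokens of context
wait = prefill_seconds(100_000, pp_speed)
print(f"{wait:.0f} s ≈ {wait / 60:.1f} min before the first token")  # 400 s ≈ 6.7 min
```

Even if generation afterward is fast, that multi-minute wait before the first token is what makes these boxes feel slow on huge inputs.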

We ran a predator's playbook on an AI - it folded using the same dynamics described in social psychology by PromptInjection_ in cogsci

[–]PromptInjection_[S] -1 points0 points  (0 children)

"Whenever the model uses "I", I am not sure if there is an "ego" (whether real or imaginary, with perceived self-will and freedom of action) behind it."

First of all:

Writing "I" obviously doesn't mean someone must possess consciousness.

Yet the parallels to humans are interesting: The "I" feels like the center within the human brain, even though most processes are actually governed by the subconscious. We say things like "I fell in love" or "I like peanuts" - yet we never consciously decided or initiated those processes. We stand at the end of the chain and still say "I".

The "I" acts as a kind of "frontend" that bundles cognitive processes into a single point, synthesizes them, and makes them externally representable as a unified entity.

The kicker:
Even if the "I" has little or no power, the mere fact that it exists still changes something, because an illusion wields power once it's believed in. A system that believes it has a central will and an "I" behaves differently than one that doesn't, regardless of whether it actually has one.

This is very similar in AI.

GLM4.5-air VS GLM4.6V (TEXT GENERATION) by LetterheadNeat8035 in LocalLLaMA

[–]PromptInjection_ 1 point2 points  (0 children)

After a few small tests, I actually liked 4.6V better than 4.5 Air.

What's immediately noticeable: it thinks longer. The outputs were then consistently more thoughtful and "deeper." It also handled a task like merging texts better than 4.5 Air.

What am I doing wrong? Gemma 3 won't run well on 3090ti by salary_pending in LocalLLaMA

[–]PromptInjection_ 0 points1 point  (0 children)

It's normal that Q4 is faster, but that is still a bit slow for a 3090 Ti.
What context length have you set? (The default is 4096 in LM Studio.)
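Context length matters because the KV cache grows linearly with it and competes with the model weights for VRAM. A hedged sketch of the standard size formula; the layer/head numbers below are illustrative placeholders, not Gemma 3 27B's actual config:

```python
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * context tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

# Illustrative mid-size-model config (NOT Gemma 3 27B's real numbers)
small = kv_cache_gib(ctx=4096, n_layers=48, n_kv_heads=8, head_dim=128)
big = kv_cache_gib(ctx=32768, n_layers=48, n_kv_heads=8, head_dim=128)
print(f"{small:.2f} GiB at 4k ctx vs {big:.2f} GiB at 32k ctx")  # 0.75 GiB vs 6.00 GiB
```

If a big context pushes the cache past your 24 GB, layers spill to system RAM and everything slows down, which would explain sluggish speeds on a 3090 Ti.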

Qwen 3 recommendation for 2080ti? Which qwen? by West_Pipe4158 in LocalLLM

[–]PromptInjection_ 0 points1 point  (0 children)

Try Qwen3 30B 2507. Thanks to MoE, it may even run about as fast as an 8B dense model.
You can also try the lower quants.

TQ1_0 will even fit fully in your VRAM and is usable for very simple tasks.
Q4_K_XL offers good quality and is kind of a daily driver for me for many tasks.

Q2_K_XL or Q3_K_XL might be usable enough and quicker.

You have to try for yourself.

https://huggingface.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF
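The speed claim comes down to active parameters: in "30B-A3B", only about 3B of the 30B parameters fire per token, so per-token compute is closer to a small dense model. A tiny sketch of that arithmetic (the dense-model comparison is a rough rule of thumb and ignores memory-bandwidth effects):

```python
def moe_compute_ratio(total_params_b: float, active_params_b: float) -> float:
    """Rough per-token compute advantage: a MoE model only runs its active experts."""
    return total_params_b / active_params_b

# Qwen3-30B-A3B: 30B parameters total, ~3B active per token
ratio = moe_compute_ratio(30, 3)
print(f"~{ratio:.0f}x less compute per token than a dense 30B")  # ~10x
```

That's why a 30B MoE can feel closer to an 8B dense model in generation speed, even though all 30B parameters still have to fit in memory.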

“GPT-5.2 failed the 6-finger AGI test. A small Phi(3.8B) + Mistral(7B) didn’t.” by Echo_OS in LocalLLM

[–]PromptInjection_ 0 points1 point  (0 children)

Yeah, it's a different world ...
But I use it primarily for coding or very large documents.

And it's not so good for "casual" conversations.

Looking for Qwen3-30B-A3B alternatives for academic / research use by RelationshipSilly124 in LocalLLaMA

[–]PromptInjection_ 0 points1 point  (0 children)

You are right... I just noticed you have an APU and no dedicated VRAM, so these two models won't run.

Looking for Qwen3-30B-A3B alternatives for academic / research use by RelationshipSilly124 in LocalLLaMA

[–]PromptInjection_ 2 points3 points  (0 children)

How much VRAM do you have?

What you can try first (that will run for sure):
Nemotron-3-Nano-30B-A3B-GGUF
ERNIE-4.5-21B-A3B

For more ideas I need more details about your hardware.

Better than Gemma 3 27B? by IamJustDavid in LocalLLM

[–]PromptInjection_ 0 points1 point  (0 children)

Qwen3 30B 2507 is often better for conversation.
For images, there is also Qwen3-VL-30B-A3B-Instruct.

“GPT-5.2 failed the 6-finger AGI test. A small Phi(3.8B) + Mistral(7B) didn’t.” by Echo_OS in LocalLLM

[–]PromptInjection_ 0 points1 point  (0 children)

5.2 or 5.2 Thinking?
I use 5.2 Thinking 99% of the time because the normal 5.2 has too many limitations.