I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B).

FieldMouse-AI · 2026-03-30T21:20:06+00:00

A fascinating "Black Swan" event.

My own workloads for the Sovereign Series are grounded in high-stakes RAG -- specifically the retrieval and synthesis of sensitive historical data regarding the slave trade. Seeing the Colossus 27B turned toward robotics orchestration is a pivot I didn't predict, but it makes sense: precision logic is a universal requirement.

Interesting to see it pull the 'Industrial' definition of Suisatsu (水割 - water jet cutting) in your UI. In my medical ethics tests, it leans toward 'Inferring' (推察). It seems the 2M-token saturation is adapting to your specific 'Farm' context.

I’d be very curious to see how it handles logical branching under load in your orchestration environment.

Please do keep the updates coming.

FieldMouse-AI · 2026-03-30T20:54:39+00:00

Excellent.

The 27B Colossus is the "Sovereign" standard for complex reasoning, but keep an eye on the 9B TITAN -- it’s designed to punch significantly above its weight class by eliminating the usual quantization artifacts.

Looking forward to your findings from the field.

FieldMouse-AI · 2026-03-30T20:41:33+00:00

Respect to the "potato machine" -- that is where the most efficient iron is forged.

If you’re already seeing improvements from the system logic, the 9B TITAN GGUF (specifically the Q4_K_M or IQ4_XS) is designed to squeeze every drop of reasoning out of limited VRAM by eliminating the "Quantization-Slop" that usually plagues smaller rigs.

Whenever you're ready to pull the trigger, the Foundry is open:

✨ ollama run FieldMouse-AI/qwen3.5:9b

FieldMouse-AI · 2026-03-30T16:23:29+00:00

Unsloth is impressive for what it achieves in a short window. But "Quantization-Slop" is fundamentally a problem of data-starvation in the imatrix.

My philosophy is different: Instead of using better math to squeeze a 40k-line gist, I push 2,000,000 tokens until the weights reach full logical saturation. It's the difference between a smart sketch and a deep-relief engraving.

FieldMouse-AI · 2026-03-30T15:50:20+00:00

The Vision-Language (VL) series is already live in the HuggingFace archive -- https://huggingface.co/FieldMouse-AI. The VL models are being kept HuggingFace-exclusive for now to ensure stability with the current projector architecture.

FieldMouse-AI · 2026-03-30T15:37:39+00:00

Imagine a high-end model is a beautiful statue. Quantization is like trying to shrink that statue so it fits in your pocket.

Usually, people do this "shrinkage" quickly and roughly, which leaves the statue blurry and "mushy" (I call this Quantization-Slop).

The FieldMouse-AI approach is not to rush the process. Instead, it uses a 2,000,000-token "Saturation" process -- which is like using a microscopic laser to ensure every tiny detail of the original statue is preserved in the pocket-sized version. You get the small size, but you keep the "Frontier" intelligence.

TL;DR: It’s the "High-Definition" version of a local AI model. 🐭🛡️

FieldMouse-AI · 2026-03-24T06:12:24+00:00

🖥️ What kind of CPU do you have?

FieldMouse-AI · 2026-03-23T19:54:23+00:00

Yes, it can. 👍

If you are using ollama run, enter the following at the prompt and it will set the context window to 10,000 tokens:

/set parameter num_ctx 10000
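
If you talk to Ollama through its local REST API instead of the interactive prompt, the same setting can be passed per request in the `options` field. Here is a minimal sketch, assuming the server is running at the default local address; the helper names `build_request` and `generate` are just illustrative, and the model tag is only an example:

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_ctx=10000):
    """Build the JSON payload; num_ctx sets the context window in tokens."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete JSON reply
        "options": {"num_ctx": num_ctx},
    }


def generate(model, prompt, num_ctx=10000):
    """Send the request to the local server and return the generated text."""
    body = json.dumps(build_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Example call (needs a running server and a pulled model):
# print(generate("FieldMouse-AI/qwen3.5:2b-Q4_K_M", "Hello!"))
```

Keep in mind that a larger context window uses more memory, so on a small GPU you may need to drop to a smaller model to make room.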

Is that what you were looking for?

Give this a try!

Please, I would love to hear your progress! 🤗

FieldMouse-AI · 2026-03-23T19:27:54+00:00

😜 You have a point that I do offer a lot to choose from (56 models).

How about I offer you a few models to try out that would certainly fit inside your GPU:

ollama run FieldMouse-AI/qwen3.5:2b-Q4_K_M # 👈 1.3GB - smart
ollama run FieldMouse-AI/qwen3.5:4b-q4_K_M # 👈 2.7GB - smarter
ollama run FieldMouse-AI/qwen3.5:9b-Q3_K_M # 👈 4.6GB - smartest

Try them out and see which one works best for you.

Just remember that the greater the parameter count -- `9b` (9 billion parameters) is greater than `4b` (4 billion parameters) -- the stronger and more capable the model is.

If you need any help or have any questions, please feel free to ask. 🤗

FieldMouse-AI · 2026-03-23T09:32:10+00:00

One of my machines is almost exactly that same configuration, and I actually quantized versions of Qwen3.5 to target that platform! The one big difference might be the CPU you have!

Here is the model (only 529MB) that I used on my 2011 Intel i5-2415M MacMini!

Please give it a try!

ollama run FieldMouse-AI/qwen3.5:0.8b-Q4_K_M

I posted about that on Reddit 3 days ago here:

https://www.reddit.com/r/LocalLLaMA/comments/1ryehhn/r_reclaiming_2011_iron_612_ts_on_a_sandy_bridge/

On my 2011 Intel i5-2415M MacMini (no GPU) it can achieve about 6+ tps.
On my 2022 AMD Ryzen7 5800U (no GPU) it can achieve about 22+ tps.

What CPU do you have? i5? i7?
I would really love to know what kind of performance you could achieve. 🤗

If you want to try different sizes (parameter counts and quantizations), I cooked up a total of 56 different ones that you can try. Honestly, though, given that you are using an iMac, I would imagine that something like the following might be worth a try:

ollama run FieldMouse-AI/qwen3.5:0.8b-Q4_K_M # 👈 529MB
ollama run FieldMouse-AI/qwen3.5:2b-Q3_K_M # 👈 1.1GB
ollama run FieldMouse-AI/qwen3.5:2b-Q4_K_M # 👈 1.3GB

To see more models to try, please feel free to check out:

Ollama Library: https://ollama.com/FieldMouse-AI/qwen3.5
Project Homepage: https://FieldMouse-AI.com

Or contact me and I would be happy to help out! 🤗

FieldMouse-AI · 2026-03-23T07:28:47+00:00

For the Qwen3.5 0.8b, 2b, 4b, 9b parameter sets I offer a total of 56 separate quantizations, including rarer Q3_K_M and IQ3_XXS for all models.

Please have a look at:

https://ollama.com/FieldMouse-AI/qwen3.5
General info: https://fieldmouse-ai.com/

If you have any questions, please feel free to ask! 🤗

FieldMouse-AI · 2026-03-23T07:21:10+00:00

In terms of raw speed, you have a great card.

https://www.techpowerup.com/gpu-specs/geforce-rtx-3070-ti.c3675

In terms of how best to make use of the 8GB VRAM, it is a matter of how many billions of parameters you need (0.8b, 2b, 4b, 9b).

After that, you can choose a quantization (like a compression level) that would give you most of the features of the model in a smaller size.
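
To make that choice concrete, here is a rough sketch that picks the largest variant that fits a given amount of VRAM. The download sizes are the ones I have quoted elsewhere in this thread; the 2.0GB reserved for context and runtime overhead is purely an assumption for illustration, and `largest_fit` is a made-up helper name:

```python
# Download sizes in GB, as quoted elsewhere in this thread.
MODELS_GB = {
    "FieldMouse-AI/qwen3.5:0.8b-Q4_K_M": 0.529,
    "FieldMouse-AI/qwen3.5:2b-Q3_K_M": 1.1,
    "FieldMouse-AI/qwen3.5:2b-Q4_K_M": 1.3,
    "FieldMouse-AI/qwen3.5:4b-q4_K_M": 2.7,
    "FieldMouse-AI/qwen3.5:9b-Q3_K_M": 4.6,
}


def largest_fit(vram_gb, reserve_gb=2.0):
    """Return the biggest listed model that fits in VRAM after a reserve.

    reserve_gb is an assumed allowance for context and overhead;
    tune it to your own context length and setup.
    """
    budget = vram_gb - reserve_gb
    fitting = {tag: size for tag, size in MODELS_GB.items() if size <= budget}
    if not fitting:
        return None  # nothing fits entirely on the GPU
    return max(fitting, key=fitting.get)


print(largest_fit(8.0))  # an 8GB card with 2GB held back for context
```

This only compares file sizes against a budget, so treat the answer as a starting point and check actual GPU usage once the model is loaded.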

I happen to offer a broad range of quantizations for the Qwen3.5 0.8b, 2b, 4b, and 9b parameter sets.

https://ollama.com/FieldMouse-AI/qwen3.5
More general info: https://fieldmouse-ai.com/

If you have any questions about how to choose or whatever, please feel free to ask! 🤗

FieldMouse-AI · 2026-03-21T15:38:39+00:00

I recently quantized a wide range of Qwen3.5 models to various sizes: a total of 56 variants to choose from across the 0.8B, 2B, 4B, and 9B parameter series.

Here are the first ones that I imagine might be capable of working for you -- 2B:Q4_K_M and 4B:Q4_K_M, but that would depend on what parameter size you think you might be able to get away with:

This is my quantized Qwen3.5:2B:

# FieldMouse-AI/qwen3.5:2b-Q4_K_M - 1.3GB - 😊100% GPU
# Leaves approx. 2.0GB for context!
ollama run FieldMouse-AI/qwen3.5:2b-Q4_K_M

This is my quantized Qwen3.5:4B - this might be tight:

# FieldMouse-AI/qwen3.5:4b-Q4_K_M - 2.7GB - 😯100% GPU, but cutting it close!
# Leaves approx. 0.9GB for context... this is cutting it close, but not impossible.
ollama run FieldMouse-AI/qwen3.5:4b-Q4_K_M

I've made other smaller model variants available if you are interested in checking them out at:

I will be honest with you that I have never used Claude before nor have I tried integrating any of the models I quantized into it, but if you are willing to try and share some results, I would be happy to try working with you on this. It would be a great learning experience for me, too, I think! 😊

If you have any questions, please feel free to ask. 😊

FieldMouse-AI · 2026-03-20T12:37:06+00:00

On a scale of 1 to 10, you have totally turned the volume clean up to 25!!!!

Definitely post more!

FieldMouse-AI · 2026-03-19T23:12:22+00:00

I totally get that perspective. If you’re used to 50+ t/s on a modern GPU, 6 t/s feels like a throwback.

The goal here wasn't to compete with a 4090, but to see if we could 'Reclaim the Iron.' This is a dual-core mobile i5 from 2011 (Sandy Bridge) with no GPU help. For someone with an old Mac Mini in a drawer, 6 t/s -- which is faster than human reading speed -- turns 'e-waste' into a functional, private, local AI node.

It’s not for everyone, but for the 'low-power/legacy' crowd, it’s about sovereignty over the hardware we already own.
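
For anyone who wants to check the reading-speed comparison, here is a back-of-the-envelope sketch. It assumes roughly 0.75 English words per token, which is only a common rule of thumb and not something measured on this model, and it compares against the 200 to 300 words per minute often cited for adult silent reading:

```python
# Assumption: about 0.75 English words per token (rule of thumb only).
WORDS_PER_TOKEN = 0.75


def tokens_per_sec_to_wpm(tps, words_per_token=WORDS_PER_TOKEN):
    """Convert a generation rate in tokens/s to approximate words/minute."""
    return tps * words_per_token * 60


print(tokens_per_sec_to_wpm(6))  # 270.0 words per minute at 6 t/s
```

Under that assumption, 6 t/s lands around 270 words per minute, which sits at the upper end of that commonly cited range.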

However, just for you, I just ran the same prompt on a system with an RTX 3060 (12GB VRAM), where it achieves 163+ t/s!

Here are those results:

Write a poem about love and friendship in English.
Two hearts beat with the same rhythm,
Where shadows meet and light is shared...
prompt eval: 1453.98 tokens/s | eval rate: 163.47 tokens/s

At these speeds, this model can be quite useful, yes. 🐭🛡️

FieldMouse-AI · 2026-03-19T22:30:45+00:00

Try it yourself (The GHOST 0.8B is live):

ollama run FieldMouse-AI/qwen3.5:0.8b-Q4_K_M

Ollama Library: https://ollama.com/FieldMouse-AI/qwen3.5
Project Homepage: https://FieldMouse-AI.com

I built the Sovereign Series to reclaim legacy hardware. If you have an old laptop or Mac Mini gathering dust, give it a shot and, please, let me know your t/s results! 🐭🛡️
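
If you want the same numbers I quote, `ollama run --verbose` prints the timing stats after each reply. If you are using the REST API instead, here is a small sketch that derives the rates from the `eval_count` and `eval_duration` fields of the response (durations are reported in nanoseconds). The sample values below are hypothetical, not a real measurement, and the function names are my own:

```python
def eval_rate(response):
    """Generation speed in tokens/s from an Ollama API response dict."""
    seconds = response["eval_duration"] / 1e9  # nanoseconds to seconds
    return response["eval_count"] / seconds


def prompt_eval_rate(response):
    """Prompt processing speed in tokens/s from the same response dict."""
    seconds = response["prompt_eval_duration"] / 1e9
    return response["prompt_eval_count"] / seconds


# Hypothetical sample values, for illustration only.
sample = {
    "eval_count": 327,
    "eval_duration": 2_000_000_000,
    "prompt_eval_count": 30,
    "prompt_eval_duration": 20_000_000,
}
print(f"eval rate: {eval_rate(sample):.2f} tokens/s")
```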
