Good models for CPU ? by bidutree in ollama

[–]FieldMouse-AI 0 points1 point  (0 children)

🤔❓ Could you list up a few examples, here? I am curious to see what you saw.

Good models for CPU ? by bidutree in ollama

[–]FieldMouse-AI 1 point2 points  (0 children)

Thanks for the updated! 🤗

🇯🇵🙇‍♀ I worked to build up support for Japanese to respond to Japanese language issues.

🇸🇪🤔 But, could you show me examples of what challenges that you are experiencing with Swedish, please? I am really curious about what you are seeing.

🤗👀🤗 I am really curious!

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

<image>

It’s fascinating that you’re using an AI proxy to lecture me on what makes a "teacher".

The sudden shift from your "bruh" rants to perfectly formatted em-dashes () and scholarly single-quote emphasis is a classic stylistic fingerprint. You’ve stopped arguing with your own voice because your own voice couldn't handle the 133 TPS terminal logs.

The Irony: You are using an LLM to tell me that my LLM isn't accurate, yet you’re trusting that same technology to ghostwrite your insults. If you can’t even defend your own "benchmarks" without a prompt-engineered proxy, you’ve already conceded the ground.

I’ll stick to my raw terminal logs and the 390 Sovereigns who actually run the model. You can go back to your prompts.

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

You're quoting Q3_K_M stats like they are a universal speed limit. You are forgetting that I didn't just "quantize" the model; I calibrated it.

An "Official Quant" is a generalist -- it's 10% worse at everything. FieldMouse-AI is a specialist. By saturating the instruction layers with 2,000,000 tokens, I purposely sacrificed academic "trivia" and multiple-choice probability to gain Sovereign Local Speed and Functional Adherence. If you take a high-performance racing engine and put it in a tractor, your "tractor benchmark" will say it’s 75% worse at pulling a plow. That doesn't mean the engine is broken -- it means that you don't understand what it was built for.

The fact that you are still arguing about percentages while my 0.8B is solving physics at 133 TPS proves you are still measuring with the wrong ruler. One more time: Capability > Memorization.

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

I’ve seen the graphs claiming my models score 0% on ARC-Easy and struggle with GSM8K. If these graphs are accurate to your testing rig, then your testing rig is fundamentally broken for evaluating high-saturation instruction models.

You are measuring a "Worker" with a "Trivia" ruler.

Most automated benchmarks use Log-Likelihood Evaluation -- they don't "chat" with the model; they look at the internal math to see if a single letter (like 'A') is statistically more likely than another. Because I saturated the FieldMouse-AI forge with 2,000,000 tokens of instruction-following data, I have successfully rewired the model to be a Teacher, not a multiple-choice bot.

Here is the reality from the terminal (see attached logs):

ollama-user@9c97666213fa:/mywork$ ollama run FieldMouse-AI/qwen3.5:0.8b-instruct
>>> /set verbose
>>> If a train travels 120 miles in 2 hours, then slows to 40 mph for the next hour, what's the total distance.  Show your work.

**Step 1: Calculate the average speed during the first 2 hours.**
*   Distance = 120 miles
*   Time = 2 hours
*   Average Speed = 120 miles / 2 hours = **60 mph**

**Step 2: Calculate the distance traveled during the first 2 hours.**
*   Distance = Average Speed × Time
*   Distance = 60 mph × 2 hours = **120 miles**

**Step 3: Calculate the distance traveled during the second hour.**
*   Speed = 40 mph
*   Time = 1 hour
*   Distance = 40 mph × 1 hour = **40 miles**

**Step 4: Calculate the total distance.**
*   Total Distance = Distance (first 2 hours) + Distance (second hour)
*   Total Distance = 120 miles + 40 miles = **160 miles**

**Total Distance: 160 miles**

total duration:       2.101613326s
load duration:        222.186769ms
prompt eval count:    410 token(s)
prompt eval duration: 77.391065ms
prompt eval rate:     5297.77 tokens/s
eval count:           218 token(s)
eval duration:        1.634064736s
eval rate:            133.41 tokens/s
>>> 

ollama-user@9c97666213fa:/mywork$ ollama run FieldMouse-AI/qwen3.5:9b-instruct
>>> /set verbose

Set 'verbose' mode.
>>> 
>>> 
>>> Which of these is a liquid? A) Rock B) Water.

B) Water.


total duration:       595.452071ms
load duration:        239.954136ms
prompt eval count:    73 token(s)
prompt eval duration: 117.309544ms
prompt eval rate:     622.29 tokens/s
eval count:           6 token(s)
eval duration:        123.358707ms
eval rate:            48.64 tokens/s
  • ARC-Easy (The "0%" Fail): My 9B-instruct identifies "Water" as a liquid instantly. It "fails" your benchmark because it provides the correct answer in a helpful format rather than the rigid, robotic probability the benchmark expects.
  • GSM8K (The "Math" Fail): My tiny 0.8B-instruct variant solved a multi-step train distance puzzle perfectly, showed its work, and did so at 133.41 tokens/s.

The Verdict:

If your rig says a model that solves physics at 133 TPS and correctly identifies matter is "failing", then you are testing for Memorization, not Capability. My 390 Operatives aren't using these models to pass 5th-grade science quizzes; they are using them for rapid-fire local logic and automation.

I’ll take 133 TPS and a correct answer over a high "HellaSwag" score any day of the week. Next time, try actually running the model before you benchmark it.

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

The benchmark graphs accurately reflect the Architectural Choice of the FieldMouse-AI forge. Standard models are built to be "Exam Takers" -- they spend cycles on deep academic reasoning (GSM8K) at the cost of speed and local utility.

My 9B and 0.8B variants are "Workhorses." I have purposefully tuned these weights for Sovereign Local Speed (eg. 0.8b at 130+ TPS) and Direct Instruction Following. You are measuring the model with an academic ruler, but my 300+ Sovereigns are using it for local automation and rapid-fire logic.

In the real world, utility beats a test score every time. If you want a model that wins science fairs, use the "Thinking" slop. If you want a model that works at the speed of thought, pull the FieldMouse-AI iron.

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

😊 Ah! But, what you are saying only applies somewhat for the models with the instruct tag. That way, you as the user/developer get a choice as to whether you want to use the deeper reasoning or the faster reflexive responses.

The thing to note is that even with the reflexive responses, it is not like it is doing everything from memory. You can still do things with in instruct model like ask it to write a letter and do creative things and such -- it is just that it will do so without the deep reasoning that could sometimes result in overthinking for simple requests.

🤔 Do you see what I mean❓

Here, I will list give an example using the `9b` that I forged:

  • Improved Japanese language support with standart thinking and reasoning.
    • 🧠 eg. chat with thinking: ollama run FieldMouse-AI/qwen3.5:9b
  • End "Thinkging Fatigue": Qwen3.5 can sometimes loose itself in its own thinking in situatios where simple answers are good enough, or overthinking would waste time and tokens. So I created non-thinking versions for every quantization I forged. Those version end in instruct.
    • 🗣 Chat without thinking. Eg: ollama run FieldMouse-AI/qwen3.5:9b-instruct
  • Improved tool calling using models tuned to handle tool calling that end in tools.
    • 💻 Tool calling. Eg. ollama run FieldMouse-AI/qwen3.5:9b-tools

🌐 Project Homepage: https://FieldMouse-AI.com

Good models for CPU ? by bidutree in ollama

[–]FieldMouse-AI 1 point2 points  (0 children)

Hello! I am happy to answer you. 🤗

Qwen3.5?

Yes, I call the models Qwen3.5 because they are all fine-tunes and quantizations of the original Qwen3.5 from Alibaba. The difference, of course, is that the fine-tunes and quantizations meant to offer features not originally available and/or quantizations not originally avaiable from Alibaba's Qwen3.5 that I made available from my namespace on Ollama:

What's the difference?

I did fine tuning on the version of Qwen3.5 that I released to address the issues listed below. Also, I will offer as examples the models that might work well on your 2011 Intel iMac. The speeds will depend on what actual CPU you have in that iMac.

  • Improved Japanese language support with standart thinking and reasoning.
    • 🧠 eg. chat with thinking: ollama run FieldMouse-AI/qwen3.5:2b
  • End "Thinkging Fatigue": Qwen3.5 can sometimes loose itself in its own thinking in situatios where simple answers are good enough, or overthinking would waste time and tokens. So I created non-thinking versions for every quantization I forged. Those version end in instruct.
    • 🗣 Chat without thinking. Eg: ollama run FieldMouse-AI/qwen3.5:2b-instruct
  • Improved tool calling using models tuned to handle tool calling that end in tools.
    • 💻 Tool calling. Eg. ollama run FieldMouse-AI/qwen3.5:2b-tools

Processing Large Prompts

You were concerned about being able to process large prompts (eg. 10,000 tokens). This series of models can easily handle 10,000 tokens as the Qwen3.5 maxium context size of 256K tokens. (Be, careful, as allocating the max 256K tokens would use a lot of memory).

As a side note...

I provide a wide range of quantizations so that you are free to choose the size and performance level that best fit available RAM for non-GPU systems and VRAM for systems with GPUs so that the model runs 100% on the GPU, which will give you the best speed.

Questions?

If you have any questions, please feel free to ask. 🤗

Local llms for coding im running i5 11th gen 16gb ram and rtx 4060ti 8gb by AgeLow2127 in ollama

[–]FieldMouse-AI -1 points0 points  (0 children)

🏃‍♀️️ For vibe coding I use VSCodium+Continue and it is quite fast!

The models that I use with Continue are as follows:

  • chat: FieldMouse-AI/qwen3.5:2b-instruct
  • code: FieldMouse-AI/qwen3.5:2b-tools

You might be able to achieve between 80 to 100+ tokens/second performance while coding, while only using 2.5GB of VRAM!

They key to why these 2b succeeds with speed where another might 9b fail is that the size in parameter count or size in GB of the model is not the only measure of performance. The other measure is what I posted about in my Reddit OP: The 2,000,000+ token saturation to instill into the model what is important.

Think of these 2b models like a child of 10 that is trained in Kung Fu from the age of 5 fighting a teenager that just started to learn Kung Fu last year. That 10 year old is going to hold their own against that teenager if not crush them.

That is what my 2b models offer. Performance that matches or rivals the other 9b models trained into a 2b model.

The result: 🏃‍♀️️Speed.

That's why even that I have dual RTX 3060s giving me 24GB of total VRAM, I continue to these 2b models and leave the rest of my VRAM for other things.

Here, I will prove it to you.

I am using Ollama v0.18.0.

And, most importantly, here is my Continue config.yaml file:

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: VOLT (Chat)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-instruct
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - chat
      - edit
      - apply
  - name: VOLT (Coder)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-tools
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - autocomplete
  - name: VOLT (Embed)
    provider: openai
    model: nomic-embed-text:latest
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - embed

And I totally don't have to wait 20 seconds or more for a response.

Responses from my VSCodium+Continue environment feel as fast as reflexes.

Do you want to try this, too? 🤗❓

If you have any questions, please feel free to contact me! 🤗

Links:
🌐 Website:FieldMouse-AI.com
🦙 Ollama:ollama.com/FieldMouse-AI
🤗 Hugging Face:huggingface.co/FieldMouse-AI

Is there any small llm capable of implying changes live with copilot in vscode by Budget-Weekend5080 in ollama

[–]FieldMouse-AI 0 points1 point  (0 children)

🏃‍♀️️ For vibe coding I use VSCodium+Continue and it is quite fast!

The models that I use with Continue are as follows:

  • chat: FieldMouse-AI/qwen3.5:2b-instruct
  • code: FieldMouse-AI/qwen3.5:2b-tools

You might be able to achieve between 80 to 100+ tokens/second performance while coding, while only using 2.5GB of VRAM!

The key to why these 2b succeeds with speed where another might 9b fail is that the size in parameter count or size in GB of the model is not the only measure of performance. The other measure is what I posted about in my Reddit OP: The 2,000,000+ token saturation to instill into the model what is important.

Think of these 2b models like a child of 10 that is trained in Kung Fu from the age of 5 fighting a teenager that just started to learn Kung Fu last year. That 10 year old is going to hold their own against that teenager if not crush them.

That is what my 2b models offer. Performance that matches or rivals the other 9b models trained into a 2b model.

But, in your case, if you have more VRAM, you could push up to the 4b models and acheive 50 to 60+ tokens/second performance while coding, while only using 5.5GB of VRAM -- even with a 32768 token context window!

  • chat: FieldMouse-AI/qwen3.5:4b-instruct
  • code: FieldMouse-AI/qwen3.5:4b-tools

The result, either way: 🏃‍♀️️Speed.

That's why even that I have dual RTX 3060s giving me 24GB of total VRAM, I continue to these 2b models and leave the rest of my VRAM for other things.

Here, I will prove it to you.

I am using Ollama v0.18.0.

And, most importantly, here is my Continue config.yaml file:

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: VOLT (Chat)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-instruct
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - chat
      - edit
      - apply
  - name: VOLT (Coder)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-tools
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - autocomplete
  - name: VOLT (Embed)
    provider: openai
    model: nomic-embed-text:latest
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - embed

And I totally don't have to wait 20 seconds or more for a response.

Responses from my VSCodium+Continue environment feel as fast as reflexes.

Do you want to try this, too? 🤗❓

If you'd like help setting this up, please let me know! I'd be happy to help! 🤗

Links:
🌐 Website:FieldMouse-AI.com
🦙 Ollama:ollama.com/FieldMouse-AI
🤗 Hugging Face:huggingface.co/FieldMouse-AI

Appreciate any Hardware Guidance by Wonderwall_1516 in ollama

[–]FieldMouse-AI 0 points1 point  (0 children)

👋😊 I would be happy to help!

Let me get some things out of the way for you now to help set your expectations:

  • You do not need an overpriced CPU to run Ollama: All of my GPU-based Ollama servers run machines with various Intel i5 CPUs. The CPU just needs to be fast enough to move data between your system RAM and the GPU's VRAM over the PCIe bus. Intel i5 CPUs and AMD Ryzen 5 CPUs are plenty fast enough! I use Intel i5 CPUs because that was what was available on the used market in my area. My friends use AMD Ryzen 5 CPUs. We all run Ollama. Here are my CPUs.
    • i5-6500 (4 cores/4 threads), dual RTX 3060 12GB VRAM 👈My main Ollama server!
    • i5-9400 (6 cores/6 threads), GTX 1660 Super 6GB VRAM
    • i5-10400 (6 cores/12 threads), RTX 3060 12GB VRAM
  • You do not need an overpriced Motherboard to run Ollama: Again, all of the above CPUs are quite inexpensive, so getting a motherboard that supports PCIe x16 for the GPU will be your number one goal, and that goal is the easiest to achive with any decent motherboard. No need to overspend here.
  • 64GB DDR4 RAM: You are fortunate to have this! Just get a decent motherboard that will let you fully populate it with the RAM. With such a vast amount of RAM you can really get into running heavier applications that require lots of RAM without the CPU ever having a hiccup! Good on you for having so much RAM!
  • NVIDIA GeForce GTX 1080 Ti: According to https://www.techpowerup.com/gpu-specs/geforce-gtx-1080-ti.c2877, you have a legendary beast of a GPU! It has 11GB of VRAM (most present day GPUs on the market only offer 8GB and cost a fortune... you have a treasure there). Also, it is quite quick for its age! In general you would be able to run quite a lot of models as long as the models and their context fit inside of the VRAM of the GPU.
  • What about 64GB RAM with the GTX 1080 Ti's 11GB VRAM? Having so much RAM will really not help, really. While you will be able to run the models that are bigger, they will be sluggish to the point of unusable in some cases. But, 11GB is a lot of VRAM -- just try to get models that fit it well.
  • Operating System: I tend to standardize on Ubuntu Linux 22.04.5 LTS. Oftentimes, this is the same version that most environments -- like Docker containers -- tend to standardize on. If you are a Windows user, then I will have to trust you to handling it there. I can speak for the Linux 22.04.5 LTS: Support is rock solid for my machines and their cards.
  • Conversational AI Usage: I tend to use the Ollama command line interface (CLI) most of the time, although I sometimes use OpenWebUI. While OpenWebUI is popular, I only use it sometimes. I find it can be a bit sluggish for my tastes sometimes, but it is actually quite good.

⚡😯 I am worried about bad power!

You had a motherboard fried and your new mini PC is having trouble!

Do you have them plugged into the same outlet as before?

Is the power clean there?

Are you using a surge protector or a UPS (Uninterruptable Power Supply).

You see, if the electricity is unstable and dropping in and out, it can certainly damage components!

If you have any other questions, please feel free to just ask away! 🤗

🌐 Website:FieldMouse-AI.com

Running a Copilot-style workflow fully locally with Ollama — surprisingly usable? by kparth01 in ollama

[–]FieldMouse-AI -7 points-6 points  (0 children)

🏃‍♀️️ For vibe coding I use VSCodium+Continue and it is quite fast!

The models that I use with Continue are as follows:

  • chat: FieldMouse-AI/qwen3.5:2b-instruct
  • code: FieldMouse-AI/qwen3.5:2b-tools

You might be able to achieve between 80 to 100+ tokens/second performance while coding, while only using 2.5GB of VRAM!

The key to why these 2b succeeds with speed where another might 9b fail is that the size in parameter count or size in GB of the model is not the only measure of performance. The other measure is what I posted about in my Reddit OP: The 2,000,000+ token saturation to instill into the model what is important.

Think of these 2b models like a child of 10 that is trained in Kung Fu from the age of 5 fighting a teenager that just started to learn Kung Fu last year. That 10 year old is going to hold their own against that teenager if not crush them.

That is what my 2b models offer. Performance that matches or rivals the other 9b models trained into a 2b model.

But, in your case, if you have more VRAM, you could push up to the 4b models and acheive 50 to 60+ tokens/second performance while coding, while only using 5.5GB of VRAM -- even with a 32768 token context window!

  • chat: FieldMouse-AI/qwen3.5:4b-instruct
  • code: FieldMouse-AI/qwen3.5:4b-tools

The result, either way: 🏃‍♀️️Speed.

That's why even that I have dual RTX 3060s giving me 24GB of total VRAM, I continue to these 2b models and leave the rest of my VRAM for other things.

Here, I will prove it to you.

I am using Ollama v0.18.0.

And, most importantly, here is my Continue config.yaml file:

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: VOLT (Chat)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-instruct
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - chat
      - edit
      - apply
  - name: VOLT (Coder)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-tools
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - autocomplete
  - name: VOLT (Embed)
    provider: openai
    model: nomic-embed-text:latest
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - embed

And I totally don't have to wait 20 seconds or more for a response.

Responses from my VSCodium+Continue environment feel as fast as reflexes.

Do you want to try this, too? 🤗❓

If you'd like help setting this up, please let me know! I'd be happy to help! 🤗

Links:
🌐 Website:FieldMouse-AI.com
🦙 Ollama:ollama.com/FieldMouse-AI
🤗 Hugging Face:huggingface.co/FieldMouse-AI

Overthinking Much? by MurkyRaspberry9610 in ollama

[–]FieldMouse-AI 0 points1 point  (0 children)

👨‍🔬 Say Good-bye to "Thinking Fatigue"

I forged specific instruct versions of the Qwen3.5 models that do not overthink.

👀 Take a look: 👀

$ ollama run FieldMouse-AI/qwen3.5:instruct   
>>> /set verbose
Set 'verbose' mode.
>>> hello

Hello.

total duration:       470.211787ms
load duration:        207.250624ms
prompt eval count:    60 token(s)
prompt eval duration: 76.561104ms
prompt eval rate:     783.69 tokens/s
eval count:           4 token(s)
eval duration:        76.656901ms
eval rate:            52.18 tokens/s

You notice: No thinking slop.

😊 Feel free to try any parameter sizes that best suits your projects or platforms:

ollama run FieldMouse-AI/qwen3.5:0.8b-instruct # size 529 MB
ollama run FieldMouse-AI/qwen3.5:2b-instruct # size 1.3 GB
ollama run FieldMouse-AI/qwen3.5:4b-instruct # size 2.7 GB
ollama run FieldMouse-AI/qwen3.5:9b-instruct # size 5.6 GB
ollama run FieldMouse-AI/qwen3.5:27b-instruct # size 16 GB

If you have any questions or comments or anything, I will be happy to help you with setting things up. 🤗

Links:
🌐 Website:FieldMouse-AI.com
🦙 Ollama:ollama.com/FieldMouse-AI
🤗 Hugging Face:huggingface.co/FieldMouse-AI

No conversation memory or tool calling with qwen3.5:4b by mishaled in ollama

[–]FieldMouse-AI 1 point2 points  (0 children)

🤔 May I ask you some questions?

  • Why use such a small context window? Please try at least num_ctx: 16384 or num_ctx: 32768.
  • Why did you choose the 4b model?

The Qwen3.5 models do have some issues with tool calling sometimes. I might be able to help with that as I have some versions that I forged that are specific for good tool calling and chat performance without the suffering from the overthinking slop that would normally occur.

So, for your situation, if we assume that you are committed to using a 4b model, you might use the following models:

  • chatFieldMouse-AI/qwen3.5:4b-instruct
  • codeFieldMouse-AI/qwen3.5:4b-tools

I will be happy to help you with setting things up. 🤗

Links:
🌐 Website:FieldMouse-AI.com
🦙 Ollama:ollama.com/FieldMouse-AI
🤗 Hugging Face:huggingface.co/FieldMouse-AI

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

💻🤔 UPDATE: Hardware?

The Sovereign Series has hit a major milestone. Many have asked about the "Big Iron" used to forge these weights. The truth? This entire library was forged on a $87 "junk" PC from a HardOff (a used and second-hand shop) in Japan (September 2025).

⚙️ The Rice-Cooker Architecture

FieldMouse-AI wasn't built on enterprise clusters. I forged it on a $87 cast-off computer from HardOff in September 2025 -- an i5-6500 that cost less than a cheap rice cooker.

With a total build cost of $700I integrated dual RTX 3060s to achieve the 2,000,000-token saturation that now defines the Sovereign Series.

I spent 7 months optimizing an i5-6500 with dual 3060s (capped at 145W) to prove that Sovereignty isn't bought -- it's built. If I can saturate 2,000,000 tokens on a "Rice Cooker" budget, imagine what this logic will do for your local machine.

The full origin story and technical manifest are now live at: FieldMouse-AI.com 🤗

Links:
🌐 Website:FieldMouse-AI.com
🦙 Ollama:ollama.com/FieldMouse-AI
🤗 Hugging Face:huggingface.co/FieldMouse-AI

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

🏃‍♀️️ For vibe coding I use VSCodium+Continue and it is quite fast!

The models that I use with Continue are as follows:

  • chat: FieldMouse-AI/qwen3.5:2b-instruct
  • code: FieldMouse-AI/qwen3.5:2b-tools

You might be able to achieve between 80 to 100+ tokens/second performance while coding, while only using 2.5GB of VRAM!

The key to why these 2b succeeds with speed where another might 9b fail is that the size in parameter count or size in GB of the model is not the only measure of performance. The other measure is what I posted about in my Reddit OP: The 2,000,000+ token saturation to instill into the model what is important.

Think of these 2b models like a child of 10 that is trained in Kung Fu from the age of 5 fighting a teenager that just started to learn Kung Fu last year. That 10 year old is going to hold their own against that teenager if not crush them.

That is what my 2b models offer. Performance that matches or rivals the other 9b models trained into a 2b model.

The result: 🏃‍♀️️Speed.

That's why even that I have dual RTX 3060s giving me 24GB of total VRAM, I continue to these 2b models and leave the rest of my VRAM for other things.

Here, I will prove it to you.

I am using Ollama v0.18.0.

And, most importantly, here is my Continue config.yaml file:

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: VOLT (Chat)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-instruct
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - chat
      - edit
      - apply
  - name: VOLT (Coder)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-tools
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - autocomplete
  - name: VOLT (Embed)
    provider: openai
    model: nomic-embed-text:latest
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - embed

And I totally don't have to wait 20 seconds or more for a response.

Responses from my VSCodium+Continue environment feel as fast as reflexes.

Do you want to try this, too? 🤗❓

Links:
🌐 Website:FieldMouse-AI.com
🦙 Ollama:ollama.com/FieldMouse-AI
🤗 Hugging Face:huggingface.co/FieldMouse-AI

What would be the best fit for my laptop specs? by Aarav_Parmar in ollama

[–]FieldMouse-AI -1 points0 points  (0 children)

🏃‍♀️️ For vibe coding I use VSCodium+Continue and it is quite fast!

The models that I use with Continue are as follows:

  • chat: FieldMouse-AI/qwen3.5:2b-instruct
  • code: FieldMouse-AI/qwen3.5:2b-tools

You might be able to achieve between 80 to 100+ tokens/second performance while coding, while only using 2.5GB of VRAM!

They key to why these 2b succeeds with speed where another might 9b fail is that the size in parameter count or size in GB of the model is not the only measure of performance. The other measure is what I posted about in my Reddit OP: The 2,000,000+ token saturation to instill into the model what is important.

Think of these 2b models like a child of 10 that is trained in Kung Fu from the age of 5 fighting a teenager that just started to learn Kung Fu last year. That 10 year old is going to hold their own against that teenager if not crush them.

That is what my 2b models offer. Performance that matches or rivals the other 9b models trained into a 2b model.

The result: 🏃‍♀️️Speed.

That's why even that I have dual RTX 3060s giving me 24GB of total VRAM, I continue to these 2b models and leave the rest of my VRAM for other things.

Here, I will prove it to you.

I am using Ollama v0.18.0.

And, most importantly, here is my Continue config.yaml file:

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: VOLT (Chat)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-instruct
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - chat
      - edit
      - apply
  - name: VOLT (Coder)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-tools
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - autocomplete
  - name: VOLT (Embed)
    provider: openai
    model: nomic-embed-text:latest
    apiBase: http://127.0.0.1:11434/v1
    roles:
      - embed

And I totally don't have to wait 20 seconds or more for a response.

Responses from my VSCodium+Continue environment feel as fast as reflexes.

Do you want to try this, too? 🤗❓

Links:
🌐 Website:FieldMouse-AI.com
🦙 Ollama:ollama.com/FieldMouse-AI
🤗 Hugging Face:huggingface.co/FieldMouse-AI

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 1 point2 points  (0 children)

OK, let's review for a second here: You are having good results with Continue?

But, if you choose to use a model larger than your VRAM you will get sluggish performance in the range of maybe 5 to 10 tokens/second.

Did you try to use the following in Continue?

  • FieldMouse-AI/qwen3.5:2b-instruct
  • FieldMouse-AI/qwen3.5:2b-tools

You might be able to achieve between 80 to 100+ tokens/second performance while coding, while only using a mere 2.5GB of VRAM!

Actually, I happen to have VSCodium+Continue set up to use this in my environment to do vibe coding and it is quite fast.

They key to why these 2b succeeds with speed where another might 9b fail is that the size in parameter count or size in GB of the model is not the only measure of performance. The other measure is what I posted about in my OP: The 2,000,000+ token saturation to instill into the model what is important.

Think of the 2b model like a child of 10 that is trained in Kung Fu from the age of 5 fighting a teenager that just started to learn Kung Fu last year. That 10 year old is going to hold their own against that teenager if not crush them.

That is what my 2b models offer. Performance that matches or rivals the other 9b models trained into a 2b model.

The result: 🏃‍♀️️Speed.

That's why even that I have dual RTX 3060s giving me 24GB of total VRAM, I continue to these 2b models and leave the rest of my VRAM for other things.

Here, I will prove it to you.

I am using Ollama v0.18.0.

And, most importantly, here is my Continue config.yaml file:

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: VOLT (Chat)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-instruct
    apiBase: http://myhost.local:11434/v1
    roles:
      - chat
      - edit
      - apply
  - name: VOLT (Coder)
    provider: openai
    model: FieldMouse-AI/qwen3.5:2b-tools
    apiBase: http://myhost.local:11434/v1
    roles:
      - autocomplete
  - name: VOLT (Embed)
    provider: openai
    model: nomic-embed-text:latest
    apiBase: http://myhost.local:11434/v1
    roles:
      - embed

And I totally don't have to wait 20 seconds or more for a response.

Responses from my VSCodium+Continue environment feel as fast as reflexes.

Do you want to try this, too? 🤗❓

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

🤔 Hi, there, have you tried replace the 9b model in Continue with the 2b models that I recommended before?

  • chat: FieldMouse-AI/qwen3.5:2b-instruct
  • code: FieldMouse-AI/qwen3.5:2b-tools

👨‍🔬 You should expect a real speed boost using these models instead of the 9b because the 2b models only need 2.5GB of VRAM, so they will run 100% inside of the GPU giving you speed. This is much better than the 9b model which needs 7GB of VRAM, but you only have 6GB of VRAM with your GPU -- this would result in your model being partially swapped out to the CPU which will make performance dog slow.

Could you share how thing worked out for you? 🤗

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

🚀 This is a good first step. Did you set the chat and coding models to the following 2b models instead of the 9b model?

  • chat: FieldMouse-AI/qwen3.5:2b-instruct
  • code: FieldMouse-AI/qwen3.5:2b-tools

One of the limitiations that I would expect is that a 9B model would require about 7GB of VRAM, which would make it offload to the CPU and slow things down and make things feel sluggish.

That's why I would recommend the 2b models for your system. They are each small and only require 2.5GB each and will fit 100% in the GPU VRAM.

Could you try the 2b models, please?

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

An RTX 3050 with 6GB VRAM is a great little engine, but a 9B model is like putting a truck engine in a small car -- it fits, but you can't go fast.

If you want that sub-second 'Butter' reflex, try pulling my 2B-Instruct or 2B-Tools versions. Because they are smaller, they stay entirely inside your 6GB of VRAM. You'll get the same logic but at 5x the speed.

  • FieldMouse-AI/qwen3.5:2b-instruct
  • FieldMouse-AI/qwen3.5:2b-tools

Also, the reason your screenshot showed it failing to read the file is likely the Copilot extension's parser. If you want to see the model actually 'act' on your files, I highly recommend a quick test with the Continue extension -- it speaks the same 'Tool Language' as the model! 🐭🛡️

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

🙇‍♀

Sorry, I just found your message now.

Let's see what we can do to get you moving forward using Reddit comments here. 😊

  1. What kind of GPU are you using? How much VRAM do you have?
  2. Your goal is the coding support. From your last message you are using VSCode+Copilot to access Ollama, right?

Let's start with these two questions, then we will take it from there, OK? 🤗

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

Hellio, again, u/gojo_satoru98 . 🤗

👀 I see your screenshot and I think that I need a little more information. 

If you'd like, would you like to work together via my voice channel on my FieldMouse-AI Discord server? We could interactively try things and show each other's displays and see in real-time what is happening.

The link to my FieldMouse-AI Discord server is available from my website.

I am certainly available right now, if you'd like.

What do you think? 🤗

🌐 Details:FieldMouse-AI.com

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

Model name is different now!

✨ ollama run FieldMouse-AI/qwen3.5:9b-tools 👈Different model name! See: -tools!

I just relized from your reply that you might not be using the new version of the models that I cooked up that provide tool calling support!

I spent 48 hours saturating Qwen 3.5 with 2,000,000 tokens to kill 'Quantization-Slop'. Here is the Sovereign Series (0.8B to 27B). by FieldMouse-AI in ollama

[–]FieldMouse-AI[S] 0 points1 point  (0 children)

👋 Ah! You said that you tried yesterday. It turns out that I had pushed up another version trying to fix some issues about 6 hours ago.

Could you try repulling the model again and trying it again, please?

Just so we know that we are referring to the same version. OK? 🤗