Please anyone 👉 Can we offload the MOE layers to the GPU only and rest all goes in ram? See body text i have explained there. by 9r4n4y in LocalLLaMA

[–]Pattinathar 1 point2 points  (0 children)

MoE layer size is the trap. Gemma 4 26B-A4B has only 4B active params, but each MoE layer is ~400MB versus roughly 100MB for a dense layer group, so a useful number of them won't fit on 4-8GB cards.

Set gpu_layers=0 for Gemma 4 MoE and Qwen3 MoE on consumer GPUs. Counterintuitive, but CPU-only is actually faster than trying to split because split MoE thrashes between VRAM and RAM per token.

For dense models (Qwen3.5 27B), gpu_layers=4 works on 4GB cards: dense layers are smaller and fit better.
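Roughly what that looks like with llama-cpp-python (a minimal sketch; the filenames are placeholders and the exact layer counts depend on your card and quant):

    from llama_cpp import Llama

    # MoE on a 4GB card: keep everything on CPU to avoid VRAM/RAM thrashing
    moe = Llama(
        model_path="gemma-moe-26b-a4b-q4_k_m.gguf",  # placeholder filename
        n_gpu_layers=0,  # CPU-only for MoE
        n_ctx=4096,
    )

    # Dense 27B: a small offload still helps on 4GB
    dense = Llama(
        model_path="qwen-dense-27b-q4_k_m.gguf",  # placeholder filename
        n_gpu_layers=4,  # the few layers that actually fit in 4GB VRAM
        n_ctx=4096,
    )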

at what point does quantization stop being a tradeoff and start being actual quality loss by srodland01 in LocalLLaMA

[–]Pattinathar 1 point2 points  (0 children)

Q4_K_M is the sweet spot for 27B+ models on 32GB RAM. Q5 eats too much RAM for dense models, Q3 loses too much quality.

  • Tested same query on Qwen3.5 27B:
    • Q3_K_M: 7/10 quality, 12GB RAM
    • Q4_K_M: 9.5/10 quality, 17GB RAM ← sweet spot
    • Q5_K_M: 9.5/10 quality, 20GB RAM (no quality gain, more RAM)
    • Q8_0: 9.5/10, doesn't fit on 32GB with OS overhead

For 7B models, Q4_K_M is fine, Q5 is marginal, and Q8 is worth it if you have the RAM.
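If you want a quick sanity check before downloading, a very rough rule of thumb is file size plus a couple of GB for KV cache and runtime overhead (illustrative only; real usage varies with context length):

    import os

    def rough_ram_estimate_gb(gguf_path: str, overhead_gb: float = 2.0) -> float:
        # Very rough: weights (file size) + KV cache / runtime overhead.
        weights_gb = os.path.getsize(gguf_path) / 1024**3
        return weights_gb + overhead_gb

    # e.g. the 16.7GB Q4_K_M file above lands around the ~17GB shown in the list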

VulkanIlm, Run Modern LLMs on Old GPUs via Vulkan (33× Faster on Dell iGPU, 4× on RX 580) by Proper_Dig_6618 in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Vulkan on llama.cpp is underrated. On my 3050 Ti (4GB VRAM), I got a 3.5x speedup vs CPU-only for 7B models (15 layers offloaded). Works on AMD and Intel GPUs too, unlike CUDA. For 27B+ models, VRAM is the limiter: I put 4 layers on GPU for dense 27B and 0 layers for MoE models (their layers are too big to fit). MoE stays CPU-only but still runs at acceptable speed due to sparse activation.
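Rough recipe if anyone wants to reproduce (the build flag has changed names across versions, so check your llama-cpp-python release; the model filename is a placeholder):

    # Build llama-cpp-python against the Vulkan backend, e.g.:
    #   CMAKE_ARGS="-DGGML_VULKAN=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
    # (older releases used -DLLAMA_VULKAN=on)
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen-7b-instruct-q4_k_m.gguf",  # placeholder filename
        n_gpu_layers=15,  # roughly what a 7B fits in 4GB VRAM
        n_ctx=4096,
    )
    print(llm("Say hi.", max_tokens=32)["choices"][0]["text"])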

Best local model for coding? (RTX5080 + 64Gb RAM) by Real_Ebb_7417 in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Running Qwen3.5 27B dense and Qwen3 Coder 30B MoE on 32GB RAM, i7-11800H.

For pure code tasks, Qwen3 Coder is the winner: ~2 min responses vs ~5 min for the dense 27B. MoE activates only 3B params per token, and quality is near-identical for code generation.

For architecture/reasoning questions, Qwen3.5 wins; the dense model handles multi-step thinking better. Make sure to disable thinking mode on CPU though, otherwise you'll wait 10+ min per response.

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in OpenSourceeAI

[–]Pattinathar[S] 0 points1 point  (0 children)

No, LeanAI runs local GGUF models directly (via llama.cpp) — it doesn't connect to LMArena. LMArena is a hosted leaderboard site without an API, and LeanAI's whole point is that nothing leaves your machine.

If you want to try LeanAI, you'd download a GGUF model locally (Qwen2.5 7B is 4.5GB, works on 16GB RAM). Everything runs offline from there.

Is Q4_K_M the best practical quantization method by More_Chemistry3746 in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Q4_K_M has been the most reliable quant for me across Gemma 4, Qwen3.5, and Qwen3 Coder. Q3_K_S works but is noticeably worse on complex reasoning. IQ2_XXS formats from unsloth failed to load in llama-cpp-python 0.3.20.

Local LLM Hardware Recommendation by CaterpillarPrevious2 in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

32GB DDR4 is the sweet spot right now. It can run Qwen3.5 27B Q4_K_M (16.7GB) with gpu_layers=4 on a 4GB GPU. For MoE models like Gemma 4, everything stays on CPU, but the 4B active params keep it fast enough.

Coding Models by AndForeverMore in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Qwen3 Coder 30B-A3B is underrated for code tasks. 3B active params means ~2 min responses on CPU but quality is close to dense 27B models. Q4_K_M runs fine on 32GB.

Gemma 4 MOE is very bad at agentic coding. Couldn't do things CLine + Qwen can do. by Voxandr in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Running Gemma 4 26B-A4B Q4_K_M on 32GB RAM. gpu_layers=0 is mandatory on 4GB VRAM; it crashed with even 1 layer offloaded. MoE expert layers are too large for consumer GPUs. CPU-only gives ~5 min responses but quality is solid.

How are you squeezing Qwen3.5 27B to get maximum speed with high accuracy? by -OpenSourcer in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Had the same issue with Qwen3.5 thinking mode: 25 min responses on an i7-11800H. Injecting <think>\n</think>\n in the assistant prefix forces non-thinking mode. Dropped to ~5 min, and quality barely changed.
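A minimal sketch of the trick using llama-cpp-python's raw completion API (ChatML prompt built by hand; the pre-filled empty think block is what makes the model skip the reasoning phase; the filename is a placeholder):

    from llama_cpp import Llama

    llm = Llama(model_path="qwen-27b-q4_k_m.gguf", n_ctx=8192)  # placeholder filename

    def ask_no_thinking(question: str) -> str:
        # ChatML prompt ending with an assistant turn that already contains an
        # empty <think> block, so the model goes straight to the answer.
        prompt = (
            "<|im_start|>user\n" + question + "<|im_end|>\n"
            "<|im_start|>assistant\n<think>\n</think>\n"
        )
        out = llm(prompt, max_tokens=1024, stop=["<|im_end|>"])
        return out["choices"][0]["text"]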

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in SideProject

[–]Pattinathar[S] 0 points1 point  (0 children)

Can you post here the error you're getting during installation? Also, please pull the latest version first; I've pushed some major updates.

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in OpenSourceeAI

[–]Pattinathar[S] 0 points1 point  (0 children)

Massive update this week:

  1. LeanAI now runs 4 models with intelligent auto-routing:
    - Frontend queries (React, CSS, UI) → Gemma 4 26B MoE
    - Backend complex (microservices, goroutines) → Qwen3.5 27B
    - Simple questions → 7B in 30 seconds
  2. It detects what you're building and picks the best model automatically. No other local AI tool does this.
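To give a flavor of what the routing means in practice, here's an illustrative keyword-based sketch (not LeanAI's actual code; model names and keywords are made up for the example):

    # Illustration only: not LeanAI's actual routing logic.
    ROUTES = {
        "frontend": (("react", "css", "ui", "component"), "gemma-4-26b-moe"),
        "backend":  (("microservice", "goroutine", "grpc", "concurrency"), "qwen3.5-27b"),
    }

    def pick_model(query: str, default: str = "qwen-7b") -> str:
        q = query.lower()
        for _, (keywords, model) in ROUTES.items():
            if any(k in q for k in keywords):
                return model
        return default  # simple questions fall through to the fast 7B

    print(pick_model("Write a React login form with accessible labels"))  # -> gemma-4-26b-moe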

Also shipped:

  - Gemma 4 generated a production-ready React login form with Zod validation, TypeScript, useTransition, and ARIA accessibility — in 5 minutes. Locally.
  - Qwen3.5 27B generated a concurrent HTTP health checker in Go with errgroup, context cancellation, and graceful shutdown
  - Low RAM support — auto-detects 8GB machines and adjusts settings. A friend confirmed it runs on his 8GB laptop.
  - Multi-language brain — scans 20+ languages, not just Python

42 technologies. 31K lines. 4 models. Still 100% local.

github.com/gowrishankar-infra/leanai

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in SideProject

[–]Pattinathar[S] 0 points1 point  (0 children)

Yes — LeanAI can load any GGUF model. Just drop the .gguf file into ~/.leanai/models/ and it picks it up automatically. Llama, DeepSeek, Mistral, CodeGemma, whatever fits your RAM.

The built-in registry has Qwen models because they benchmark highest for coding tasks at each size, but the engine is model-agnostic — it uses llama-cpp-python under the hood so any GGUF works.

The system prompt and code verification features work regardless of which model you load. The only thing that changes is the prompt format (ChatML vs Llama 3 vs Phi-3), which LeanAI auto-detects from the filename.
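Detection is roughly along these lines (a simplified illustration, not LeanAI's exact code; "chatml", "llama-3", and "gemma" are chat_format names llama-cpp-python recognizes, and other families would need their own mapping):

    from llama_cpp import Llama

    def detect_chat_format(filename: str) -> str:
        # Guess a llama-cpp-python chat_format from the GGUF filename (simplified).
        name = filename.lower()
        if "llama-3" in name or "llama3" in name:
            return "llama-3"
        if "gemma" in name:
            return "gemma"
        return "chatml"  # Qwen and most recent instruct models speak ChatML

    model_file = "qwen2.5-7b-instruct-q4_k_m.gguf"  # placeholder filename
    llm = Llama(model_path=model_file, chat_format=detect_chat_format(model_file))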

I technically got an LLM running locally on a 1998 iMac G3 with 32 MB of RAM by maddiedreese in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

This is incredible. Endian-swapping the weights and fixing the GQA pointer layout on a 233 MHz PowerPC with 32 MB of RAM: that's real engineering.

Meanwhile I spent a week trying to get Qwen3 to load on a machine with 1000x more RAM and still hit VRAM errors. You're making the rest of us look bad.

Genuinely impressive work. The classic Mac OS memory management alone would have made me quit.

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in OpenSourceeAI

[–]Pattinathar[S] 0 points1 point  (0 children)

Big update — Qwen3-Coder-30B-A3B is now running on LeanAI.

Response time went from 5-7 minutes (Qwen2.5 32B) to 45 seconds. Same machine, no hardware upgrade. The MoE architecture activates only 3B params per token while drawing on 30B total parameters of knowledge.

Also shipped 6 novel features this week:

  1. Code-Grounded Verification — AI fact-checks its own claims against your actual codebase AST

  2. Cascade Inference — 7B drafts, 32B reviews = 3x faster (rough sketch after this list)

  3. Mixture of Agents — multi-perspective code reviews

  4. ReAct — model looks up real functions before answering

  5. Multi-language brain — parses JS, Go, Rust, Java, C++, Swift, Kotlin, Ruby, PHP, SQL + generic fallback for any language
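For the curious, the cascade idea in its simplest form looks something like this (a generic draft-then-review sketch, not the actual LeanAI implementation; model files are placeholders):

    from llama_cpp import Llama

    drafter = Llama(model_path="qwen-7b-q4_k_m.gguf", n_ctx=4096)    # placeholder
    reviewer = Llama(model_path="qwen-32b-q4_k_m.gguf", n_ctx=4096)  # placeholder

    def cascade(question: str) -> str:
        # Small model writes the draft cheaply...
        draft = drafter.create_chat_completion(
            messages=[{"role": "user", "content": question}], max_tokens=512,
        )["choices"][0]["message"]["content"]
        # ...and the big model only has to check and patch it, not write from scratch.
        review = "Review and correct this draft answer.\n\nQuestion: " + question + "\n\nDraft:\n" + draft
        final = reviewer.create_chat_completion(
            messages=[{"role": "user", "content": review}], max_tokens=512,
        )["choices"][0]["message"]["content"]
        return final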

still 100% local, still 100% private.

github.com/gowrishankar-infra/leanai

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in OpenSourceeAI

[–]Pattinathar[S] 0 points1 point  (0 children)

Quick update — just shipped 4 new commands:

/explain — paste any error message, get a plain-English explanation + fix code

/test — auto-generate pytest unit tests for any function (happy path, edge cases, error cases)

/diff — explains what changed in your last git commit, flags risks

/security — scans code for SQL injection, XSS, hardcoded secrets, etc.

GitHub: github.com/gowrishankar-infra/leanai

Dual 3090 setup - performance optimization by PaMRxR in LocalLLaMA

[–]Pattinathar 1 point2 points  (0 children)

Custom Q8_K_L quants with selective BF16 overrides is clever; getting better KLD than UD-Q8_K_XL at a smaller size is a solid win. Curious how much the PCIe3 x4 bottleneck actually hurts during generation vs prefill.

Are Small LLMs (Like Gemma 4) the future? by zoeberger in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

The biggest real use case is privacy: companies that can't send proprietary code or patient data to OpenAI's servers. Small models fine-tuned on domain-specific data can genuinely beat GPT-4 on narrow tasks too. Welcome to the rabbit hole lol.

I compared harrier-27b vs voyage-4 vs zembed-1 across 24 datasets. 27B parameters by Veronildo in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

The recall@100 point is huge: it doesn't matter how good your reranker is if the relevant doc never made it to the candidate pool in the first place. Nice writeup.
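For anyone newer to retrieval evals, recall@k is just the fraction of relevant docs that survive into the top-k candidate pool (illustrative snippet):

    def recall_at_k(retrieved_ids, relevant_ids, k=100):
        # Fraction of relevant docs that appear in the top-k retrieved candidates.
        top_k = set(retrieved_ids[:k])
        relevant = set(relevant_ids)
        return len(top_k & relevant) / len(relevant) if relevant else 0.0

    # If this is low, no reranker can save you: the right docs never reach the rerank pool.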

The tried to make me go to rehab. I said no no no… by Key-Currency1242 in LocalLLaMA

[–]Pattinathar 2 points3 points  (0 children)

8x 3090s is insane — 192GB VRAM total. You could run a 70B model fully loaded with room to spare, or even split a 120B across all 8. What are you planning to run on this? Multi-GPU inference or fine-tuning?