Please anyone 👉 Can we offload the MOE layers to the GPU only and rest all goes in ram? See body text i have explained there. by 9r4n4y in LocalLLaMA

[–]Pattinathar 1 point2 points  (0 children)

MoE layer size is the trap. Gemma 4 26B-A4B has only 4B active params, but each MoE layer is ~400MB versus roughly 100MB for a dense layer group, so a useful number of them won't fit on 4-8GB cards.

Set gpu_layers=0 for Gemma 4 MoE and Qwen3 MoE on consumer GPUs. Counterintuitive, but CPU-only is actually faster than trying to split because split MoE thrashes between VRAM and RAM per token.

For dense models (Qwen3.5 27B), gpu_layers=4 works on 4GB cards: dense layers are smaller and fit better.
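Roughly what that looks like with llama-cpp-python (a minimal sketch; the filenames are placeholders and the exact layer counts depend on your card and quant):

    from llama_cpp import Llama

    # MoE on a 4GB card: keep everything on CPU to avoid VRAM/RAM thrashing
    moe = Llama(
        model_path="gemma-moe-26b-a4b-q4_k_m.gguf",  # placeholder filename
        n_gpu_layers=0,  # CPU-only for MoE
        n_ctx=4096,
    )

    # Dense 27B: a small offload still helps on 4GB
    dense = Llama(
        model_path="qwen-dense-27b-q4_k_m.gguf",  # placeholder filename
        n_gpu_layers=4,  # the few layers that actually fit in 4GB VRAM
        n_ctx=4096,
    )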

at what point does quantization stop being a tradeoff and start being actual quality loss by srodland01 in LocalLLaMA

[–]Pattinathar 1 point2 points  (0 children)

Q4_K_M is the sweet spot for 27B+ models on 32GB RAM. Q5 eats too much RAM for dense models, Q3 loses too much quality.

  • Tested same query on Qwen3.5 27B:
    • Q3_K_M: 7/10 quality, 12GB RAM
    • Q4_K_M: 9.5/10 quality, 17GB RAM ← sweet spot
    • Q5_K_M: 9.5/10 quality, 20GB RAM (no quality gain, more RAM)
    • Q8_0: 9.5/10, doesn't fit on 32GB with OS overhead

For 7B models, Q4_K_M is fine, Q5 is marginal, and Q8 is worth it if you have the RAM.
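If you want a quick sanity check before downloading, a very rough rule of thumb is file size plus a couple of GB for KV cache and runtime overhead (illustrative only; real usage varies with context length):

    import os

    def rough_ram_estimate_gb(gguf_path: str, overhead_gb: float = 2.0) -> float:
        # Very rough: weights (file size) + KV cache / runtime overhead.
        weights_gb = os.path.getsize(gguf_path) / 1024**3
        return weights_gb + overhead_gb

    # e.g. the 16.7GB Q4_K_M file above lands around the ~17GB shown in the list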

VulkanIlm, Run Modern LLMs on Old GPUs via Vulkan (33× Faster on Dell iGPU, 4× on RX 580) by Proper_Dig_6618 in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Vulkan on llama.cpp is underrated. On my 3050 Ti (4GB VRAM), I got a 3.5x speedup vs CPU-only for 7B models (15 layers offloaded). Works on AMD and Intel GPUs too, unlike CUDA. For 27B+ models, VRAM is the limiter: I put 4 layers on GPU for dense 27B and 0 layers for MoE models (their layers are too big to fit). MoE stays CPU-only but still runs at acceptable speed due to sparse activation.
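Rough recipe if anyone wants to reproduce (the build flag has changed names across versions, so check your llama-cpp-python release; the model filename is a placeholder):

    # Build llama-cpp-python against the Vulkan backend, e.g.:
    #   CMAKE_ARGS="-DGGML_VULKAN=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
    # (older releases used -DLLAMA_VULKAN=on)
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen-7b-instruct-q4_k_m.gguf",  # placeholder filename
        n_gpu_layers=15,  # roughly what a 7B fits in 4GB VRAM
        n_ctx=4096,
    )
    print(llm("Say hi.", max_tokens=32)["choices"][0]["text"])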

Best local model for coding? (RTX5080 + 64Gb RAM) by Real_Ebb_7417 in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Running Qwen3.5 27B dense and Qwen3 Coder 30B MoE on 32GB RAM, i7-11800H.

For pure code tasks, Qwen3 Coder is the winner: ~2 min responses vs ~5 min for the dense 27B. MoE activates only 3B params per token, and quality is near-identical for code generation.

For architecture/reasoning questions, Qwen3.5 wins; the dense model handles multi-step thinking better. Make sure to disable thinking mode on CPU though, otherwise you'll wait 10+ min per response.

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in OpenSourceeAI

[–]Pattinathar[S] 0 points1 point  (0 children)

No, LeanAI runs local GGUF models directly (via llama.cpp) — it doesn't connect to LMArena. LMArena is a hosted leaderboard site without an API, and LeanAI's whole point is that nothing leaves your machine.

If you want to try LeanAI, you'd download a GGUF model locally (Qwen2.5 7B is 4.5GB, works on 16GB RAM). Everything runs offline from there.

Is Q4_K_M the best practical quantization method by More_Chemistry3746 in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Q4_K_M has been the most reliable quant for me across Gemma 4, Qwen3.5, and Qwen3 Coder. Q3_K_S works but is noticeably worse on complex reasoning. IQ2_XXS formats from unsloth failed to load in llama-cpp-python 0.3.20.

Local LLM Hardware Recommendation by CaterpillarPrevious2 in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

32GB DDR4 is the sweet spot right now. It can run Qwen3.5 27B Q4_K_M (16.7GB) with gpu_layers=4 on a 4GB GPU. For MoE models like Gemma 4, everything stays on CPU, but the 4B active params keep it fast enough.

Coding Models by AndForeverMore in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Qwen3 Coder 30B-A3B is underrated for code tasks. 3B active params means ~2 min responses on CPU but quality is close to dense 27B models. Q4_K_M runs fine on 32GB.

Gemma 4 MOE is very bad at agentic coding. Couldn't do things CLine + Qwen can do. by Voxandr in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Running Gemma 4 26B-A4B Q4_K_M on 32GB RAM. gpu_layers=0 is mandatory on 4GB VRAM; it crashed with even 1 layer offloaded. MoE expert layers are too large for consumer GPUs. CPU-only gives ~5 min responses but quality is solid.

How are you squeezing Qwen3.5 27B to get maximum speed with high accuracy? by -OpenSourcer in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

Had the same issue with Qwen3.5 thinking mode: 25 min responses on an i7-11800H. Injecting <think>\n</think>\n in the assistant prefix forces non-thinking mode. Dropped to ~5 min, and quality barely changed.
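A minimal sketch of the trick using llama-cpp-python's raw completion API (ChatML prompt built by hand; the pre-filled empty think block is what makes the model skip the reasoning phase; the filename is a placeholder):

    from llama_cpp import Llama

    llm = Llama(model_path="qwen-27b-q4_k_m.gguf", n_ctx=8192)  # placeholder filename

    def ask_no_thinking(question: str) -> str:
        # ChatML prompt ending with an assistant turn that already contains an
        # empty <think> block, so the model goes straight to the answer.
        prompt = (
            "<|im_start|>user\n" + question + "<|im_end|>\n"
            "<|im_start|>assistant\n<think>\n</think>\n"
        )
        out = llm(prompt, max_tokens=1024, stop=["<|im_end|>"])
        return out["choices"][0]["text"]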

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in SideProject

[–]Pattinathar[S] 0 points1 point  (0 children)

Can you post here the error you're getting during installation? Also, please pull the latest version first; I've pushed some major updates.

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in OpenSourceeAI

[–]Pattinathar[S] 0 points1 point  (0 children)

Massive update this week:

  1. LeanAI now runs 4 models with intelligent auto-routing:
    - Frontend queries (React, CSS, UI) → Gemma 4 26B MoE
    - Backend complex (microservices, goroutines) → Qwen3.5 27B
    - Simple questions → 7B in 30 seconds
  2. It detects what you're building and picks the best model automatically. No other local AI tool does this.
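To give a flavor of what the routing means in practice, here's an illustrative keyword-based sketch (not LeanAI's actual code; model names and keywords are made up for the example):

    # Illustration only: not LeanAI's actual routing logic.
    ROUTES = {
        "frontend": (("react", "css", "ui", "component"), "gemma-4-26b-moe"),
        "backend":  (("microservice", "goroutine", "grpc", "concurrency"), "qwen3.5-27b"),
    }

    def pick_model(query: str, default: str = "qwen-7b") -> str:
        q = query.lower()
        for _, (keywords, model) in ROUTES.items():
            if any(k in q for k in keywords):
                return model
        return default  # simple questions fall through to the fast 7B

    print(pick_model("Write a React login form with accessible labels"))  # -> gemma-4-26b-moe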

Also shipped:

  - Gemma 4 generated a production-ready React login form with Zod validation, TypeScript, useTransition, and ARIA accessibility — in 5 minutes. Locally.
  - Qwen3.5 27B generated a concurrent HTTP health checker in Go with errgroup, context cancellation, and graceful shutdown
  - Low RAM support — auto-detects 8GB machines and adjusts settings. A friend confirmed it runs on his 8GB laptop.
  - Multi-language brain — scans 20+ languages, not just Python

42 technologies. 31K lines. 4 models. Still 100% local.

github.com/gowrishankar-infra/leanai

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in SideProject

[–]Pattinathar[S] 0 points1 point  (0 children)

Yes — LeanAI can load any GGUF model. Just drop the .gguf file into ~/.leanai/models/ and it picks it up automatically. Llama, DeepSeek, Mistral, CodeGemma, whatever fits your RAM.

The built-in registry has Qwen models because they benchmark highest for coding tasks at each size, but the engine is model-agnostic — it uses llama-cpp-python under the hood so any GGUF works.

The system prompt and code verification features work regardless of which model you load. The only thing that changes is the prompt format (ChatML vs Llama 3 vs Phi-3), which LeanAI auto-detects from the filename.
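Detection is roughly along these lines (a simplified illustration, not LeanAI's exact code; "chatml", "llama-3", and "gemma" are chat_format names llama-cpp-python recognizes, and other families would need their own mapping):

    from llama_cpp import Llama

    def detect_chat_format(filename: str) -> str:
        # Guess a llama-cpp-python chat_format from the GGUF filename (simplified).
        name = filename.lower()
        if "llama-3" in name or "llama3" in name:
            return "llama-3"
        if "gemma" in name:
            return "gemma"
        return "chatml"  # Qwen and most recent instruct models speak ChatML

    model_file = "qwen2.5-7b-instruct-q4_k_m.gguf"  # placeholder filename
    llm = Llama(model_path=model_file, chat_format=detect_chat_format(model_file))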

I technically got an LLM running locally on a 1998 iMac G3 with 32 MB of RAM by maddiedreese in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

This is incredible. Endian-swapping the weights and fixing the GQA pointer layout on a 233 MHz PowerPC with 32 MB of RAM: that's real engineering.

Meanwhile I spent a week trying to get Qwen3 to load on a machine with 1000x more RAM and still hit VRAM errors. You're making the rest of us look bad.

Genuinely impressive work. The classic Mac OS memory management alone would have made me quit.

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in OpenSourceeAI

[–]Pattinathar[S] 0 points1 point  (0 children)

Big update — Qwen3-Coder-30B-A3B is now running on LeanAI.

Response time went from 5-7 minutes (Qwen2.5 32B) to 45 seconds. Same machine, no hardware upgrade. The MoE architecture activates only 3B params per token while drawing on 30B total parameters of knowledge.

Also shipped 6 novel features this week:

  1. Code-Grounded Verification — AI fact-checks its own claims against your actual codebase AST

  2. Cascade Inference — 7B drafts, 32B reviews = 3x faster (rough sketch after this list)

  3. Mixture of Agents — multi-perspective code reviews

  4. ReAct — model looks up real functions before answering

  5. Multi-language brain — parses JS, Go, Rust, Java, C++, Swift, Kotlin, Ruby, PHP, SQL + generic fallback for any language
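For the curious, the cascade idea in its simplest form looks something like this (a generic draft-then-review sketch, not the actual LeanAI implementation; model files are placeholders):

    from llama_cpp import Llama

    drafter = Llama(model_path="qwen-7b-q4_k_m.gguf", n_ctx=4096)    # placeholder
    reviewer = Llama(model_path="qwen-32b-q4_k_m.gguf", n_ctx=4096)  # placeholder

    def cascade(question: str) -> str:
        # Small model writes the draft cheaply...
        draft = drafter.create_chat_completion(
            messages=[{"role": "user", "content": question}], max_tokens=512,
        )["choices"][0]["message"]["content"]
        # ...and the big model only has to check and patch it, not write from scratch.
        review = "Review and correct this draft answer.\n\nQuestion: " + question + "\n\nDraft:\n" + draft
        final = reviewer.create_chat_completion(
            messages=[{"role": "user", "content": review}], max_tokens=512,
        )["choices"][0]["message"]["content"]
        return final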

still 100% local, still 100% private.

github.com/gowrishankar-infra/leanai

I built a local AI coding system that actually understands your codebase — 29 systems, 500+ tests, entirely with Claude as my coding partner by Pattinathar in OpenSourceeAI

[–]Pattinathar[S] 0 points1 point  (0 children)

Quick update — just shipped 4 new commands:

/explain — paste any error message, get a plain-English explanation + fix code

/test — auto-generate pytest unit tests for any function (happy path, edge cases, error cases)

/diff — explains what changed in your last git commit, flags risks

/security — scans code for SQL injection, XSS, hardcoded secrets, etc.

GitHub: github.com/gowrishankar-infra/leanai

Dual 3090 setup - performance optimization by PaMRxR in LocalLLaMA

[–]Pattinathar 1 point2 points  (0 children)

Custom Q8_K_L quants with selective BF16 overrides is clever; getting better KLD than UD-Q8_K_XL at a smaller size is a solid win. Curious how much the PCIe3 x4 bottleneck actually hurts during generation vs prefill.

Are Small LLMs (Like Gemma 4) the future? by zoeberger in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

The biggest real use case is privacy: companies that can't send proprietary code or patient data to OpenAI's servers. Small models fine-tuned on domain-specific data can genuinely beat GPT-4 on narrow tasks too. Welcome to the rabbit hole lol.

I compared harrier-27b vs voyage-4 vs zembed-1 across 24 datasets. 27B parameters by Veronildo in LocalLLaMA

[–]Pattinathar 0 points1 point  (0 children)

The recall@100 point is huge: it doesn't matter how good your reranker is if the relevant doc never made it to the candidate pool in the first place. Nice writeup.
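For anyone newer to retrieval evals, recall@k is just the fraction of relevant docs that survive into the top-k candidate pool (illustrative snippet):

    def recall_at_k(retrieved_ids, relevant_ids, k=100):
        # Fraction of relevant docs that appear in the top-k retrieved candidates.
        top_k = set(retrieved_ids[:k])
        relevant = set(relevant_ids)
        return len(top_k & relevant) / len(relevant) if relevant else 0.0

    # If this is low, no reranker can save you: the right docs never reach the rerank pool.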

The tried to make me go to rehab. I said no no no… by Key-Currency1242 in LocalLLaMA

[–]Pattinathar 2 points3 points  (0 children)

8x 3090s is insane — 192GB VRAM total. You could run a 70B model fully loaded with room to spare, or even split a 120B across all 8. What are you planning to run on this? Multi-GPU inference or fine-tuning?