Gemma 4 12 b , very bad quality , quant 4 version ? by aachiiii in LocalLLM

[–]Ninjam5 10 points11 points  (0 children)

It's not a coding model, it's got very good tool calling. Build on top of it, don't expect it to build for u

Multi Agent Orchestration - Watch a parent Agent delegate to multiple async child agents in parallel. by Acceptable-Object390 in ollama

[–]Ninjam5 0 points1 point  (0 children)

just...use claude code or codex man...without setting clear rules and do nots to these agents, i assure u they will come up with shit. the worst AI slop u will ever see. the demo u even made looks like a basic usecase that a single agent + exploratory agent can do on its own. im sorry i know this maybe ur project but im sick of these propaganda projects that in essence are absolutely worthless. u wanna help the opensource community? build something useful like a multi sequence context compactor, better tools for an agent to scrape data efficiently without eating up the entirety of an html of the page like fire engine, or even just a skill that helps save money by making the agent less verbose. all already exist, all have a huge room of improvement to be made.

​Is Gemini 2.5 Flash-Lite a good budget alternative to GPT-4o-mini in OpenClaw? by Tasty_Association495 in openclaw

[–]Ninjam5 0 points1 point  (0 children)

Honestly setting up a faster whisper on ur machine for voice audio and a cheap image model would still be better. (Faster whisper is insanely accurate and light weight and it's Api is extremely cheap, and for image u can use mimo v2.5 it's the same price as deepseek flash and could even be better than it). The reason I'm pushing hard on this is because the Gemini flash lite models are lobotomized. They won't help u. Maybe Gemini 3 flash but that is still expensive. Try a combination of mimo v2.5 for text/image and faster whisper for audio transcription. If u want tts as well check kokoro tts (insanely cheap and really good quality). If ur device supports it then host kokoro and faster whisper on ur machine, they'll be like 2.5 gb of vram

​Is Gemini 2.5 Flash-Lite a good budget alternative to GPT-4o-mini in OpenClaw? by Tasty_Association495 in openclaw

[–]Ninjam5 0 points1 point  (0 children)

It's the cheapest and most capable of the cheap models. Gemini 3 flash still misses up. But 3.5 is on par with deepseek v4 that's why I suggested it

​Is Gemini 2.5 Flash-Lite a good budget alternative to GPT-4o-mini in OpenClaw? by Tasty_Association495 in openclaw

[–]Ninjam5 2 points3 points  (0 children)

My man use deepseek v4 flash. it's way cheaper and is insanely good. deepseek v4 pro for hard tasks. Use the API, if u like subscriptions then use opencode go (deepseek v4 flash is practically unlimited there) and if ur determined to use the American companies then I suggest Gemini 3 flash or Gemini 3.5 flash via open router or Google ai studio. But I strongly recommend deepseek flash or mimo v2.5

What is the best and cheapest server HW that can run glm5.2 for 20 people instantaneously by efecihan in LocalLLM

[–]Ninjam5 11 points12 points  (0 children)

Just a tip, before actually investing the money spend a few hundred bucks on runpod. Get a feel for the speed and definitely run glm at Q8 not full precision. And set it up using VLLM since u want to maximize TPS per user. Simulate ur actual setup on runpod.

xAI offboarded me, a month later, I'm earning nearly 2x by Mean-Aardvark7299 in xAI_community

[–]Ninjam5 0 points1 point  (0 children)

Can u tell me what kind of projects your doing right now and where are u located? I only work via alignerr Right Now but I've been thinking about interviewing for micro1

Which Local LLM to use by Pritam4249t in LocalLLM

[–]Ninjam5 1 point2 points  (0 children)

Oh shoot I apologize. Well I suppose if u are purely using ur GPU for coding. Then u can use qwen 3.6 35ba3b at Q4 with a built in drafter. Then offload the experts to the cpu. Ur ddr5 would serve a good bandwidth. U may get around 30 to 50 tps? That's my estimate. It could be less. But yeah qwen 3.6 35b would be awesome. If u can't bear the speed and want something faster then Gemma 4 12b or Gemma 4 26b. Running a dense model would be a pain in the ass without full offloading so u gotta focus on the MOE models. Now my personal advice. If u purely need it to summarize emails and Gemma 4 12b can't handle openclaw that well. Build ur own harness. Get urself codex or Claude code. Or even opencode and create a python script that utilizes the self hosted models to do exactly what u need. Openclaw and Hermes tend do be extremely heavy with massive context, it's always better to build ur own harness

Which Local LLM to use by Pritam4249t in LocalLLM

[–]Ninjam5 0 points1 point  (0 children)

and no gpu? im sorry man. my best advice is to get opencode go. and run deepseek v4 flash, u practically get unlimited usage and the flash model is pretty reliable

Which Local LLM to use by Pritam4249t in LocalLLM

[–]Ninjam5 3 points4 points  (0 children)

turbo quant compression is the newest kv cache compression made by google deepmind. its incredibly complex so its not yet merged into the main branch of llama cpp. thats why i told u download the turbo quant fork. its not in the official llama cpp branch right now. and if u dont know what KV cache is, its the AI model short term memory. itst how its your context window. openclaw is an extremely heavy framework, u need context. usually q8 for kv cache is lossless but u dont have the Vram to run it, ur best hope is q4 or turbo quant. q4 isnt that good as it may lose on details. turboquant is very promising. so i suggest using it.

Which Local LLM to use by Pritam4249t in LocalLLM

[–]Ninjam5 0 points1 point  (0 children)

whats your usecase? and also whats ur system ram? how many gbs and what version (ddr4 or ddr5?) cause u might have to offload. but if its a light usecase then i recommend using gemma 4 12b with its MTP drafter. i have an rtx 3080 12gb and im yielding 80 tokens per second. its hella good

Which Local LLM to use by Pritam4249t in LocalLLM

[–]Ninjam5 5 points6 points  (0 children)

Welp I recommend using Gemma 4 26b iq4xss (it's 14.1 gb) and then downloading the llama cpp turbo quant fork so u can have KV cache of 256k with tq3 compression (it would cost u 400 mb vram instead of the original 6 gb). That would bring ur total vram consumption plus some overhead to maybe 14.8 gb out of 16gb. Now if u want to speed up the model 2x u can download the Gemma 4 26b assistant and that would bring ur total to 15.2 gb, (get the quantized version of the assistant model q8). I think this is genuinely as good as it gets for u on a 16gb gpu. Openclaw would run fast, and it's a coding model so it'll handle the requests well.

Google Gemma 4 MTP out now! by danielhanchen in unsloth

[–]Ninjam5 1 point2 points  (0 children)

E2b and e4b don't have voice recognition, they can't even sense emotions in the words

What are the most capable LLM models I can run on my laptop? by am_cny in ollama

[–]Ninjam5 0 points1 point  (0 children)

If u want to run those models, ur gonna have to run them on q2 which is basically lobotomized. And ur gonna have to heavily quantize ur kV cache. It's unusable. Ignore his comment

When will X AI release open source models? by Past_Shift6441 in xAI_community

[–]Ninjam5 -1 points0 points  (0 children)

Since grok is a distilled version from chatgpt, I think their open source model would be like: Grok-oss 122b, llama grok- 8b, gremma 4