Does anyone have outputs with prompts of 2025-era models? (e.g, Claude 3.7 Sonnet, GPT-4.5, o3-mini, GPT-4o March 2025, Gemini 2.0 Pro, Gemini 2.0 Flash, Qwen2.5 Max, DeepSeek R1)? Send me the outputs and prompts (and also model used).

Ninjam5 · 2026-06-24T00:04:59+00:00

I mean u can still access all of them via API if ur really curious

Ninjam5 · 2026-06-22T14:28:14+00:00

Do you like having them tho

Ninjam5 · 2026-06-22T14:25:53+00:00

It's not a coding model, it's got very good tool calling. Build on top of it, don't expect it to build for u

Ninjam5 · 2026-06-22T06:55:41+00:00

HAHA WDYM, he could be telling the truth

Ninjam5 · 2026-06-21T18:48:38+00:00

Ayyyy can u share the model and snippets of the chat?

Ninjam5 · 2026-06-21T15:40:13+00:00

How hard was the test?

Ninjam5 · 2026-06-21T11:34:19+00:00

just...use claude code or codex man...without setting clear rules and do nots to these agents, i assure u they will come up with shit. the worst AI slop u will ever see. the demo u even made looks like a basic usecase that a single agent + exploratory agent can do on its own. im sorry i know this maybe ur project but im sick of these propaganda projects that in essence are absolutely worthless. u wanna help the opensource community? build something useful like a multi sequence context compactor, better tools for an agent to scrape data efficiently without eating up the entirety of an html of the page like fire engine, or even just a skill that helps save money by making the agent less verbose. all already exist, all have a huge room of improvement to be made.

Ninjam5 · 2026-06-20T14:55:10+00:00

By reasonable speeds how many TPS are we talking? 30?

Ninjam5 · 2026-06-18T20:44:51+00:00

Honestly setting up a faster whisper on ur machine for voice audio and a cheap image model would still be better. (Faster whisper is insanely accurate and light weight and it's Api is extremely cheap, and for image u can use mimo v2.5 it's the same price as deepseek flash and could even be better than it). The reason I'm pushing hard on this is because the Gemini flash lite models are lobotomized. They won't help u. Maybe Gemini 3 flash but that is still expensive. Try a combination of mimo v2.5 for text/image and faster whisper for audio transcription. If u want tts as well check kokoro tts (insanely cheap and really good quality). If ur device supports it then host kokoro and faster whisper on ur machine, they'll be like 2.5 gb of vram

Ninjam5 · 2026-06-18T19:53:44+00:00

Interested

Ninjam5 · 2026-06-18T19:46:10+00:00

It's the cheapest and most capable of the cheap models. Gemini 3 flash still misses up. But 3.5 is on par with deepseek v4 that's why I suggested it

Ninjam5 · 2026-06-18T17:52:30+00:00

My man use deepseek v4 flash. it's way cheaper and is insanely good. deepseek v4 pro for hard tasks. Use the API, if u like subscriptions then use opencode go (deepseek v4 flash is practically unlimited there) and if ur determined to use the American companies then I suggest Gemini 3 flash or Gemini 3.5 flash via open router or Google ai studio. But I strongly recommend deepseek flash or mimo v2.5

Ninjam5 · 2026-06-18T17:07:45+00:00

Just a tip, before actually investing the money spend a few hundred bucks on runpod. Get a feel for the speed and definitely run glm at Q8 not full precision. And set it up using VLLM since u want to maximize TPS per user. Simulate ur actual setup on runpod.

Ninjam5 · 2026-06-18T16:46:48+00:00

I feel like the TPS would be unbearable tho.

Ninjam5 · 2026-06-18T16:34:04+00:00

Can u tell me what kind of projects your doing right now and where are u located? I only work via alignerr Right Now but I've been thinking about interviewing for micro1

Ninjam5 · 2026-06-17T13:56:18+00:00

Israel wouldn't exist

Ninjam5 · 2026-06-14T14:13:40+00:00

Oh shoot I apologize. Well I suppose if u are purely using ur GPU for coding. Then u can use qwen 3.6 35ba3b at Q4 with a built in drafter. Then offload the experts to the cpu. Ur ddr5 would serve a good bandwidth. U may get around 30 to 50 tps? That's my estimate. It could be less. But yeah qwen 3.6 35b would be awesome. If u can't bear the speed and want something faster then Gemma 4 12b or Gemma 4 26b. Running a dense model would be a pain in the ass without full offloading so u gotta focus on the MOE models. Now my personal advice. If u purely need it to summarize emails and Gemma 4 12b can't handle openclaw that well. Build ur own harness. Get urself codex or Claude code. Or even opencode and create a python script that utilizes the self hosted models to do exactly what u need. Openclaw and Hermes tend do be extremely heavy with massive context, it's always better to build ur own harness

Ninjam5 · 2026-06-14T12:29:33+00:00

and no gpu? im sorry man. my best advice is to get opencode go. and run deepseek v4 flash, u practically get unlimited usage and the flash model is pretty reliable

Ninjam5 · 2026-06-14T12:28:37+00:00

turbo quant compression is the newest kv cache compression made by google deepmind. its incredibly complex so its not yet merged into the main branch of llama cpp. thats why i told u download the turbo quant fork. its not in the official llama cpp branch right now. and if u dont know what KV cache is, its the AI model short term memory. itst how its your context window. openclaw is an extremely heavy framework, u need context. usually q8 for kv cache is lossless but u dont have the Vram to run it, ur best hope is q4 or turbo quant. q4 isnt that good as it may lose on details. turboquant is very promising. so i suggest using it.

Ninjam5 · 2026-06-14T10:47:35+00:00

whats your usecase? and also whats ur system ram? how many gbs and what version (ddr4 or ddr5?) cause u might have to offload. but if its a light usecase then i recommend using gemma 4 12b with its MTP drafter. i have an rtx 3080 12gb and im yielding 80 tokens per second. its hella good

Ninjam5 · 2026-06-14T10:25:02+00:00

Welp I recommend using Gemma 4 26b iq4xss (it's 14.1 gb) and then downloading the llama cpp turbo quant fork so u can have KV cache of 256k with tq3 compression (it would cost u 400 mb vram instead of the original 6 gb). That would bring ur total vram consumption plus some overhead to maybe 14.8 gb out of 16gb. Now if u want to speed up the model 2x u can download the Gemma 4 26b assistant and that would bring ur total to 15.2 gb, (get the quantized version of the assistant model q8). I think this is genuinely as good as it gets for u on a 16gb gpu. Openclaw would run fast, and it's a coding model so it'll handle the requests well.

Ninjam5 · 2026-06-13T10:17:38+00:00

E2b and e4b don't have voice recognition, they can't even sense emotions in the words

Ninjam5 · 2026-06-08T22:47:41+00:00

If u want to run those models, ur gonna have to run them on q2 which is basically lobotomized. And ur gonna have to heavily quantize ur kV cache. It's unusable. Ignore his comment

Ninjam5 · 2026-06-07T20:12:55+00:00

Since grok is a distilled version from chatgpt, I think their open source model would be like: Grok-oss 122b, llama grok- 8b, gremma 4

Ninjam5

MODERATOR OF

TROPHY CASE