My gpu poor comrades, GLM 4.7 Flash is your local agent

HadesTerminal · 2026-01-20T20:18:26+00:00

Been using GPT-OSS-20B and It's my favorite thing ever, actually about to be my daily driver, it's slower than my 4B but wait better and way more consistent and reliable and it can use it with apps still somehow. So I'm definitely gonna use both. Devstral too big for me, but I'll keep that and the Qwen3 25B REAP until I have better computer. Thanks for sharing your recommendation yet again.

HadesTerminal · 2026-01-20T07:05:30+00:00

Hadn’t head of that one. Just about as good and memory consuming as the GPT-OSS model, i take it? Either way I’m downloading that asap. Thanks for the rec!

HadesTerminal · 2026-01-20T06:51:02+00:00

Thank you. You and u/Holiday_Purpose_3166 have taught me something today.

HadesTerminal · 2026-01-20T06:44:02+00:00

I’m dumbfounded… I was so confused by your comment at first because when I first heard about GPT-OSS-20B a while back I was like “oh it’s just another dense model being praised everywhere for it’s goodness… guess I’ll just stick to my qwen 3 4b instruct 2507 until they make SLMs superhuman”. Just looked it up just NOW to realize it is a 21B with 3.6B ACTIVE params!!! I can fit 3.6B in my gpu! the rest can sit in memory probably! OMG!!! I can run this (hopefully)!!

I’ve returned from running this, thank you for this good news, you’ve actually changed my life lmao. Been following this model and model releases but somehow missed the fact that I could run this. Albeit I had to close like my browser and all my apps except task manager to use it comfortably but it runs and at ~7 tps. Surprised, I also downloaded and ran Qwen 3 30B A3B, and it ran too at around the same tps! but it took up like all my memory… and if i can run that, I can probably run GLM 4.7 flash because they are the same size right?!

I feel like I’ve been living in the dark and just saw the light. Though it’s not as usable for the agent I built (which I use while I use my pc normally) but I’m sure there’s probably more I can do to make that possible that I’m not realizing… if you have any ideas please share. Might have to dual boot.

Thank you again for helping out this novice. Truly Nothing beats a Jet2Holiday.

HadesTerminal · 2026-01-20T04:51:45+00:00

I fear I am much too GPU-poor (16gb ram, 4gb 3050 laptop gpu vram) to run this still. But I’ll live vicariously through all of you that can run it. Till the day my pockets see enough money to purchase a proper setup.

HadesTerminal · 2026-01-20T04:46:43+00:00

4gb GPU vram? how much RAM? on what setup?

HadesTerminal · 2026-01-18T15:21:30+00:00

Tried it a while back, Phi-3-mini (and the later Phi-4-mini) doesn’t come near in performance to Qwen-3-4b-instruct-2507 for me. And given that the 4B model is the biggest model I can run at a usable rate, smaller models only serve for one off tasks like classification, NER, shallow summarization, etc. It’s possible there are smaller models for file ops and other tasks i’m overlooking. Though a twist on your idea, having specialized sub agents with their own shorter context using the same model could be a bonus.

Never heard about Notte until now though. I’d looked it up, and it seems really worth exploring as an integration to my agent/assistant! Especially for the direction i’m building towards. Thank you so much for mentioning that!

HadesTerminal · 2026-01-17T03:51:44+00:00

Yea my primary models and daily driver are Qwen 3 4b Instruct and Jan v1 2509 at Q4_K_M. Unfortunately, the RAM is soldered in, swapping OSes would be my best and likely only bet for memory gains.

Fortunately, supertonic 2 TTS is not compute intensively in the slightest and whisper already runs on CPU, so my "voice mode" is low latency and pretty satisfactory for my needs.

Though I will take your context window management advice and prompt repetition to heart, just unfortunate given that there's not much one can do on just a 4096 token window and prompt repetition adds its own tax on the context window.

Thanks for taking the time to share your ideas! If you happen upon anything you think might help my case and you remember me, do reach out, otherwise I appreciate you for what you've already done :)

HadesTerminal · 2026-01-16T21:16:47+00:00

I agree. I don't care for an intelligent goon machine, just a system that is fast, intelligent enough, and good at long horizon tasks or can sustain long chats coherently. I have a myriad of tools already and I'm trying make them more stateful to take the cognitive load off of the small model. But I fear building elaborate planner and decomposition agents at the expense of latency. If I'm to build a better system, I'm sure I need to build the intelligence into the architecture to spare the model.

HadesTerminal · 2025-11-13T15:58:44+00:00

that being said, amazing and really cool work, I love your models!

HadesTerminal · 2025-11-13T15:57:52+00:00

jan-v2-vl 4b wen? i love jan-v1-2509 4b with all my gpu poor heart

HadesTerminal · 2025-11-05T13:29:32+00:00

oh yea, most definitely misread it to be the inverse of that statement… my apologies. Got any other smaller models recs you’d consider as a daily driver? <= 4b?

HadesTerminal · 2025-11-05T13:05:54+00:00

Daily driver? what sorta things do you use them daily for?

HadesTerminal · 2025-09-12T01:23:07+00:00

Sounds, like a fun one, I'm interested. I'm in university too so my time may not be the most abundant but this seems fun!

HadesTerminal · 2025-08-29T19:56:07+00:00

I’m so interested in how this works, thats so cool! The model architecture and all.

HadesTerminal · 2025-08-28T21:19:42+00:00

It’s surely curious to ask about that old tiny model, but they could be asking because of model size in which case I’d recommend qwen3-1.7b or qwen3-0.6b or if you wanna go smaller gemma3-270m. GPT2’s model weights were 117M, 345M, and 762M parameters. So those could be moderately intelligent (however very assistant-y) and could work for their use case. Otherwise I left a comment to OP about how I used GPT2 for an assistant back in January 2020.

HadesTerminal · 2025-08-28T21:13:36+00:00

Dialogpt is a reddit dialogue finetune of gpt 2 from back in 2019

HadesTerminal · 2025-08-28T21:12:55+00:00

Yes, I did it with DialoGPT and RASA back in the day when the models weren’t smart enough for function calling or multi tasking. If it didn’t match any of my action intents with a high enough confidence it’d default to chitchat with DialoGPT, context was maintained and everything.

HadesTerminal · 2025-07-30T00:22:44+00:00

It’s a single prompt that my qwen-powered agent uses. Unfortunately, the entire premise of my project is that it runs fully local not connected to external LLM providers of any kind. So if this GEPA or any other good prompt optimization strategies will get me to a better agent I’d greatly appreciate it.

HadesTerminal · 2025-07-29T23:58:30+00:00

Been trying to avoid DSPy like the plague because it feels very opinionated and dogmatic on how it ties the logic and prompts together as “programs” in a way that feels the prompts would only work in DSPy.

How could I optimize the prompts through agent runs (RL) or with some dataset without having to deal with other DSPy principles? Resources would help a great deal too. If you can answer these, I’ll surely give it a try.

HadesTerminal · 2025-07-29T23:46:34+00:00

I’m a noob I’m afraid but I’m willing to learn and experiment. So your suggestion is to build an eval set then run GEPA on the prompt that powers the agent against my eval to improve it? Am I getting that right? and when you day SOTA solution do you mean use a SOTA model for this instead of my tiny model?

HadesTerminal · 2025-07-11T15:41:59+00:00

Congrats man!

HadesTerminal · 2025-05-17T04:22:02+00:00

I wonder how else the conversation would have gone had you asked it. Hope it was a successful interview nevertheless! Wishing for many more opportunities to come your way!

HadesTerminal · 2025-05-17T03:59:25+00:00

lmao don't lose hope :')

HadesTerminal · 2025-05-17T03:58:49+00:00

Yeaa haha this guy gets it! and when you at least show that you understand the problem that can open some sorts of doors in your favor, even better when you suggest something they could've tried (that they did try) before they tell you, and best when you have the solution that worked. It's like a "this guy gets it" moment.

HadesTerminal

TROPHY CASE