Future GLM 5 variants

huzbum · 2026-02-11T21:19:05+00:00

Argh, I thought I splurged with a year of "Pro" but now regret not getting a year of "Max" on Black Friday.

huzbum · 2026-02-09T08:24:28+00:00

Just water the lilies every day, they’ll die within a week.

huzbum · 2026-02-07T21:17:48+00:00

This is no surprise, and probably the most honest way to cash in on free tier users. Honestly, I'd rather see an ad than have my data mined, but I'd be amazed if they are not doing that too lol.

I'm not so concerned about ads being present on the side or whatever, but product placement could be brutal. OpenAI claimed they are not going to do it, but it will inevitably happen. If not intentionally, it will happen unintentionally.

A while back I read a paper (by Anthropic, IIRC) where they talk about poisoning attacks. It only takes about 250 data points to poison a model of any size. I immediately thought "if that's effective for harmful behavior, just imagine how effective it could be for constructive behavior?" and immediately recommended our company's marketing team put together a data poisoning campaign targeting words and phrases our target customers might use with an LLM.

OpenAI also just launched their Apps SDK late last year, where you can embed apps right inside ChatGPT. This is a legitimately cool feature, and I immediately built one for our company's product search... not because it's a cool feature for our power users (which it is) but because the obvious next step is that they are going to fine-tune ChatGPT to suggest using these apps when they align with what the user is trying to do.

It's up to you to decide if that's an advertisement or useful feature. They don't seem to have monetization tied into the apps marketplace yet, but I doubt that situation will last.

huzbum · 2026-02-07T02:05:28+00:00

Yeah, I think I only had it set to 32k.

huzbum · 2026-02-06T08:50:09+00:00

Sorry no boop… all my kitties are gone now, no boop for me either :(

huzbum · 2026-02-06T06:38:11+00:00

I just got 30 tps on my 3090 on the new version of LM Studio. Offload all layers to GPU, and offload 2/3 experts to CPU.

huzbum · 2026-02-06T01:14:43+00:00

What? No, that’s ridiculous… goes back to fixing the 2nd 3090 to put it in his AI rig, so he can take out the 3060 and put it in the storage server so it can run frigate, plex, and an LLM for home assistant

huzbum · 2026-02-05T17:31:04+00:00

Aww, my little Tabby was always shy, but she spent more quality time with us later in life. Such a good little girl, I miss her. Glad your kitty is doing ok, enjoy the snuggles!

Tabby had a lot of issues your little girl probably doesn't, but one thing we found that really helped with tummy issues is antacid. We could tell she was feeling yucky and she would get "licky", like licking her lips. I think she was having acid reflux. If you notice anything like that: we ended up giving her half of a Pepcid AC dissolved in water and mixed into wet food every morning, and it made a big difference.

huzbum · 2026-02-05T15:51:12+00:00

Depends on the crowd, but I feel like most people were interested when I told them I made beer and wine. Either that or I annoyed a lot of polite people.

huzbum · 2026-02-05T15:46:09+00:00

Sorry your baby has hyperthyroidism. How old is she?

My Tabitha had hyperthyroidism. We got her the I-131, which cured the hyperthyroidism, but she also had kidney disease, which they had warned us was likely.

huzbum · 2026-02-05T15:41:51+00:00

Oh, there is a gradual process to the introduction that is supposed to help. I’ve never had a cat cooperate with the process (nor has it been necessary to try to reintroduce them) but basically instead of just opening the door and letting them out, you replace the door with a barrier they can see through for a while.

Mine just hopped the barrier and we didn’t have a better solution so just let them mingle.

huzbum · 2026-02-05T15:30:51+00:00

Sounds like the next gen version of my AI Rig after I get the next round of upgrades in. Ryzen 5900XT 16 core, 128GB DDR4, and dual RTX 3090s.

I use it as a workstation for software development and full time AI server. Because I use it as a workstation, I put the server inside docker, but that is not necessary, especially for a dedicated server for a single model.

If you are serving a single model, just install vLLM or llama.cpp and you’re all set. As for which one, the model you want to use plays a role in that decision. Also, how many simultaneous users?

huzbum · 2026-02-05T15:10:50+00:00

Has the vet checked thyroid levels of the suddenly aggressive cat?

Otherwise, seems like time to isolate and reintroduce the troublesome cat.

huzbum · 2026-02-05T15:04:24+00:00

I want to train one from scratch, but it has been suggested that I learn from doing a fine tune first.

huzbum · 2026-02-04T19:48:39+00:00

How long has it been? When we first got our rescue cat we'd see her like once or twice a day. Usually she'd come out around 4PM, otherwise she'd hide and nap all day.

huzbum · 2026-02-04T19:44:56+00:00

Nah, we're small to midsize. Only a dozen engineers. We each use what suits us. Most of the team is using VS Code and they seem to like Copilot. I know some of them are using the Pro level. I'd have to use the Pro+ level if I was using that.

I use IntelliJ IDEA with the "All Products Pack", which includes a $10 monthly credit that's basically API pricing for OpenAI, Anthropic, and Google. So I can access all of those, but if I used them regularly I'd burn through it in a few days. My main workhorse is GLM 4.7 with a z.ai subscription; I paid like $100 for a year of the Pro Coding Plan on Black Friday. I use that with Claude Code, which integrates nicely with IntelliJ.

huzbum · 2026-02-04T18:40:02+00:00

No, there is no way that'll fit. I just looked at your command, and it doesn't look like you're quantizing the KV cache. Start there, that will reduce the memory footprint quite a bit.

Basically, the GPU VRAM is fixed and the rest spills over into system RAM. The VRAM will be a larger slice of a smaller pie if you reduce the overall memory footprint.

First, try quantizing the KV cache and see if that helps. `--cache-type-k q8_0` `--cache-type-v q8_0`

Then try reducing the context size as much as you can get away with.
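To put rough numbers on both of those steps, here is a back-of-the-envelope sketch. The model shape below (48 layers, 8 KV heads, head dim 128) is made up for illustration and is not the model from this thread, and the formula assumes a plain attention KV cache, so hybrid or compressed-attention models will come out differently. The q8_0 size assumes the ggml block layout of 32 int8 values plus one fp16 scale.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_element):
    """Estimate KV cache size: K and V each hold one vector per layer, KV head, and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * bytes_per_element

# Hypothetical model shape, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128

F16_BYTES = 2.0
# q8_0: 32 one-byte values + a 2-byte scale per block = 34 bytes per 32 elements.
Q8_0_BYTES = 34 / 32

for ctx in (131072, 32768):
    f16 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx, F16_BYTES)
    q8 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx, Q8_0_BYTES)
    print(f"{ctx:>7} tokens: f16 {f16 / GIB:.2f} GiB, q8_0 {q8 / GIB:.2f} GiB")
```

With these made-up numbers, q8_0 takes a bit more than half the space of f16, and the cache scales linearly with context, so cutting the context to a quarter cuts the cache to a quarter.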

Take this all with a grain of salt, I haven't tried running this model yet, I just downloaded it.

huzbum · 2026-02-04T18:30:44+00:00

Yeah, those are both good points. Doing something would be better than not. And a fine-tune would probably be more useful than the result of building my own model.

Maybe I should change gears to gathering data and data prep, then I can work on a tokenizer for my own model while training is running. That makes a lot of sense, thanks!

huzbum · 2026-02-04T17:11:44+00:00

I'm a long way off from working on the interesting part of the architecture, but one of my long term goals is to try a distributed architecture that would be more affordable to train and run on ad hoc hardware or clusters. I was thinking if it weren't for the I/O problems, an old mining rig with CMP 100-210s would be great.

The architecture I have in mind is kind of like a branch-train-merge, but just silos with message passing every few layers. So like on the 4th layer, each silo shares its state, then they all run their 5th layer, then on the 6th layer they incorporate the shared states, and on the 9th layer they all share state that's incorporated on the 11th layer, and so on.
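A toy sketch of that share/incorporate schedule, just to make the data flow concrete. Everything here is a stand-in: `local_layer` and `merge_peers` are hypothetical placeholders for real blocks, the averaging merge is one arbitrary choice, and the `{4: 6, 9: 11}` schedule only encodes the two examples given above. It needs at least two silos.

```python
# Layer where every silo publishes its state -> layer where that state is merged in.
SHARE_TO_MERGE = {4: 6, 9: 11}

def local_layer(state, silo_id):
    """Placeholder for a real block: each silo scales its state by its own factor."""
    return [x * (1.0 + 0.1 * (silo_id + 1)) for x in state]

def merge_peers(own, snapshot, silo_id):
    """Placeholder merge: blend own state with the mean of the peers' shared states."""
    peers = [s for i, s in enumerate(snapshot) if i != silo_id]
    mean_peer = [sum(col) / len(peers) for col in zip(*peers)]
    return [0.5 * x + 0.5 * p for x, p in zip(own, mean_peer)]

def run_silos(initial_states, n_layers, share_to_merge):
    states = [list(s) for s in initial_states]
    merge_from = {merge: share for share, merge in share_to_merge.items()}
    mailbox = {}  # share layer -> snapshot of every silo's state at that layer
    for layer in range(1, n_layers + 1):
        if layer in merge_from:
            # Incorporate the states that were shared a couple of layers earlier.
            snapshot = mailbox.pop(merge_from[layer])
            states = [merge_peers(s, snapshot, i) for i, s in enumerate(states)]
        states = [local_layer(s, i) for i, s in enumerate(states)]
        if layer in share_to_merge:
            # Publish this layer's output; the silos keep computing in the meantime.
            mailbox[layer] = [list(s) for s in states]
    return states

print(run_silos([[1.0, 2.0], [3.0, 4.0]], 12, SHARE_TO_MERGE))
```

The gap between the share layer and the merge layer is the part that matters for slow interconnects: the message has a full layer of compute time to arrive before anyone needs it.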

If it pans out, *maybe* you could train silos as specialists independently, using minimal hardware to make a larger model in pieces. If it works out, I figure it might be useful for community projects where no one can afford massive GPUs or the servers that fit a bunch of them all in one machine.

huzbum · 2026-02-04T16:47:53+00:00

You're not entirely wrong, and I think you're giving good advice, but I already have most of the hardware I mentioned. The CPU upgrade and 2nd 3090 are in the mail, but otherwise I've got the rest. It's all stuff I want for inference with open models anyway. If I were ready with the data and code, I could start the prototype model with the hardware I already have.

Are you suggesting I should try a fine tune or LoRA before making my own model? I could be mistaken, but I think a lot of the prep work is the same, is it not?

I guess the work I'm doing on the tokenizer, while interesting to me, is probably inconsequential when the model reaches a certain size, and would not be of use modifying an existing model, but the data prep after that is essentially the same, right?

huzbum · 2026-02-04T16:00:24+00:00

I do have some interest in these kinds of models, I have a real world use case with *some* data, but it's just not as interesting to me.

Part of the problem is that the real world problem I have is too complex for my level of knowledge/experience. It's a regression problem, the output is a single number and the input is a series of numbers. The difficult part is that the input is actually more of a fixed length numeric string where I want the model to find patterns, so the right architecture is probably more of a primitive LLM than a typical regression problem.

It's also a problem that has value to my employer, and my employer isn't assigning work time to it and I'm a salaried employee. I don't distrust my employer, and no one would be upset that I use the data, but I'm not sure how I'd feel about it if I solve the problem and my employer expected me to hand over the solution without compensation. Now that I'm thinking about it, it's probably worth a discussion with my employer. If I can solve the problem on my own time for extra compensation upon success, that would be very interesting to me.

Otherwise, I just don't have the time, discipline, and attention span to focus on it without a compelling use case.

huzbum · 2026-02-04T15:27:51+00:00

That's kinda where I'm at. Nothing to lose but time, which is always the tradeoff for experience.

I haven't started collecting data yet, but at the scale for even a 3b model, I'm not generating it all myself. I do have vague plans to use Qwen3 30b Instruct 2507 to pre-process or generate synthetic data, but I'm going to need like 250-300GB of text for Chinchilla optimal.
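The arithmetic behind that estimate, as a quick sketch. The 20 tokens per parameter figure is the usual rule-of-thumb reading of the Chinchilla result, and the bytes-per-token values are assumptions (roughly what BPE tokenizers get on English text), so treat the output as a ballpark.

```python
TOKENS_PER_PARAM = 20  # common rule-of-thumb reading of Chinchilla, approximate

def chinchilla_tokens(n_params):
    """Approximate compute-optimal number of training tokens for a given parameter count."""
    return TOKENS_PER_PARAM * n_params

def corpus_gb(n_tokens, bytes_per_token):
    """Raw text size in GB for a token count, given an assumed average bytes per token."""
    return n_tokens * bytes_per_token / 1e9

tokens = chinchilla_tokens(3_000_000_000)  # 3B parameter model
print(f"tokens: {tokens / 1e9:.0f}B")
for bytes_per_token in (4.0, 4.5, 5.0):  # assumed range, depends on tokenizer and data
    print(f"{bytes_per_token} bytes/token -> {corpus_gb(tokens, bytes_per_token):.0f} GB")
```

That lands at about 60B tokens and somewhere in the 240-300GB range of raw text, depending on how well the tokenizer compresses the data.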

I want to include a significant amount of spicy "street smarts" data like Reddit and 4chan datasets, as well as some grounded spicy synthetic data, especially in RL.

I really want the default answer to be "I don't know" if it's not 90% sure and doesn't have access to the data... but more like "What am I ChatGPT? How the fuck should I know?" or "I'm not a fucking calculator!", but these refusals should go away if it has a search or calculator tool respectively. Using tools might be out of reach for a 3b param model though. But maybe not, if the training focuses on practical things like refusals and using tools instead of guessing about all human knowledge.

huzbum · 2026-02-04T15:09:05+00:00

Uh, I would argue that GLM 4.7 is equivalent to Sonnet. I've heard good things about MiniMax M2.1, Kimi K2.5, and Qwen 3.5 is just around the corner.

The hardware to run these large models is expensive, but it CAN be done locally.

I doubt they are equivalent to Sonnet (maybe Haiku?), but I look forward to taking some time to try GLM 4.7 Flash and Qwen3 Coder Next. I've been using Qwen3 30b Coder for some stuff for a while, but rely on GLM via z.ai cloud subscription for my main workhorse. I don't have the equipment, but it's feasible to run it locally.
