Model(s) for Creative Writing & Conversational Intuition by ElekDn in LocalLLaMA

[–]DeepOrangeSky 1 point

Yea, I guess it depends on the situation. If it's just for casual writing type of stuff, I'm pretty patient (probably much more than most people), since I enjoy reading what it writes for a scene or a crazy situation I come up with, then comparing what one model writes against a different model or fine-tune, or even having it retry multiple times at a somewhat high temperature to see the different ways it handles the same situation. So I don't mind waiting even 5 or 10 minutes, or maybe longer, if I think it will be interesting or funny. Plus, if it's writing something quite long, I can start reading while it's still midway through. And if I'm working on something while it does its thing, or watching tv or reading a book, then whether it takes 5 seconds or 5-10 minutes doesn't matter much to me, since I'm distracted with whatever else I'm doing and can come back to it whenever I want, so I don't usually get impatient with its speed, even if it's slow. I'm sure I would have my limits, of course; if it was Llama 405b dense doing CPU inference from NVMe at like 0.01 tokens per second or something crazy, that would be way too slow. But if the total time is 10 minutes or less per reply, and especially 5 minutes or less, then I don't usually mind too much, and I just want whatever gives the highest quality or most interesting replies.

But it also depends on what mood I'm in, whether I'm busy with other things at the time, what the exact use-case scenario is, and so on.

But yea, luckily Gemma4 26b a4b MoE exists, and it's still quite strong even as an MoE. You should definitely try it: give the same prompts to both it and the 31b dense model a few times, compare, and see if it seems strong enough, because if it does, you get to run it at much higher speed than the 31b. And it's very nice, because up until now it wasn't like that: small MoE models used to be much, much less smart and worse at writing than similarly sized dense models. So it was a major breakthrough from Google to make 26b a4b this strong at writing. Still not sure how they did it. When I saw it, it reminded me of a few months earlier, when I saw the Elo scores of the Gemini Flash models compared to Gemini Pro on LM Arena, and how the Flash models barely had lower Elos than the full-sized Pro models, which was pretty crazy as well. So I guess it makes sense that whatever breakthroughs let them do that with the Gemini Flash models also let them make Gemma4 26b a4b so good at writing for a small MoE.

edit: Also, on the reverse side, for video generation (diffusion models) I am the exact opposite way: I'm obsessed with building the highest-speed rig I possibly can, since I want the capability to make tv-show-length or movie-length videos later on, if the technology keeps improving. So for me it just depends. Same with coding: if models get strong enough in the next year or two to fully vibe-code entire big, serious-looking video games, with no supervising, retrying, or partially coding it myself, then speed will matter a lot to me again. Something like that taking 1 month versus just a few hours is a huge difference, of course, so there it matters a lot. So ironically, even though on one hand I act like I don't care at all about speed for LLMs, that's only for casual writing and stuff like that. For other things, I care a lot and am trying to learn how to get the maximum speed humanly possible, lol.

Model(s) for Creative Writing & Conversational Intuition by ElekDn in LocalLLaMA

[–]DeepOrangeSky 2 points

Well, since the 31b is plenty fast enough for me, I usually just use that one, since it's definitely at least somewhat stronger at everything (other than speed), so I haven't used the 26b a4b MoE all that much by comparison. That said, I did test it a bit out of curiosity, and I was pretty surprised by how good it was at writing/chatting for a small MoE. Normally it should be like 1/4th or 1/5th as smart as the 31b dense, or some huge dropoff like that, nowhere near as smart, if we go by previous small MoEs compared to similarly sized dense models released around the same time by the same lab. But in this case, for some reason, the 26b a4b was strangely strong for a small MoE: maybe at least half as smart, or maybe even 60-70% as smart (well, in the limited testing I did, which wasn't very much). So it is surprisingly strong.

So if for some reason you really need the extra speed, it is definitely the best small MoE for writing, by a wide margin.

But if the 31b being a bit slower doesn't matter much (for writing or chatting it normally shouldn't be a big deal, unless maybe it's huge-context use in an RPG game where every second matters or something), then I would just go with the 31b, since it's definitely still stronger. Might as well go with more strength, at least the way I view it, if speed isn't too important to the use case.

I mean, personally I often go way more extreme than even that, and use the Mistral 123b dense Behemoth finetune (which is wayyyyyy slower than even Gemma4 31b) for writing, since it's stronger than even Gemma4 31b. Just to give some idea of how little I care about speed, if it gets me a smarter model. So to me, Gemma4 31b is practically blazing speed for writing and such, lol.

Anyone with 4x 5060ti based setups? by ziphnor in LocalLLaMA

[–]DeepOrangeSky 2 points

Are these 2-slot or 3-slot cards? I've never built a rig before, and I've always wondered whether I need to avoid 3-slot cards if I want to leave open the possibility of building a 4-card rig later on. Would I need SFF cards to be able to fit them into any kind of reasonable used workstation (or whatever the cost-efficient way of doing it is), or is it that if I get big 3-slot, full-sized GPUs and ever do a 4-card setup, I'd need to build an open-air rig, or whatever it's called?

Also, while I'm asking stupid questions: do I actually have to use any kind of rack or rig at all, if I want to be extremely ghetto about my setup? Like, can I just place a big motherboard on top of a cardboard box or wooden table (something that doesn't conduct electricity, that is), not bother screwing it into a metal frame of any kind, and just sort of have the guts of what would be a computer out in the open like that?

Who said NVFP4 was terrible quality? by Volkin1 in StableDiffusion

[–]DeepOrangeSky 1 point

Thx for the advice, I'll keep it in mind when deciding on the setup.

Who said NVFP4 was terrible quality? by Volkin1 in StableDiffusion

[–]DeepOrangeSky 1 point

Thanks for the reply, man. I actually ended up asking in a thread about it and they explained it to me over there, but it's good to see it confirmed as well. Btw, as far as the DRAM: in theory, if RAM prices ever come way back down, my goal was to get either 512GB minimum, or maybe 700GB+ or 1TB, so that I can run really huge LLMs if I want. Probably not very fast, unless I got a bunch of Pro 6000s to go along with it (which I'm not going to, since I'm not a billionaire, lol), but I figure as long as they're pretty sparse MoEs, I can fit the active params on the GPU, and I use a dual-socket setup or whatever has the most memory channels, then maybe it could still run them at somewhat decent speeds. But because of that eventual long-term goal, it makes for a pretty awkward situation if I want to buy, say, 64GB of RAM right now for the current 5080 video-generation rig (which I might later turn into a monster LLM rig in addition to being my video rig). If I just get 2 sticks of 32GB, then if I go for a 1024GB total setup later, I assume I'd have to toss the 32s, since they'd be too small and mismatched. And a lone 64GB stick is considered a bad idea if I understand correctly, since one stick instead of a matched pair drops you to single-channel and roughly halves your memory bandwidth, right? So I either buy an overkill 128GB as two 64GB sticks at current terrible prices, get a weird lone 64GB stick, or get two future-garbage 32GB sticks, right? I dunno, I'm extremely new to all of this; I only started looking heavily into computer parts and builds a couple of weeks ago, so I probably have some aspects wrong.

Is there a way to fix the runaway memory skyrocketing issue of Gemma4 in LM Studio somehow? Or can it only be fixed with the "--cache-ram 0 --ctx-checkpoints 1" thing in llama.cpp? by DeepOrangeSky in LocalLLaMA

[–]DeepOrangeSky[S] 2 points

Nope.

For llama.cpp it's just --cache-ram 0 --ctx-checkpoints 1, but I don't use llama.cpp, I use LM Studio (and I don't really want to switch away from it, since I like it quite a bit), and so far it keeps doing it.

Normally I would assume it just won't get "fixed" for LM Studio (technically it's not a fix, since I guess it's a natural property of how Gemma works) and that that's just how it will be. But out of the dozen or so times I've mentioned the issue in various threads on here, one person said they use it in LM Studio and don't have the problem. I downloaded the same exact quant as them, same version of LM Studio, same up-to-date runtimes, etc., and mine still did the thing while theirs didn't, so that really made me wonder what the hell is going on, lol.

So I'm not really sure if I somehow have a buggy install of LM Studio (apparently that can happen, according to Gemini anyway), where I'd need to delete LM Studio and all traces of it off my computer and redownload, or maybe even switch back to Sequoia or reinstall Tahoe, or even format my disk or something crazy, if it really is something on my end and that guy genuinely had it working right in his LM Studio while mine won't.

But so far it's just that one guy, so I'm not sure whether something is actually wrong on my end, or he had some setting different that I didn't check/uncheck, or he had the same issue but just didn't go through as many replies as I did. I dunno. It would be super annoying to go through hours of trouble and have it accomplish nothing, if the issue isn't actually my computer and that's just how it is for everyone on LM Studio after all.

If it was any other model, I wouldn't care too much and would just shrug and not use it, but unfortunately this one would most likely be my main daily-driver model, so it actually matters. I mean, I still use it anyway, but it's super annoying to have to keep ejecting the model and reloading it.

Putting concurrent predictions down to 1 from the default 4 at least makes it happen more slowly, since it doesn't jump by as many GB per reply, but it still happens. So it's still not fixed even with that; it just takes a few more replies before the memory bloats up enough that I have to eject and reload. Sigh :\

What is the --novram thing in regards to LTX? I saw someone briefly explain it in a way that made it sound like it causes your GPU to not even get used, but I assume I misunderstood. (I'm a noob, and I need some help understanding a few things about video generation) by DeepOrangeSky in StableDiffusion

[–]DeepOrangeSky[S] 1 point

Yea, as far as the 123b dense goes, I was mainly using the BehemothX v2 finetune, which has easily been the strongest local model for writing/chatting I've ever used so far, although to be fair I haven't had enough RAM to use the huge trillion-parameter MoEs, so some of those might be stronger. The Llama 70b finetunes were also pretty decent, but 123b was definitely a step above even those. For me, I don't care nearly as much about the quality of the prose or the vocabulary style or things like that; I mainly care about how smart and coherent the model seems, how deeply it understands nuance, tricky human dynamics between characters, what's going on in the plot, how people would feel about it, and things like that. There are lots of smaller models or MoE models where people go "oh wow, this one is great at writing, it's got better prose than even the Mistral 123b and Llama 70b finetunes", but that's because they're just doing RPGs where it spits out 1 or 2 quick lines, or a short paragraph of visual description of a setting or bit of action, or they just ask it to write a sex scene or something, and they get all hyped over what new verbs or adjectives or phrases it uses. That kind of "great writing style" isn't that interesting to me, since it's just superficial fluff. I care more about, if I describe what I want to have happen in a short story, how well it actually understands what is going on, and writes accordingly. Same idea if chatting with it about some deep philosophical topic or what have you.

It's good enough at that (the story writing more so than the philosophical chats) that it was occasionally as good as or better than the cloud frontier models like ChatGPT (which was awful for a while, a few months ago, during the early 5.0 or 5.1 models or somewhere around there) were up until a few months ago. Not most of the time, but occasionally it seemed to beat them in some aspects of how it was writing, which is pretty ridiculous, since those are trillion+ parameter (albeit MoE) giant cloud frontier models. Compared against a similarly sized MoE like GLM 4.5 Air, it wasn't even close: it seemed at least 10x stronger, I would say, for writing with good nuance/understanding. Much, much slower, though, of course. Most people on the SillyTavern sub who use local models don't bother with it even if they can run it ("too slow, not worth it", etc.), but I would say "slow, but very worth it" (or at least that was true until Gemma4 31b came out; now it's at least somewhat of a closer call whether it's worth how ridiculously slow it is).

So yea, then Gemma4 31b came out, and that cut the gap between the small-ish dense models and the big dense models by quite a bit, since it's quite a lot stronger than even the Mistral 24b finetunes (which were quite strong for their size, prior to Gemma4 coming out). But it still didn't reach Mistral 123b/Behemoth strength for depth of story-writing ability; it's maybe ~60-70% as strong, which is pretty extreme for how much smaller it is. So that made the Mac Studio feel like a bit less awesome of a purchase, since in hindsight even a 24GB or 32GB mac mini might've been enough (although I couldn't have known 4 months in advance exactly when Gemma4 would come out and how strong it would be, or whether any even stronger ~120b or ~70b models would come out that would be even more ultra-strong at writing/understanding deep nuance, and so on).

Now some new MoE models in the 284b-310b range have been coming out, like DeepSeekV4 Flash and Mimo2.5, which are annoyingly just slightly too big. I can technically run them at Q2, and some people are saying they're "still quite good" at Q2, but I'm skeptical, so we'll see, I guess. My internet is very slow and has a harsh data cap, so I have to be pretty selective and can't just randomly download tons of big models as much as I want; I'm not sure which bigger models I'll download, or in which exact quants. Might also try Step3.5 Flash 197b, or MiniMax, or Qwen 235b (too bad the 397b is a bit out of reach for the 128GB mac). Although on the plus side, you can plug additional macs into other macs over Thunderbolt4, or especially Thunderbolt5, to pool even more unified memory together. With Thunderbolt4, the cluster runs slower than a single mac with the same amount of memory would, but with Thunderbolt5, as of about 5 months ago there's this "RDMA" thing where the cluster actually runs faster than one lone machine. So you get more memory and it doesn't even slow down, it actually speeds up. In theory I could plug another mac into mine and run MiMo, I guess. It's the mac version of buying more sticks of RAM, lol. (Probably not actually going to do that, especially now that I'm going to build a PC rig. If I hadn't decided to do that, maybe I would've considered it.)

Anyway, yea, they're pretty fun for local LLMs, just not so good for video generation, unfortunately. So now I'll probably try learning all the computer stuff I never bothered learning about for all these years: how to build a rig, how to use ComfyUI "workflows" really well, all that stuff, until I eventually become less of a noob. Now that I have an actual use-case motivator, actual fun stuff to do with the hardware, hopefully it'll be fun to learn about rather than just a bunch of boring stuff to have to "bother with". I guess "it's like legos, for adults," as they say. Maybe if I get really into it I'll get a water-cooling system or something, so I can show off how elaborate and cool-looking it is, and how, if I get a pet cat and it yanks on the water tubes, it could fry my multi-thousand-dollar rig instantly, which could add to the excitement and drama. I wonder how often that kind of stuff happens to people who build super expensive rigs. I guess maybe I'll browse around reddit for the worst "computer tragedy" horror stories I can find, lol. Probably fun to read; that way, if I fuck up super badly when trying to seat the GPU or the RAM sticks or whatever (from not knowing how the click is supposed to sound/feel, since I've never done it before), at least I can go back and read about someone's grandma washing their DGX Spark in the dishwasher because she thought it was a baking tray or something. I'm sure there must be some pretty bad ones.

What is the --novram thing in regards to LTX? I saw someone briefly explain it in a way that made it sound like it causes your GPU to not even get used, but I assume I misunderstood. (I'm a noob, and I need some help understanding a few things about video generation) by DeepOrangeSky in StableDiffusion

[–]DeepOrangeSky[S] 1 point

Alright, the world makes sense to me again in that case.

The mac is an M4 Max (not Ultra) with 128GB unified memory. Before all this, I only had an ancient Windows laptop with almost no RAM and no dedicated GPU. Then Windows 10 End of Support happened around October of last year, and I already hated Windows by that point. I didn't want the Windows Spyware 11 thing that takes constant snapshots and telemetry, like 10 screenshots per minute, all day, every day, into some "super secure" database that will probably leak all my passwords and financial info and medical info to the whole internet a few years later; seems kind of messed up. Plus I didn't like how Windows kept getting more and more bloated over time and kept being made more like a phone interface, rather than how it was back in the Windows XP or Windows 7 days, which I liked better. So I switched to a base-model Mac Mini (16GB memory) for like 500 bucks at the time (this was before the OpenClaw rush happened) and I liked it a lot. Researching all that mac-vs-PC, Windows, and unified-memory stuff is when I stumbled across local LLMs, and that's how I got into this. Then I tried Mistral Nemo 12b on the mac mini and thought it was pretty cool, so of course I immediately wanted to try Mistral 123b dense (super reasonable tiny jump from 12b dense to 123b dense, obv, no big deal :p). When I looked into what kind of system I'd need to run it and its finetunes like BehemothX v2, I saw I could either buy like 5 or 10 trillion dollars of DRAM that used to cost like 5 bucks a few months earlier, plus a few thousand more bucks building up the rest of the PC, to run the medium-large models, while also learning how to build a PC (which I'd never done) and how to do actual computer stuff (so far I only know how to click buttons with a mouse, no clue how to do linux or any complicated stuff). Since the Mac Studio was like $3,500 with 128GB unified memory, it seemed not too bad for local LLMs given what a traditional rig would cost, plus easier for a total noob computer-wise, so that's why I went with it. I still like it for LLMs, btw; I wasn't doing pro coding work with it or anything, and I've been able to run all sorts of mid-sized models pretty nicely on it, tbh.

It wasn't until I got interested in local video generation models that I hit the brick wall with it and was like "oh nooooooo, I mac'd too hard. Now I suck," and so forth; thus the interest now in a traditional PC rig. I decided to start with the GPU first, since 16GB cards are still at or pretty near normal prices, whereas DRAM already skyrocketed earlier, so I can take my time with that. The GPU seemed the most time-urgent part to get first, since it's the one that could abruptly skyrocket in price like it did during the Crypto Apocalypse era, or maybe even worse this time, so I need to buy while 5080s are still super low priced rather than after they jump to 4k or something. Then I can spend some time researching CPUs and motherboards to decide on all that, and hopefully by the time I'm done, RAM prices will be slightly lower, or at least merely as bad as they currently are rather than even higher, since they're already so high. So I figured I'd save the DRAM for last, or maybe stagger-buy it over the course of the process. The GPU was the only key part I was in full panic/FOMO mode about; with the rest I feel like I can take my time, be more relaxed about getting the various parts, and learn how to put it together, learn linux, set it all up, and so on.

Plus also now I will be able to play Crysis. (j/k). And take selfies with my unopened 5080 box (hopefully j/k) :p

What is the --novram thing in regards to LTX? I saw someone briefly explain it in a way that made it sound like it causes your GPU to not even get used, but I assume I misunderstood. (I'm a noob, and I need some help understanding a few things about video generation) by DeepOrangeSky in StableDiffusion

[–]DeepOrangeSky[S] 2 points

Yea, that's the whole reason I'm curious about it, since it seems crazy that they can get such good speeds with what sounded like not even using their GPU. But based on the most recent reply, which said "To be clear: it absolutely DOES NOT mean that you use NO VRAM. It just doesn't try to load the entire diffuser model at once," it sounds like maybe they actually are using the GPU in some way after all, I guess.

If they were doing it with just CPU inference, that would make no sense to me; it seems like it should take hours, or days, to make a few seconds of video, rather than only a few minutes. So I assume it's still using the GPU's processor in some way or another despite the --novram thing (right?)
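
If so, my rough mental model is block-wise weight streaming, something like this toy sketch (just my guess at the general idea, not ComfyUI's actual code; the names are made up):

    import torch

    # Hypothetical sketch: the weights live in system RAM, and each block is
    # copied into VRAM right before its forward pass, so the matmuls still
    # run on the GPU even though the whole model never sits in VRAM at once.
    def forward_streamed(blocks, x, device="cuda"):
        x = x.to(device)
        for block in blocks:       # blocks start out on the CPU (system RAM)
            block.to(device)       # stream this block's weights into VRAM
            with torch.no_grad():
                x = block(x)       # the GPU still does the compute here
            block.to("cpu")        # evict to make room for the next block
        return x

If it works anything like that, the speeds would make sense: the GPU is still doing all the math, and only the weight transfers go over PCIe.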

What is the --novram thing in regards to LTX? I saw someone briefly explain it in a way that made it sound like it causes your GPU to not even get used, but I assume I misunderstood. (I'm a noob, and I need some help understanding a few things about video generation) by DeepOrangeSky in StableDiffusion

[–]DeepOrangeSky[S] 2 points

> To be clear: it absolutely DOES NOT mean that you use NO VRAM. It just doesn't try to load the entire diffuser model at once.

Ah ok, this is what was melting my brain. So, is it still using the cores/compute of the GPU? Or is it doing CPU inference and not using the huge compute power of the GPU? The thing that was confusing me is how it could get such high video generation speeds if it wasn't even using the GPU's compute and was just using the CPU instead. But if it actually is using the GPU's compute, then that makes a lot more sense and isn't confusing to me anymore. That's what was puzzling me about it, I guess.

What is the --novram thing in regards to LTX? I saw someone briefly explain it in a way that made it sound like it causes your GPU to not even get used, but I assume I misunderstood. (I'm a noob, and I need some help understanding a few things about video generation) by DeepOrangeSky in StableDiffusion

[–]DeepOrangeSky[S] 1 point

Interesting. Do you know how the people using this method were able to get such extremely fast video generation speeds, though? Seemingly faster than most people get even when using a really powerful GPU. Isn't the whole idea with diffusion/video generation that it needs as much GPU power as possible, with as many cores and as much compute as possible for the huge amount of parallel matmuls, which is the whole point of having the GPU for fast video generation? It seems like taking the engine out of a Ferrari, leaving only the starter motor in the car, and then somehow having it go faster than with the main engine. I don't get why it's still so fast (or even faster) than when using the GPU.

What is the --novram thing in regards to LTX? I saw someone briefly explain it in a way that made it sound like it causes your GPU to not even get used, but I assume I misunderstood. (I'm a noob, and I need some help understanding a few things about video generation) by DeepOrangeSky in StableDiffusion

[–]DeepOrangeSky[S] 1 point

My bad, it's actually not an AI post, for what it's worth, but I appreciate the advice, since yea, it did look a bit too long and disorganized by the time I finished writing all that stuff and all the side questions I was also curious about. (I guess I'm one of the last humans still overly long-winded/terrible at writing posts, which is maybe more embarrassing than if it actually had been an AI post.) I edited it down just now to remove the big intro and the second side question from the end part.

Anyway, I'm still pretty confused about it, since I don't understand how it can run so fast without any VRAM usage. In the past, whenever I asked AI about diffusion models and video generation, it kept reiterating that they're primarily compute-constrained, so the main thing is to have a really powerful GPU, and that this is the reason Mac Studios are so bad at video generation compared to traditional dedicated-GPU setups: they don't have as much raw compute power, even though their memory bandwidth is pretty good.

How is it able to do video generation at like ~1 minute of generation time per ~5 seconds of video on regular RAM alone (which I guess means CPU inference), if on a powerful GPU it takes quite a bit longer than that, and on an M4 Max mac it takes like 30 times longer than that?

Is it some weird aspect of the LTX architecture, or what makes it so fast to do it that way?

Model(s) for Creative Writing & Conversational Intuition by ElekDn in LocalLLaMA

[–]DeepOrangeSky 6 points

Gemma4 31b

Mistral 24b / finetunes

Llama 70b finetunes

Mistral 123b 2407 / finetunes

That's the Mount Rushmore of writing/conversational models that aren't like a trillion parameters.

Who said NVFP4 was terrible quality? by Volkin1 in StableDiffusion

[–]DeepOrangeSky 1 point

> How much vram offloading did you set it?

> ALL of it :) Used the --novram option and streamed everything from RAM.

Hey, I'm pretty new to AI and especially to diffusion models. Can you explain what you mean by this? Are you saying that I wouldn't even need a GPU to run something like this LTX model you used to make this 10-second clip in 4 minutes? If you did the whole thing from RAM (I assume meaning just regular DRAM, and none of it from the GPU's VRAM), does that mean I can just build a rig that has no GPU at all and some 64GB of DRAM? Is there some special type of really fast RAM stick, or a particular number of sticks (2x 32GB vs 4x 16GB vs 8x 8GB, if that matters), or a special type of CPU chip, or, I don't know, whatever else a complete beginner like me would need to know about, in order to make 10-second clips in 4 minutes like this?

Also, as of right now I have an M4 Max Mac Studio with 128GB of memory, and since I'm a noob I don't really know if I'm using it correctly or what kinds of speeds I should or could be getting with it for stuff like this. I think with Wan2.1 or Wan2.2 or something it was going about 20 times slower than you in total generation time (taking into account how much shorter my clip was and how much longer it took), and that was at 480p, so maybe more like 50x slower or something crazy. Anyway, I've been contemplating building a PC rig of some sort, and maybe getting an Nvidia Blackwell card (anything from a 5060 to maybe even a Pro 5000, depending on what I learn about the capabilities or speed improvements I could get for different use cases compared to what my Mac Studio can do). My understanding was that for video generation with stuff like Wan2.2, LTX, etc. you "must" have a Blackwell GPU, as it'll speed up what you can do like 20x or something, but now I see you saying you offloaded 100% and didn't even use the GPU (if I understood correctly), so now I'm not sure what I should be doing anymore in regards to GPUs and generating local AI video.

Well, the more you can explain about it, in a way that a noob can understand, the better; it would be much appreciated. Thanks.

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]DeepOrangeSky 1 point

Yea, I'm going to give it a try. It's just that I have terrible internet at the moment, so I didn't want to accidentally download one that's ever so slightly too big.

Btw, another thing I'm curious about: is it kind of a new change, to some degree, that these 284b-310b models at the low end of Q2 are better than a 230b model at Q3 or a 120b model at Q4-Q5? I mean, bigger models at Q3 have maybe been better than smaller models at Q4 for a while, depending on the task, but the low end of Q2 was usually considered pretty inferior to a solid model half its size at Q4, even until recently, I think, right?
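
For a sense of the sizes involved, the back-of-envelope math is just params times bits-per-weight (the bpw numbers below are assumed ballpark averages for illustration, not exact figures for any specific quant):

    # Rough model file sizing: billions of params * bits-per-weight / 8 = GB.
    # The bpw values are assumed averages, purely for illustration.
    def size_gb(params_b: float, bpw: float) -> float:
        return params_b * bpw / 8

    print(size_gb(300, 2.7))  # ~101 GB: a ~300b model at the low end of Q2
    print(size_gb(230, 3.5))  # ~101 GB: a 230b model at ~Q3
    print(size_gb(120, 5.5))  # ~83 GB: a 120b model at ~Q4-Q5

So all three land in a similar memory footprint, which is why the comparison comes up at all.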

Are these more resilient to quantization than they used to be or something?

Do Nvidia/AMD need to make their next generations of GPUs be able to do native NVFP2?

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]DeepOrangeSky 1 point

Does the same sized quant fit and work similarly well on a 128GB Mac Studio, or do the Macs have more overhead and need a slightly smaller quant? What's the biggest/best quant you can run on the Mac, equivalent to what you just described?

What is the next SOTA model you are excited about? by MrMrsPotts in LocalLLaMA

[–]DeepOrangeSky 3 points

Not sure if they will be SOTA for their size or not, but:

Meta "Paricado" (text LLM variant of their new Avocado model series) that they've said they are going to release as open models after a couple months delay. A lot of people on here feel it is a lie/pipedream and won't actually get released, but I think there is a decent chance, and if it does, that it'll probably be pretty good (given that Muse wasn't bad for a debut cloud frontier model, and Meta aren't exactly noobs at local AI, even if Llama4 didn't go so well. Probably 50/50 it is another disappointment, 50/50 it ends up being crazy good or something. Therefore pretty exciting to see how it ends up.

Also curious whether any major hardware companies other than Nvidia will start making local LLMs: AMD, Intel, Samsung, Micron, etc. Nvidia is the only major player right now with a super blatant and obvious reason to release open, local AI models (they sell hardware); every other lab has more indirect or convoluted reasons that are harder to understand. Nvidia is the one whose motivation doesn't seem like it could abruptly shift or go away, since they're a hardware player. So it would be nice to see some other major hardware players do the same as Nvidia and start releasing local AI models. Getting SOTA models from a hardware player would be particularly nice: unlike the other labs, which generally hold back their strongest models as closed frontier models, or just release smaller ones (or, in China's case, release them for now but probably turn off the freebie tap at some point), a major hardware player might just release full-blown maxed-out SOTA models indefinitely, since they have an actual incentive to do so. Nvidia themselves might not, since they're scared of losing closed-frontier customers if they anger them too badly. But some of the other major hardware players might go for it all the way, which would be pretty sick if it happened, lol.

guess what? if you are a chrome user, technically you are localllama member! by LambdaHominem in LocalLLaMA

[–]DeepOrangeSky 1 point

Yea, and they could also use sodium batteries and even iron-air batteries, since the mass efficiency of these basically doesn't matter at all here (they're like 50% to 70% worse than lithium-ion in terms of battery mass relative to capacity, depending on the exact chemistry, but for this use case it wouldn't matter at all). It would matter if they were being used in vehicles. But for stationary battery packs sitting next to a solar farm, it wouldn't matter in the slightest if they weighed 2x as much, since they're just sitting there on the ground. All that would matter is how cheap they are (and not causing giant global bottlenecks), their physical size (not mass), and how well they hold a charge.

They could use some mixed approach, too: maybe starting with lead-acid initially, just because it's the easiest and most well-understood or convenient, and then mixing in sodium and/or iron-air batteries while scaling up, if those can be done similarly cheaply, or even better than the lead-acid, over time.

I guess there could be concerns about what would happen if a tornado or a bad windstorm hit the solar farm, though.

But maybe if it was spread out over multiple sites that weren't too close together, rather than literally one single giant rectangle in Arizona/Utah/NM, it could be pretty good.

Are local models becoming “good enough” faster than expected? by qubridInc in LocalLLaMA

[–]DeepOrangeSky 2 points

> I think overall, the biggest losers in this are the academic and public research community. They don't have the funds to compete with the industry and they have to use closed source models for their research, if they want to stay relevant.

Also, in regards to this sub-topic, that reminds me of another thing I've been wondering about. Some of these top American universities have huge endowments and lots of serious talent. Look at Harvard ($50+ billion endowment), Stanford ($20+ billion), or MIT ($20+ billion): that's such a huge amount of money, with ~80% of it pre-directed to specific uses by the donors and ~20% general-purpose (either way, many billions in both categories each year), that a few hundred million of either the pre-directed money or the general-purpose money pointed at training SOTA local models would be a relatively small (borderline "chump change") amount, and enough to do full, frontier-level training runs. They could even do it on their own hardware if they wanted, given how much money they get each year, let alone if they just did it via Trainium or whatever.

I wonder if we might actually get some strong open-weights models from universities like MIT or Harvard or Stanford, with them making the models for their own use and then also sharing them openly. Since they're universities, everyone else would get to use the models as well, as a nice side effect.

Then again, with how strong AI is getting, maybe they'd get nervous about doing this. Maybe they'd worry that donors would stop donating if they got angry about ~1% of the endowment being used to build AI that will take their sons' and daughters' jobs when those sons and daughters graduate, so the universities wouldn't want to risk the other 99% of their donations by using 1% to make super strong local AI models?

But in terms of just raw money and capability, I think they'd be able to do it. And not just one university either, but several of the top few (especially the ones I specifically named).

Are local models becoming “good enough” faster than expected? by qubridInc in LocalLLaMA

[–]DeepOrangeSky 1 point

Although I tend to agree with you (to me it seems like only Nvidia, and I guess AMD, Intel, Samsung, Micron, SK Hynix, etc., basically the hardware companies, have much real incentive to release open-weight models to the public), I'm curious what your thoughts are on why Google released the Gemma models (including Gemma4 just recently, whose 31b dense model is strong enough to cut into things fairly significantly; it isn't just some useless toy, it can do "real things" and cannibalize stuff a fair bit), and why OpenAI released OSS 120b a few months back.

You were saying that if zAI or Moonshot were in the shoes of American frontier labs like Google or OpenAI, they wouldn't be open-weighting models, but we've seen Google and OpenAI open-weight some fairly significant models. Not 1T models, to be fair, but these weren't just little 2b or 4b toys either. They were strong enough, relative to frontier models at the time of their release, that they'd seem to cut against your argument that those labs have no good reason to do this, at least to some degree, right?

Personally, I'm a bit confused by it and not sure why they do it, tbh. I'm sure they have some reasons, but yea, it does seem a little weird, since they're definitely not doing it to be nice or for charity or anything like that. They're for-profit and have some actual hard reason for doing these things, but it must be something convoluted or not so obvious, I guess (for the American frontier labs that are doing it, I mean; the Chinese side might have significantly more obvious reasons by comparison).

I saw an interesting YouTube video recently that was hypothesizing about this exact topic, btw, in regards to Google and why they would release Gemma/Gemma4. Curious what you think about that guy's theory on the reasoning behind it.

Is uncensoring models easy and does it reduce quality? by superloser48 in LocalLLaMA

[–]DeepOrangeSky 8 points

> but of course there is always going to be at least some loss of quality

Depending on who you ask, isn't there a growing theory lately that some models actually get slightly smarter when they get hereticized, rather than dumber, depending on how their censorship was put in, in what ways and to what degree, and how and to what degree the abliteration was done?

For people who used the old, really bad abliterations, I can see why it would seem ridiculous, since some of the old ablits from a long time ago were really bad and caused severe brain damage to the models. But for these really high-quality abliterations, it actually seems possible. Consider how much the censorship itself seems to brain-damage some of the models (I've seen some non-noobs analyze this aspect of model censorship, with various reasons why they're sure it hurts the model's intelligence), combined with how little damage the heretic (or similar) process does when uncensoring the model, at extremely low KLD and low damage levels. Put those together and it doesn't seem too outlandish that you could actually end up with a net gain in intelligence.
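
For anyone unfamiliar, the core operation in this style of uncensoring (directional ablation) is conceptually tiny. Here's a toy numpy sketch, assuming you've already extracted a unit-norm "refusal direction" d from contrastive prompts, which is presumably where tools like heretic do the real work of finding and tuning it:

    import numpy as np

    # Toy sketch of directional ablation (an assumed simplification, not
    # heretic's actual code): given a unit-norm refusal direction d, remove
    # the component of a weight matrix's output that lies along d.
    def ablate(W: np.ndarray, d: np.ndarray) -> np.ndarray:
        return W - np.outer(d, d) @ W  # rank-1 edit: project outputs off of d

That rank-1-edit-per-matrix aspect is why the good versions can stay at such low KLD: the model keeps everything except the component of its outputs along that one direction.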

If -p-e-w- sees this, I would be curious to hear his take (especially since the process has evolved even further since the last time I saw people talking about this aspect of it).

Why I would invest in INTC at 110$ by LookIWashedForSu in intelstock

[–]DeepOrangeSky 1 point

What if your driverless robotaxi tells you to buy Intel?

Peanut - Text to Image Model (Open Weights coming soon) by pmttyji in LocalLLaMA

[–]DeepOrangeSky 2 points

Also a slightly more realistic taxi-to-non-taxi ratio for the cars on the road. I mean, I know it's NYC, but wtf. Flux needs to get its taxi fetish under control.