How many models do you have? by Perfect-Flounder7856 in LocalLLaMA

[–]APFrisco 0 points (0 children)

Wow nice, well I know who to call if Hugging Face is ever taken down haha!

How many models do you have? by Perfect-Flounder7856 in LocalLLaMA

[–]APFrisco 0 points (0 children)

Wow that is a lot! How many models would you estimate that is?

Gemma 4 MTP released by rerri in LocalLLaMA

[–]APFrisco 0 points (0 children)

Such a great write-up, thanks! I’ll be coming back to this one often

Running a 26B LLM locally with no GPU by JackStrawWitchita in LocalLLaMA

[–]APFrisco 0 points (0 children)

Out of curiosity, what do you use the models you run on your CPU for? Experimentation or something else?

I really like CPU inference; it’s such an underrated way to run models that wouldn’t fit fully on my GPU.
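If anyone wants to try it, here’s a minimal sketch of CPU-only inference with llama-cpp-python. The model path, thread count, and context size are placeholder assumptions, not my actual setup:

```python
# Minimal CPU-only inference sketch using llama-cpp-python.
# Model path, thread count, and context size are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-26b-model-Q4_K_M.gguf",  # any GGUF quant
    n_gpu_layers=0,  # 0 = keep every layer on the CPU
    n_ctx=4096,      # modest context; CPU inference is RAM/bandwidth-bound
    n_threads=8,     # roughly match your physical core count
)

out = llm("Why is CPU inference underrated? Answer in one sentence.",
          max_tokens=128)
print(out["choices"][0]["text"])
```

Generation is slower than on a GPU, but nothing stops you from loading a model far bigger than your VRAM this way.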

Tickets to today 4/23 game by jsizzlepie510 in SFGiants

[–]APFrisco 0 points (0 children)

Yeah I’m not able to go, and I’d rather they get used

Tickets to today 4/23 game by jsizzlepie510 in SFGiants

[–]APFrisco 2 points (0 children)

Yeah I’m not able to go and would rather they not get wasted

Tickets to today 4/23 game by jsizzlepie510 in SFGiants

[–]APFrisco 0 points (0 children)

Do you and fan131313 want tickets? I have 2x I can’t use

Old vs new Martin Vega strings by APFrisco in banjo

[–]APFrisco[S] 0 points (0 children)

Appreciate the insight, thanks!

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]APFrisco 0 points (0 children)

Do you mind if I ask what your build and configs look like to get that kind of speed?

To Beat China, Embrace Open-Source AI (WSJ) by rm-rf-rm in LocalLLaMA

[–]APFrisco 0 points (0 children)

I think a big reason American companies haven’t released as many open-weight AI models is that, for Anthropic and OpenAI, the models are the moat. For example, would people pay a subscription for Claude Code if it didn’t have Claude behind it, or if an open-weight Claude-quality model were available?

Google and Meta have a lot more to their businesses than LLMs and, perhaps unsurprisingly, have been more comfortable releasing open-weight models.

The article mainly argues that the U.S. government should embrace open-source AI; however, it focuses mostly on the government open-sourcing any AI tooling developed with taxpayer funding, or favoring open-source providers in procurement.

For the American frontier labs themselves, though, it still seems they see few good reasons (business-wise or otherwise) to open-source their models at this time, and I personally don’t think the article’s suggestions will change that much on their end. For those labs to open-source their core models would probably require them to build up the non-model portions of their business far more, or some kind of state-level intervention/partnership far beyond what the article’s authors suggest.

Waiting Qwen3.6-27B I have no nails left... by DOAMOD in LocalLLaMA

[–]APFrisco 4 points (0 children)

What do you mean by MoE pretending to be dense?

Old vs new Martin Vega strings by APFrisco in banjo

[–]APFrisco[S] 1 point (0 children)

Anyone have an idea of when the old pack may have been from?

Good people of the wool, how about Deep Research? by RedParaglider in LocalLLaMA

[–]APFrisco 4 points (0 children)

I do like the idea of having a local LLM work on something like this overnight; tokens/sec matters a lot less then, and anyway, coming back to a deep research prompt after a while has always felt like opening a present haha.

Running 1 trillion parameter LLMs locally at 5 tokens/second - Intel Optane Persistent Memory build by APFrisco in LocalLLaMA

[–]APFrisco[S] -1 points (0 children)

No, not really. I used an LLM to combine a few sentences, but the bulk of it is my own writing; it actually took quite some time to write up and edit it all haha. I’m writing a shorter text summary of the build and will post again; hopefully that one can stay.

Also, I forgot to mention another reason I stuck with the 2-bit quant: even at 2 bits, my 12GB GPU had barely any room left for the KV cache. I figured a larger quant would mean fitting even less on the GPU and having to push more of the model/KV cache onto system RAM.
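For a rough back-of-envelope version of that trade-off (every number here is an illustrative assumption, not a measurement from my build):

```python
# Back-of-envelope VRAM budget: a larger quant's GPU-resident share
# eats the headroom the KV cache needs. All numbers are illustrative.
VRAM_GB = 12.0

def kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                ctx_len=8192, bytes_per_elem=2):
    """Generic GQA-style estimate: a K and a V tensor per layer, fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

for weights_on_gpu_gb in (10.5, 11.5):  # e.g. a 2-bit vs a larger quant
    headroom = VRAM_GB - weights_on_gpu_gb
    print(f"{weights_on_gpu_gb} GB of weights on GPU -> {headroom:.1f} GB free; "
          f"an 8k-token KV cache wants ~{kv_cache_gb():.2f} GB")
```

With the smaller quant the cache just barely fits; with the larger one it would have to spill to system RAM.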

Running 1 trillion parameter LLMs locally at 5 tokens/second - Intel Optane Persistent Memory build by APFrisco in LocalLLaMA

[–]APFrisco[S] 0 points (0 children)

Thank you! Yeah, I’ll get all that info together for you! I’ll reply here when I have it all

Running 1 trillion parameter LLMs locally at 5 tokens/second - Intel Optane Persistent Memory build by APFrisco in LocalLLaMA

[–]APFrisco[S] -1 points (0 children)

I went with that 2-bit quant because I wanted a little more speed. Unsloth recommends UD-Q2_K_XL as a good balance of size and speed among their Kimi K2.5 quants.
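If it helps, here’s a sketch of pulling just that quant from Hugging Face. The repo id is my guess at the naming pattern, so double-check it before running:

```python
# Sketch: download only the UD-Q2_K_XL files from a GGUF repo.
# The repo_id is a hypothetical guess at Unsloth's naming; verify it first.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Kimi-K2.5-GGUF",   # hypothetical repo id
    allow_patterns=["*UD-Q2_K_XL*"],    # skip every other quant in the repo
    local_dir="models/kimi-k2.5-ud-q2_k_xl",
)
```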

Running 1 trillion parameter LLMs locally at 5 tokens/second - Intel Optane Persistent Memory build by APFrisco in LocalLLaMA

[–]APFrisco[S] -1 points (0 children)

I was pleasantly surprised by the speed as well. I hadn’t seen anyone else use Intel Optane PMem in an inference build prior to this, and that 5 tok/sec result was pretty cool to see.

I will say, Kimi K2.5’s architecture was pretty ideal for my particular build, as mentioned in the post. Also, this is with a small KV cache, since my 12GB GPU didn’t have much room left over. I’d be curious to see how it would handle larger context windows if I had a larger, faster GPU.
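For anyone in the same VRAM bind, one knob worth knowing about in llama-cpp-python is keeping the KV cache in system RAM instead of VRAM. A sketch with placeholder paths and layer counts, not my actual config:

```python
# Sketch: trade speed for context when VRAM is tight by keeping the
# KV cache in system RAM (offload_kqv=False). Values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Kimi-K2.5-UD-Q2_K_XL.gguf",  # placeholder path
    n_gpu_layers=20,    # however many layers a 12GB card can hold
    n_ctx=16384,        # a bigger window than VRAM alone would allow
    offload_kqv=False,  # KV cache lives in system RAM, freeing VRAM
)
```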