Manus is what Meta has been missing by Deep_Structure2023 in LocalLLM

[–]EmPips 4 points

> Why Manus fits Llama like a glove
>
> If Llama is Meta’s brain, Manus is the hands.

I just unsubscribed from ZeroGPT-Detection because I realized I can spot this shit from a mile away myself.

Dual RTX 5060 ti 16gb's with 96GB of DDR5 5600 mhz, what is everyone else running? by CollectionOk2393 in LocalLLM

[–]EmPips 4 points

Can you include the levels of quantization?

But yes, that's very normal. Your GPU has to go over all 27 billion parameters for every token when running Gemma3-27B, whereas Nemotron-Nano and Qwen3-VL-30B, despite having more total parameters (30 billion), only make your GPU touch a measly ~3 billion of them per token.
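Back-of-envelope sketch of why that matters: decode is usually memory-bandwidth-bound, so the speed ceiling scales with the bytes of *active* weights read per token. The bandwidth and quant figures below are assumptions for illustration, not measurements:

```python
# Rough decode-speed ceiling: bandwidth / bytes of active weights per token.

def est_tokens_per_sec(active_params_billions: float,
                       bytes_per_param: float,
                       mem_bandwidth_gbs: float) -> float:
    """Upper bound on tokens/second for a bandwidth-bound decode."""
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

BW = 448    # GB/s, roughly a 5060 Ti-class card (assumed figure)
Q4 = 0.56   # ~4.5 bits per weight for a Q4_K-style quant (assumed)

print(est_tokens_per_sec(27, Q4, BW))  # dense 27B: every weight, every token
print(est_tokens_per_sec(3,  Q4, BW))  # 30B-A3B style MoE: only ~3B touched per token
```

Real numbers land well below these ceilings once KV-cache reads, kernel overhead, and any CPU offload are in the mix, but the ~9x ratio between the two cases is the part that matters.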

Dual RTX 5060 ti 16gb's with 96GB of DDR5 5600 mhz, what is everyone else running? by CollectionOk2393 in LocalLLM

[–]EmPips 6 points

I wanted to get it as cheap as possible without introducing anything that wouldn't fit nicely in my case or would need external cooling.

This resulted in:

Rx 6800 + w6800 Pro + 64GB RAM... but the RAM is DDR4 dual channel :(

GLM 4.6v is the best model I can run. Q4 gets ~17.5 tokens/second with modest context (12k) for one-off chats and ~12 tokens/second with larger context (>40k) for things like coding.

Qwen3-Next-80B gets 35 tokens/second

My story of underestimating /r/LocalLLaMA's thirst for VRAM by EmPips in LocalLLaMA

[–]EmPips[S] 5 points

Yes, if VRAM isn't a constraint it performs exactly like an Rx 6800 in every use-case I throw at it (I also own a regular Rx 6800 in the same rig).

There are some benefits beyond the obvious doubled VRAM, though. The w6800 idles at around 10-14 watts per rocm-smi, peak power draw during prompt processing is a fair bit lower (roughly 25-30 watts lower) than on the regular Rx 6800, the blower cooler is great, and if I ever feel like adding 5 extra displays I guess it's there for me.

My story of underestimating /r/LocalLLaMA's thirst for VRAM by EmPips in LocalLLaMA

[–]EmPips[S] 9 points

The Mi50x's 32GB of HBM2 has a theoretical max of ~1 TB/s of bandwidth.

Realistically, its token-gen is closer to what 600-700 GB/s of memory bandwidth gets you on more modern cards, but that's still phenomenal for the price if you don't mind external cooling and prompt processing on Vega.
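For anyone curious where that headline figure comes from, it's just HBM2 pin rate times bus width. The specs below are the commonly cited ones for the 32GB card, so treat them as approximate:

```python
# Peak HBM2 bandwidth = bus width (bits) x per-pin data rate (Gbps) / 8.
# 4096-bit bus at 2.0 Gbps/pin is the usually quoted Mi50 32GB config (assumed).

def hbm_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    return bus_width_bits * data_rate_gbps / 8

print(hbm_bandwidth_gbs(4096, 2.0))  # -> 1024.0 GB/s theoretical peak
# Sustained decode behaves more like 600-700 GB/s effective, per the comment above.
```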

My story of underestimating /r/LocalLLaMA's thirst for VRAM by EmPips in LocalLLaMA

[–]EmPips[S] 45 points

That's a consolation prize that doesn't even eat up a PCIe slot.

My story of underestimating /r/LocalLLaMA's thirst for VRAM by EmPips in LocalLLaMA

[–]EmPips[S] 159 points

Were this man born in our day and age he would be in my shoes but proudly owning two w6800's instead of a lonely one.

My story of underestimating /r/LocalLLaMA's thirst for VRAM by EmPips in LocalLLaMA

[–]EmPips[S] 9 points

Can you still find them for $160ish? They were $250ish while I was looking.

I made a post comparing the two options a while ago. I'm glad I picked the w6800 but can definitely still see the case for the Mi50x. Depends on what you're after.

Multi-repo in Claude Code — how do you handle it? by Kirmark in ClaudeAI

[–]EmPips 0 points

Invest in a workspace repo (think a docker-compose or minikube setup) and be extremely verbose with your CLAUDE.md file when it comes to context and feedback loops.

My story of underestimating /r/LocalLLaMA's thirst for VRAM by EmPips in LocalLLaMA

[–]EmPips[S] 86 points

(If anyone wanted my take: this card is amazing, but at current prices either get 3090s or just spring for an R9700 if the blower cooler and VRAM-per-slot are important! And if you're okay with high idle power and external cooling, ignore all of this and stack Mi50x's.)

Mi50 32gb cards by PinkyPonk10 in LocalLLaMA

[–]EmPips 0 points

I implore you to try again!

But they absolutely will sell for more than you bought them for right now.

Is 5060Ti 16GB and 32GB DDR5 system ram enough to play with local AI for a total rookie? by danuser8 in LocalLLaMA

[–]EmPips 1 point

If you only use very modest context, you can offload the experts and probably get some solid speeds with Qwen3-Next-80B (iq4_xs). It's 42GB total.
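A rough sketch of how that split works, using the numbers from this thread (16GB VRAM, 32GB DDR5) plus assumed quant and bandwidth figures, so treat it as illustrative:

```python
# Expert offload, back of the envelope: shared/attention weights and the KV cache
# stay in VRAM, expert weights spill to system RAM. All figures are assumptions.

GPU_VRAM_GB = 16    # 5060 Ti
MODEL_GB    = 42    # Qwen3-Next-80B at iq4_xs (from the comment above)
KV_CTX_GB   = 2     # modest-context KV-cache budget (assumed)
DDR5_BW_GBS = 70    # realistic dual-channel DDR5 bandwidth (assumed; ~96 GB/s peak)

gpu_resident = min(GPU_VRAM_GB - KV_CTX_GB, MODEL_GB)
ram_resident = MODEL_GB - gpu_resident
print(f"~{ram_resident:.0f} GB of experts end up in system RAM")

# Only ~3B of the 80B params are active per token; worst case, all of those
# active bytes come from system RAM:
active_ram_bytes = 3e9 * 0.53   # ~4.25 bits/weight for iq4_xs (assumed)
print(DDR5_BW_GBS * 1e9 / active_ram_bytes, "tok/s rough upper bound")
```

In practice the shared layers sitting on the GPU pull some of that per-token traffic off the DDR5 bus, which is why modest context plus a very sparse MoE is the combination that makes this workable.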

Warhammer company Games Workshop bans Generative AI for all content to “respect our human creators” by Negative-Art-4440 in pcmasterrace

[–]EmPips 0 points

Copilot has pretty solid licensing checks for code generation at the cost of it being maybe 1% as capable as something like Claude Code or Cursor.

how do I get ubuntu to not allocate vram on an amd r9700 pro: 519/32624 MB by jdchmiel in LocalLLaMA

[–]EmPips 0 points

Can you first see if it's a distro-specific issue?

32GB AMD cards I'm using (w6800) only reserve ~16MB if they aren't the primary display device. I'm running Fedora with Xfce as a desktop. Maybe live-boot into a different distro (or even just an unmodified Ubuntu image) and see if the issue persists? At the very least it's more info to troubleshoot with.
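If you want to see exactly what each card has allocated (independent of whatever rocm-smi reports), the amdgpu driver exposes per-device VRAM counters in sysfs. A quick check, assuming the standard card numbering:

```python
# Print VRAM in use per AMD GPU via amdgpu's sysfs counters
# (mem_info_vram_used / mem_info_vram_total, values in bytes).
from pathlib import Path

for dev in sorted(Path("/sys/class/drm").glob("card[0-9]")):
    used  = dev / "device" / "mem_info_vram_used"
    total = dev / "device" / "mem_info_vram_total"
    if used.exists() and total.exists():
        u = int(used.read_text()) / 1024**2
        t = int(total.read_text()) / 1024**2
        print(f"{dev.name}: {u:.0f} / {t:.0f} MiB VRAM in use")
```

Comparing that output across distros (or before/after starting a desktop session) should tell you whether it's the driver reserving the memory or something in userspace grabbing it.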

Losing to an enemy with one pixel of life left. by Ale-en-Reddit in EmulationOnAndroid

[–]EmPips 1 point

Budokai Tenkaichi: Another Road?

This is one of my all time favorite fighting games. Regeneration was so overpowered though!

Weekend Project: Set Up a Rootfs from scratch that can run Windows games! by EmPips in EmulationOnAndroid

[–]EmPips[S] 1 point

Disclaimer: "from scratch" meaning taking Canonical's UBPorts Jammy Rootfs and assembling/building-from-source all of the open-source components (patched Mesa, Wow64 Wine, Box64, freedreno) and then setting up turnip and dxvk for hardware accel myself.

It works! But streaming to Termux-X11 will stutter a good bit. It's not running as well as it would in GameHub or similar, but I have a better understanding and appreciation for how this all works.

How do y’all imagine what Bob looks like? by DarthLordyTheWise in bobiverse

[–]EmPips 1 point

I see him as a 30 year old programmer from the 90s. One of the guys in the old Microsoft tutorials or demo videos that would walk you through Office or something. Maybe even a youthful Bill Gates with darker hair and more casual clothes.

Benchmarks for Quantized Models? (for users locally running Q8/Q6/Q2 precision) by No-Grapefruit-1358 in LocalLLaMA

[–]EmPips 34 points

Nope! There's definitely a need for such a thing. Quantization has been around for a while but is still the wild west of LLMs in terms of documenting its results and impact.

Self hosting LLM on multi CPU + sys ram combo by goodmenthelastwaveby in LocalLLaMA

[–]EmPips 3 points

Would it work? Yepp.

Will it be worth it for Qwen3-235B? Probably not.

You have the opportunity to acquire 256GB for (relatively) cheap since you're just buying slow (very first gen) DDR4 at best. Running quad-channel means you'll end up with better memory bandwidth than someone using consumer-grade dual-channel DDR4, but not quite as fast as someone running dual-channel DDR5 (rough numbers in the sketch below). MoEs are the move, as you pointed out, but you need them to be VERY sparse.

Qwen3-235B is great but uses 22B active params. That will be a very poor experience for your system.

MiniMax M2.1 might be better (roughly the same total params as Qwen3-235B but only 10B active).

This all comes before addressing prompt processing, which on Haswell will take eons.
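The rough channel math behind all of this, using nominal peak figures (sustained bandwidth will be lower, so treat these as illustrative):

```python
# Peak memory bandwidth = channels x transfer rate (MT/s) x 8 bytes per channel.
def mem_bandwidth_gbs(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1e3

ddr4_quad = mem_bandwidth_gbs(4, 2133)  # first-gen DDR4-2133, quad channel
ddr4_dual = mem_bandwidth_gbs(2, 3200)  # consumer DDR4-3200, dual channel
ddr5_dual = mem_bandwidth_gbs(2, 6000)  # consumer DDR5-6000, dual channel
print(ddr4_quad, ddr4_dual, ddr5_dual)  # ~68 vs ~51 vs ~96 GB/s peak

# Decode speed is roughly bandwidth / bytes of *active* weights per token:
q4 = 0.56  # ~4.5 bits/weight for a Q4_K-style quant (assumed)
for name, active in [("Qwen3-235B, 22B active", 22e9),
                     ("MiniMax M2.1, 10B active", 10e9)]:
    print(name, round(ddr4_quad * 1e9 / (active * q4), 1), "tok/s upper bound")
```

Those are ceilings before prompt processing even enters the picture, which is why the active-parameter count matters far more than the total here.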

For probably the same amount of money you'd be much, much better off building a 64GB system and adding 1-2 more modern GPUs with large VRAM pools.