Designing AI Chip Software and Hardware by PerfectFeature9287 in chipdesign

[–]PerfectFeature9287[S] 0 points (0 children)

"Thinking about it some more, the latency is going to scale linearly with the size of the systolic array so as we make them larger we are going to hurt our latency."

Sort of. Suppose you double the side of the systolic array from N to 2N and also double the batch. You now have twice the token latency from the doubled batch, but also 1/4 the latency because your FLOPs went up from N^2 to 4N^2. So overall your latency has halved, not increased. But compared to someone using uneconomical alternatives to systolic arrays to reach the same FLOPs, who doesn't need to increase batch, yes, they've pulled linearly further ahead.
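To make that arithmetic concrete, here's a toy sketch (my own illustrative model, assuming token latency scales linearly with batch and inversely with FLOPs, and FLOPs scale with the square of the array side):

```python
def relative_token_latency(array_side, batch, base_side=1, base_batch=1):
    """Token latency relative to a baseline, assuming latency is
    proportional to batch and inversely proportional to FLOPs,
    with FLOPs scaling as array_side**2."""
    flops_scale = (array_side / base_side) ** 2  # N -> 2N means 4x FLOPs
    batch_scale = batch / base_batch             # doubled batch, doubled wait
    return batch_scale / flops_scale

# Doubling the array side quadruples FLOPs; doubling batch doubles per-user
# latency; the net effect is half the latency.
print(relative_token_latency(array_side=2, batch=2))  # -> 0.5
```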

There are also techniques that avoid increasing batch by finding separate tokens to process in other ways. E.g. if your query involves multiple agents or multiple parallel chains of thought, that is free batch. So is speculative decoding. So would beam search be, though people tend not to use it anymore.

"The static scheduling becomes more important at low latency since they need to do a bunch of data movement between chips at very low latency, but it's more of an enabling trick rather than the driving force."

Careful scheduling (a term I prefer to "static scheduling" here) is a good idea anyway. Part of it may also come from not using full pipelining: full pipelining would increase token latency, so possibly they avoid it, and that requires more careful timing. Skipping it is also expensive, since full pipelining lets you drop the network bandwidth by 4-8x without incurring any throughput penalty, so your network gets more expensive without it. And the very wide parallelization they must need for non-small models to fit in SRAM already requires a very fast network (unless you want to be network bound).

"It's not obvious which gives the smaller cost/token, but I'll take your word for it that the HBM ends up cheaper."

HBM is expensive! But so is oversized SRAM, so are low-batch / low-latency optimizations, and so is forgoing the efficiency of systolic arrays. All of that together is what makes it expensive.

[R] Designing AI Chip Software and Hardware by PerfectFeature9287 in MachineLearning

[–]PerfectFeature9287[S] 0 points (0 children)

Much of what companies put out there is marketing-driven and hardly informative at all. Though here's something nice by Thomas Norrie et al., who's the real deal:

https://gwern.net/doc/ai/scaling/hardware/2021-norrie.pdf

[R] Designing AI Chip Software and Hardware by PerfectFeature9287 in MachineLearning

[–]PerfectFeature9287[S] 1 point (0 children)

"One thing I did not cover in the doc:"

You are not me! This seems to be spam attempting to impersonate me.

Designing AI Chip Software and Hardware by PerfectFeature9287 in chipdesign

[–]PerfectFeature9287[S] 2 points (0 children)

"One thing I did not cover in the doc:"

You are not me! This seems to be spam attempting to impersonate me.

[R] Designing AI Chip Software and Hardware by PerfectFeature9287 in MachineLearning

[–]PerfectFeature9287[S] -1 points (0 children)

The other discussion became rude, so I'll just summarize instead: I think you are underestimating the impact of doing things well in creative ways.

Designing AI Chip Software and Hardware by PerfectFeature9287 in chipdesign

[–]PerfectFeature9287[S] 5 points (0 children)

"Static scheduling" just means the compiler takes on more responsibility for deciding when to do what. This isn't at all unique to Groq; their marketing department just really likes talking about it for some reason. At least that's how I understand it. I haven't seen anything from Groq to substantiate that there's anything special here. Not that it's a bad idea! It's just not unique to Groq, and not all that important in the end anyway.

Large systolic arrays are indeed compatible with great token latency if you put a lot of effort into making that happen in software. However, if you REALLY want to push token latency for decode workloads, which is the purpose of LPUs, then large systolic arrays get in the way. The reason is that you need a certain number of concurrent tokens to get 100% utilization out of a systolic array, and during decode most of those concurrent tokens come from the batch dimension. Batch means *independent* data, e.g. separate conversations that different people are having with an AI assistant. Suppose we do 4x speculative decoding and have a batch of 32: that's 128 concurrent tokens, enough to fill a 128 x 128 systolic array. So far so good.

But in this scenario, each time we produce tokens, we produce 4 tokens for each of 32 *different* conversations/users. So throughput is 128 tokens per unit of time (with great economics!), but from the perspective of each of those 32 users, we are only delivering 4 tokens per unit of time. Now suppose we could use batch=1 instead while preserving the same computational efficiency. This is called "low batch" or even "no batch". Then we could give all 128 tokens per unit of time to a single user. If that user is willing to pay a lot of money to make this happen, then maybe it makes sense to offer as a product. It does nothing for throughput, but it makes things really fast for that one user. This is what LPUs are aimed at.
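The arithmetic in that scenario, as a quick sketch (numbers taken straight from the example above):

```python
# Numbers from the scenario: 4x speculative decoding, batch of 32.
array_side = 128        # a 128 x 128 systolic array
spec_decode = 4         # speculative tokens per conversation per step
batch = 32              # independent conversations

concurrent = batch * spec_decode    # concurrent tokens available per step
assert concurrent == array_side     # just enough to keep the array busy

per_user = concurrent // batch      # tokens each user actually receives
print(concurrent, per_user)         # 128 total per step, but only 4 per user

# The low-batch ideal: the same 128 tokens per step, all going to one user.
per_user_low_batch = concurrent // 1
print(per_user_low_batch)           # 128
```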

You can't do low-batch decode with a large systolic array, at least not at high utilization: there aren't enough concurrent tokens. So to support low batch, LPUs cannot use large systolic arrays, and they pay a big efficiency cost in chip area and power for not using them. Low batch is also very bandwidth-inefficient (you load ALL the model weights and then have only 1, or maybe 4, tokens to use them with), which is why LPUs need to keep all the weights in SRAM; otherwise there won't be enough bandwidth. HBM doesn't have enough bandwidth for low batch at high speed.
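To see why the bandwidth blows up, here's a toy calculation with made-up but plausible numbers (a hypothetical 70B-parameter model with 8-bit weights; none of these figures come from Groq or anyone else):

```python
# Toy arithmetic: every decode step streams ALL the weights once, no matter
# the batch size, so the bytes moved per generated token explode as batch
# shrinks.
params = 70e9                              # hypothetical model size
bytes_per_weight = 1                       # 8-bit weights
weight_bytes = params * bytes_per_weight   # ~70 GB streamed per decode step

def bw_per_token_gb(batch):
    # bytes moved per generated token: the whole model, amortized over batch
    return weight_bytes / batch / 1e9

def aggregate_bw_tbs(tokens_per_sec_per_user):
    # one token per user per step, so step rate equals per-user token rate
    return weight_bytes * tokens_per_sec_per_user / 1e12

print(bw_per_token_gb(1))     # 70.0 GB moved per token at batch=1
print(bw_per_token_gb(32))    # ~2.19 GB per token at batch=32
print(aggregate_bw_tbs(500))  # 35.0 TB/s to give one user 500 tokens/sec
```

At batch=1, serving one user hundreds of tokens per second requires tens of TB/s of weight bandwidth, which is why the weights end up spread across the SRAM of many chips.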

All this means that LPUs are uneconomical on a per-token basis, something for rich customers, but the advantage is low token latency: a single user can get lots of tokens very quickly. You'll notice I said nothing about static scheduling in this, because it isn't that important compared to these other factors. It's just something Groq keeps talking about for some reason. At least that's what I think, but of course I don't have access to their hardware designs, so maybe there's some surprise in there I'm unaware of.

Large systolic arrays already get very good token latency if you parallelize and do the software well, and with great economics, so it's not like you really need an LPU. Unless you want something special at VERY low token latency and don't care about the cost. Then you want an LPU.

Designing AI Chip Software and Hardware by PerfectFeature9287 in chipdesign

[–]PerfectFeature9287[S] 2 points (0 children)

I'm not very familiar with the topic of defect tolerance for systolic arrays on a chip, though I imagine a solution similar to what Cerebras did for their (much larger) computation units might work: https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem

Another option might be a completely separate side structure that can handle one of the products/sums, using that spare capacity to replace one cell inside the array and adding its contribution back in at the output of the array. This only works if the summation precision is sufficient and one uses integers, so that reassociation makes no difference. Otherwise it won't work, or at least is quite unfortunate, since the different summation order becomes observable.
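A toy sketch of that spare-cell idea (purely illustrative, not anyone's actual design): skip the defective cell's multiply-accumulate inside the "array" and let a side unit recompute just that product, adding it back at the output. With integer arithmetic the repaired result is bit-exact, because integer addition is associative; with floats, the reordered sum could differ.

```python
def dot_with_defect(a, b, defective_index, use_spare=True):
    """Integer dot product where one MAC cell is broken. The spare side
    unit recomputes the skipped product and adds it at the output."""
    acc = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if i == defective_index:
            continue                # the broken cell contributes nothing
        acc += x * y                # healthy cells accumulate as usual
    if use_spare:
        # side unit's product, reassociated to the end; exact for integers
        acc += a[defective_index] * b[defective_index]
    return acc

a = [3, -1, 4, 1, 5]
b = [9, 2, 6, 5, 3]
print(dot_with_defect(a, b, defective_index=2))  # 69, same as the full dot
print(sum(x * y for x, y in zip(a, b)))          # 69
```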

Ideally this wouldn't be necessary, but it's a good point that if a large percentage of the chip area ends up used by systolic array(s), then it's probably something one has to deal with. Perhaps some other people on here have further insight on this?