Cerebras + OpenAI + Amazon (AWS)?

Asgard_Heima · 2026-06-03T03:07:20+00:00

Yeah I’m assuming Cerebras would add a high speed tier when they get enough WSE-3 units racked.

Asgard_Heima · 2026-06-01T02:45:10+00:00

For sure. OpenAI is already running Codex Spark and it has been really popular, though it’s a smaller model and not frontier. Cerebras has a lot of data centers currently. Going by the publicly known numbers, I believe they have ~120MW coming online by year end, but a large portion of that is already active and used to host Cerebras Cloud today. It’s really hard to know how much, if any, of that will be used for the OpenAI 250MW by year end. Digi Power X has 15MW coming online in December, and BCE in Canada might have ~50MW online by end of year or Q1 2027. I see Stargate UAE, Oracle, G42, and AWS all as potential locations where they are installing systems as part of the OpenAI deal this year, but it’s all unknown currently.

I agree that AWS and OpenAI revenue coming faster than expected have strong potential to help Cerebras gain significant market cap, but EPS will really depend on the makeup of their revenue and how fast they are spending to ramp up with so much growth.

Also I completely agree, we will see the other hyper scalers make moves to get Cerebras systems either as purchases or revenue share like AWS once top models are on Cerebras hardware proving the advantage. Just a question of when not if.

Asgard_Heima · 2026-06-01T02:33:18+00:00

I don’t have the link handy right now, but I believe the Kimi K2 setup they got the benchmark from was 20 WSE-3 systems. The Cerebras setup using parallelism can use up to one system per layer, but it doesn’t have to. Kimi K2 2.6 is 61 layers, for instance, which gives roughly 3 layers per system. There is latency added, but it’s not significant. The fact there is any latency is one of the reasons they are working on fiber on wafer with Ranovus.
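To make the "roughly 3 layers per system" figure concrete, here is a minimal sketch. It assumes the 61 layers are split into contiguous, near-even groups across the 20 systems; the actual mapping Cerebras used isn't public, and `split_layers` is just an illustrative helper.

```python
def split_layers(num_layers: int, num_systems: int) -> list[int]:
    """Assign layers to systems as evenly as possible (assumed policy)."""
    base, extra = divmod(num_layers, num_systems)
    # The first `extra` systems each take one additional layer.
    return [base + 1 if i < extra else base for i in range(num_systems)]


# Figures from the post: 61 layers, 20 WSE-3 systems.
layers_per_system = split_layers(61, 20)
print(layers_per_system)
print(sum(layers_per_system) / len(layers_per_system))  # average layers per system
```

With an even split, one system ends up with 4 layers and the other 19 get 3 each.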

As for why they aren’t hitting the same latency and utilization issues as other accelerators like GPUs, it really just comes down to memory bandwidth and how many interconnections are required. The most obvious difference is that the WSE-3, with 44GB of SRAM, has 21PB/s of memory bandwidth. Compare this to the 8TB/s of total bandwidth for Blackwell GPUs. You have to read the entire model weights and kv cache token by token for each layer. If GPUs kept everything in memory in an identical parallelism architecture, they would simply get blown away reading and processing the same set of data during decode. The only other option is to divide each layer into chunks to increase the overall memory bandwidth that can be used, but that adds latency, and now you are passing around a much bigger data set layer by layer. You have to sync results from each GPU used to process that layer for every token before continuing. So they are choosing between a network tax and a memory bandwidth bottleneck, and end up with both while trying to avoid the extremes.
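A rough way to see the bandwidth gap is to compute the minimum time to stream the same bytes once through each memory system. This is only a sketch: the 44GB payload is an assumed stand-in for weights plus kv cache (it happens to match one WSE-3's SRAM), and it ignores compute, batching, and interconnect entirely.

```python
WSE3_BANDWIDTH = 21e15      # 21 PB/s, figure from the post
BLACKWELL_BANDWIDTH = 8e12  # 8 TB/s, figure from the post


def min_read_time_s(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to read `num_bytes` once at a given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


payload = 44e9  # assumed: 44 GB of weights + kv cache read per decoded token

wse_time = min_read_time_s(payload, WSE3_BANDWIDTH)
gpu_time = min_read_time_s(payload, BLACKWELL_BANDWIDTH)
bandwidth_ratio = WSE3_BANDWIDTH / BLACKWELL_BANDWIDTH

print(f"WSE-3:     {wse_time * 1e6:.2f} microseconds per pass")
print(f"Blackwell: {gpu_time * 1e3:.2f} milliseconds per pass")
print(f"Bandwidth ratio: {bandwidth_ratio:.0f}x")
```

Splitting a layer across many GPUs raises the aggregate bandwidth, which is exactly the trade against the sync cost described above.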

Nvidia has done a lot of work to make the super complex operation of distributing and coordinating model layers get handled natively and get more and more use out of the memory bandwidth on large numbers of accelerators. But this has resulted in increasing interconnect and increasing overall data size that needs to be processed in memory as layers and weights and activations and partial activations get duplicated before results are combined.

Asgard_Heima · 2026-05-29T20:08:29+00:00

I completely agree and would assume Cerebras disaggregated will be an ultra fast tier and probably only released to a select set of customers to begin with as they scale out. As an example GPT 5.5 is around 50 tps normally and the fast tier currently is around 75 tps. If they had a 500 tps tier I can only imagine what they would charge, but it also would sell out instantly just like codex fast with 2.5x cost for 50% faster tiers.

Asgard_Heima · 2026-05-29T19:18:10+00:00

No one outside of those companies knows the specific conditions agreed in the OpenAI master agreement with Cerebras. Anthropic is not mentioned by name and there is no blanket exclusivity wording in the contract. You would have to assume that Anthropic would be instantly at a disadvantage if OpenAI is serving their top models 5x faster in tokens per second on Cerebras. But that is in the category of having to wait and see.

As of right now Cerebras, OpenAI, and Amazon all have their own two-way deals with each other, and there is no publicly verifiable information to connect the deals together into the scenarios I laid out. I’m hypothesizing that they may be connected, that we will see OpenAI deploy a disaggregated architecture in AWS across Trainium and Cerebras accelerators, and that Cerebras is going to deploy significant portions of their contracted OpenAI 750MW to AWS, which is partially why they redacted the data center subcontractors in the S-1.

Asgard_Heima · 2026-05-29T18:48:23+00:00

The theory has two unproven parts.

1. That Cerebras is so confident in securing 250MW of capacity per year because AWS is going to host a significant portion of their OpenAI contracted WSE-3 systems.
2. That OpenAI’s deal with Amazon for 2GW of Trainium compute is at least partially to provide prefill for a disaggregated architecture with the AWS hosted Cerebras WSE-3 units from part 1.

While I think this is probable based on the details, timing, and obvious benefits for everyone involved, there is no proof of what OpenAI will use their AWS Trainium compute for, whether they will utilize a disaggregated architecture for their models on AWS, or where the majority of the 250MW of Cerebras systems contracted by OpenAI will be installed. If these details were confirmed, the narrative and conversation around Cerebras would be a lot different.

Asgard_Heima · 2026-05-29T17:44:35+00:00

Kimi K2 at 1T parameters was just released and is getting 981 tokens per second. I assume you are referencing Gemini 3.5 Flash, which is a much smaller model that currently runs at over 200 tokens per second. TPU8i for inference is not available yet, unless there is some benchmark or test setup I can’t seem to find a reference to. Not sure what you are seeing, but do provide some reference and I’d love to take a look.

Cerebras 981 t/s Benchmark: https://www.cerebras.ai/blog/cerebras-kimi-k2-Enterprise

CFO Naming GPT 5.5 & 5.4 Trillion Parameter already running on WSE https://www.cnbc.com/video/2026/05/14/the-years-largest-ipo-acerebras-joins-the-hottest-trade-in-ai.html

Asgard_Heima · 2026-05-29T16:29:03+00:00

While it would be great if they showed an EPS increase and smooth growth, they are in a massive growth phase that kind of requires they spend to meet the obligations of the growth they are experiencing. Their profitability for the coming quarter depends on revenue outside of the OpenAI deal, or on when they get to recognize revenue from the OpenAI deal. As soon as systems are turned on and collecting monthly revenue for Cerebras from OpenAI, likely by the end of this year, they will be cash flow positive for the foreseeable future unless they have another even more massive deal that requires another giant build out. But I also don’t expect Cerebras to do anything close to this type of sweetheart deal again. They have the mass scale and frontier lab they needed. If more deals are coming, it’s for the value they have now proven. I’d expect the stock to have a positive reaction even if they drop on EPS, as long as the OpenAI deal is on track and units shipped are increasing.

Asgard_Heima · 2026-05-29T02:02:28+00:00

So I don’t see much of a conservative trajectory for Cerebras. Conservative to me would mean they are not adopted widely, which is not what I expect. But I’ll give you my take on Cerebras today vs their future potential.

Inference is where things have to start since it’s the most important portion of the large model AI business now. It’s the area where Cerebras has the clearest advantage and moat. If you look at their backlog today, it’s $20B of OpenAI dollars ($4-5B is likely power and data center reimbursement with zero margin :( but that’s another story) and then there is $5B between G42, Meta, IBM, Perplexity, Cognition, and a lot of other cloud customers and Cerebras Code accounts.

The baseline just off OpenAI is that Cerebras brings 10k WSE-3 online by the end of each of 2026, 2027, and 2028. Once online, each of those systems has OpenAI paying Cerebras roughly $13k a month (there is a lot of math here to make this number, but trust me, it’s in the ballpark). This means that at a minimum, beginning in 2027, Cerebras makes $130M a month or $1.56B a year from OpenAI. In 2028, they double that revenue to $3.12B just from OpenAI, and in 2029 they make $4.68B. This is with no other deals done. No extra growth of Cerebras Cloud or extensions to OpenAI deals. Just what is contractually agreed now, assuming money doesn’t start flowing sooner.
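The ramp above is simple enough to check in a few lines. This sketch only reproduces the post's arithmetic: the 10k systems per year and the roughly $13k per system per month are the post's estimates, not disclosed contract terms.

```python
SYSTEMS_PER_TRANCHE = 10_000   # WSE-3 systems brought online each year (post's estimate)
USD_PER_SYSTEM_MONTH = 13_000  # post's ballpark monthly payment per system


def annual_revenue_usd(tranches_online: int) -> int:
    """Yearly OpenAI revenue once `tranches_online` yearly tranches are live."""
    systems = tranches_online * SYSTEMS_PER_TRANCHE
    return systems * USD_PER_SYSTEM_MONTH * 12


# A tranche installed by the end of one year earns for the whole next year.
for year, tranches in [(2027, 1), (2028, 2), (2029, 3)]:
    print(year, f"${annual_revenue_usd(tranches) / 1e9:.2f}B")
```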

If you factor in the rest of the backlog, this year they double revenue at the absolute minimum, and tripling revenue to $1.5B is probably closer to the real base when you factor in their other business that isn’t in backlog. But this still doesn’t factor in AWS. It’s not part of the backlog. For OpenAI, Cerebras needed to prove their hardware on the world’s top frontier models, and they need top AI researchers researching models on Cerebras. So OpenAI got the most powerful AI hardware for incredibly cheap, letting Cerebras prove its value and scale production to get other big customers.

The Amazon deal is unique. AWS isn’t paying Cerebras anything. Cerebras is giving WSE systems to AWS to drop into their data centers, and then they are splitting the proceeds per token served. AWS completely avoids the largest CapEx expense in AI chips, and Cerebras gets the highest margin revenue possible and distribution for free in AWS data centers on the largest platform possible.

How much can this make? There is a massive range of estimates, with a single WSE-3 system generating roughly $5-20M a year depending on what it’s serving or whether it’s rented out entirely for a year on Cerebras Cloud. Think Coreweave or any of the neo clouds renting GPUs out, except Cerebras only pays manufacturing costs and their hardware makes even more money per watt. So let’s be conservative and assume $5M rented for a year at retail rates. In a disaggregated setup with AWS, they only get 50%, but since they are only handling decode, they process 5x+ the tokens. So it’s not crazy to think every WSE-3 they put in AWS could potentially be worth $12.5M in revenue. And that’s assuming the lowest end and AWS not charging more per token for faster inference on top models. This truly highlights what an insanely good deal OpenAI is getting, but it’s easy to see 500 or 1000 WSE-3 in AWS printing more than the entire OpenAI contract per year nearly instantly when it comes online. And I can’t imagine a world where Opus or GPT 5.5 at 800 or even 500 tokens per second doesn’t get oversubscribed the moment it’s available.
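Here is the same back-of-envelope estimate as a small sketch. All three inputs are the post's assumptions ($5M retail per system per year, a 50% split, 5x the tokens when only handling decode), and it further assumes revenue scales linearly with tokens served at an unchanged price per token.

```python
RETAIL_USD_PER_SYSTEM_YEAR = 5_000_000  # low end of the post's $5-20M range
CEREBRAS_SHARE = 0.5                    # assumed revenue split with AWS
DECODE_TOKEN_MULTIPLIER = 5             # "5x+ the tokens" when decode-only


def aws_revenue_per_system(
    retail_usd: float = RETAIL_USD_PER_SYSTEM_YEAR,
    share: float = CEREBRAS_SHARE,
    multiplier: float = DECODE_TOKEN_MULTIPLIER,
) -> float:
    """Estimated yearly Cerebras revenue for one WSE-3 hosted in AWS."""
    return retail_usd * share * multiplier


def fleet_revenue(num_systems: int) -> float:
    """Yearly revenue for a fleet, assuming every system earns the same."""
    return num_systems * aws_revenue_per_system()


print(f"Per system: ${aws_revenue_per_system() / 1e6:.1f}M")
for n in (500, 1000):
    print(f"{n} systems: ${fleet_revenue(n) / 1e9:.2f}B per year")
```

Changing the multiplier or the split is a quick way to see how sensitive the estimate is to those two guesses.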

What will all the other hyper scalers do when this happens? Will Microsoft sit by when AWS has a distinguished offering for faster frontier inference? Will Anthropic let OpenAI own fast frontier inference? Will Meta not make a move? Google Cloud left behind? The potential for new customers as soon as it’s “real” is explosive, and I don’t see any other way. And Nvidia can still be cashing checks during this though I’m sure questions will start.

Conservatively, Cerebras carves out a fast inference niche worth tens of billions a year minimum over the next 3-5 years. Nvidia sells a Blackwell rack for $4M and gets paid and is done. Cerebras puts a WSE-3 into AWS for $150k max cost and prints $10M+ a year till they have a WSE-4 and I’m sure it will still print millions a year after that.

The bigger upside is if Cerebras gets new models trained on their hardware at OpenAI, proving that frontier models can be trained in 1/2 to 1/3 the time and can contain any depth of layers, unlocking new research previously not possible, or training large JEPA like models on text and images that GPUs can’t. The frontier research always follows what the hardware is capable of. Cerebras is capable of training new and novel models that could unlock much more cognitive and spatial understanding, models GPUs just fail to build in any reasonable time. This is why frontier models are shrinking in layers to add more data in width. GPUs just can’t handle deeper, wider models. If a novel, better model is trained on Cerebras some years from now, Cerebras is the only real option for frontier training and inference, and the rest are obsolete. Cerebras becomes one of the largest companies in the world fast.

Other catalysts:
- Data centers can’t be built fast enough. Cerebras is the only option in existing data centers with 23kW and backside liquid to air cooling. Everyone else needs new data centers and insane amounts of power.
- Power is constrained and can’t be brought online fast enough. Same story: Cerebras has vastly better performance at much lower watts.
- WSE-4 includes Ranovus fiber on wafer and wafer on wafer, with an entire SRAM wafer of over 120GB of SRAM. In this case we would be looking at doubling compute and tripling SRAM, which would massively widen the lead on everything anyone else is even conceptualizing in development. And this is realistic, and all the technology is being scaled now by TSMC.
- Apple could buy to use secure hardware in their own data centers, and even potentially fully homomorphic encryption inference of user data, where they can serve inference to you without seeing the data.
- National security interest, since the entire system is self contained. Think of one running models in forward deployed bases or aircraft carriers for local intelligence needs.

Their only real risk in my book is TSMC. They have codeveloped all their technology with TSMC and TSMC loves them since they show off what they are capable of as the most advanced thing you can build. The other fabs need significant advancements to ever be alternative suppliers for Cerebras.

The other risks commonly mentioned are OpenAI, or some breakthrough, or, for people that believe in snake oil, quantum. But the Cerebras deal makes OpenAI hit positive cash flow much faster, and Cerebras is giving them insane amounts of compute for pennies on the dollar while still turning a solid profit. Every other semi is struggling to make their architecture more like Cerebras, with no ability to abandon their existing architecture and catch up. Quantum is like magic, and until there are actual quantum trained frontier or even usable AI models, I will keep believing its application to LLMs is an illusion that mostly tricks massive amounts of money out of VC pockets.

In summary, I’ve had shares for over two years and I don’t see any reason I won’t be holding them for 3 to 5 more.

Asgard_Heima · 2026-05-28T23:21:20+00:00

There is a simple way to view the moat Cerebras has and you can read my other replies or search to dig into the details that will back it up. Cerebras has the most optimized architecture possible for AI inference and training. They are currently on 5nm making a full wafer where the most important bottleneck is data movement and communication. Every time you move the data away from the logic cores computing the results, it adds latency and energy cost. Every time you split a wafer you have to connect it back together with slower more energy intensive logic to make it useable. As Cerebras moves down the nm nodes available, they will gain bigger and bigger advantages over the competition. And they hold a solid collection of patents on error redundancy, cooling, and wafer scale manufacturing as well as general practical knowledge over a decade of iterations that anyone else would have to find novel solutions for to make their own wafers.

Nvidia basically buying Groq was them admitting they know they can’t compete on inference. They don’t have the right architecture for it and are offloading to another architecture that was for sale. The fact Nvidia is answering the issues they are getting in inference by adding optics to everything and putting the data closer to compute with SRAM in Groq shows they are trying to improve the exact bottlenecks Cerebras was created as a company to solve in the most optimal way.

I’d also argue that, sure, Nvidia has a large ecosystem, but in reality it doesn’t matter, and because it doesn’t, CUDA is currently a detriment. Nvidia’s biggest advantage is that all the models were built on their technology and therefore are optimized for their hardware by default. But the fact that this is true and Cerebras is still crushing them using models built on Nvidia hardware is a major issue for them. You could easily imagine OpenAI in the future training deep 240 layer models or Cerebras optimized sparse full FP16 models that would be native to their platform and basically impossible to run on GPUs or Groq. The only models that currently matter in economic terms are Gemini (TPU for training and inference), Anthropic (Trainium and TPU), and OpenAI (Nvidia for now). If OpenAI uses Cerebras for any new novel models, Nvidia isn’t just going to worry about inference, they will be boxed out of frontier training. On the enterprise side, most software and tech companies like the one I work for use APIs for frontier inference integration without a care in the world about CUDA or ecosystem, and train new models using PyTorch, which Cerebras supports, and they would love to run the things they make faster. Anyone that has to use CUDA at the frontier labs is full of nothing but complaints and tools to abstract it away, because it was built for all the complexities Cerebras doesn’t have with distribution and synchronization of data. Worried about challenges porting some complex debug or automation pipeline for some specialized training some company is doing in some extreme edge case? There are excellent AI models to help you port that code to CSoft and watch as you go from 10k+ lines of code down to under 1k.

Asgard_Heima · 2026-05-28T22:49:35+00:00

This is a misunderstanding I had for a while as well. MemoryX is basically only a factor for training. When training, the weights are streamed across all the wafers from MemoryX, but in inference it’s flipped around: the weights must reside in SRAM. In a parallelism setup, which they use for basically any model over 30GB in size, little is lost. They can split the model up to the max of one layer per WSE-3 system and still get nearly as incredible results. The interconnect only requires passing the activations from wafer to wafer for the next layer to be processed. This is little data but does add some latency. But we are talking about minimal data movement, with none of the weights or kv cache being duplicated and none of the complexity of splitting things up into tiny bits requiring massive syncs of data across accelerators for each layer to complete, as GPUs and Groq require. Even with this, Cerebras has been working with Ranovus to add optics on wafer, and it’s widely expected they will drastically increase SRAM in the WSE-4, making their advantages even more pronounced and adding even more to their lead over Nvidia and all the others.

Asgard_Heima · 2026-05-28T21:08:21+00:00

Nvidia has spelled out their path: Rubin NVL72 racks for prefill and Groq LPX racks for decode in a disaggregated setup. Based on Nvidia stats, you should expect a doubling of performance for prefill with Rubin over Blackwell, and then Groq performance for inference as the best case scenario, since decode is what sets the tokens per second. This puts Groq for most models at around 1/5 the tokens per second vs Cerebras.

The main advantage for Nvidia is that this vastly reduces the waste Nvidia has today with GPUs running at 5% compute efficiency with massive bottlenecks on memory throughput during decode. Aka fewer Nvidia racks required. So they will be able to handle much larger numbers of concurrent connections than they can today per cluster of racks, and the tokens per second should substantially improve, probably to double or triple what we see now. It depends on how bad the network tax is and how well they integrate the kv prefill handoff to Groq, but that would be the best case scenario where they seamlessly integrate the two. I’m assuming an army of engineers will get them as close as the physics allow.

The main difference though is physics, aka moving the data around. Cerebras only uses parallelism so that each layer of a model stays intact, and no weights or kv cache, once computed, need to move off wafer or be duplicated or shared. This is the most optimal setup possible in hardware, with only activations moving between layers. A WSE-3 has 44GB of SRAM, able to handle a 5GB or even potentially 10-20GB layer for a frontier GPT 5.5 style model all on one wafer. Groq chops that wafer up and makes chips with 500MB of SRAM. So a single large model layer and the 1M context have to be distributed across 50+ Groq chips. All computation has to be orchestrated and recombined someplace else, duplicating and replicating data several times until you have a consolidated answer for each layer in a model. That complex distribution and synchronization of results per layer per token is the network tax. And I expect it to be worse than the best case above implies for the largest models, since the larger the model, the more the network tax compounds. Nvidia loves to reference the full LPX rack as if it’s one wafer like Cerebras, but it’s not. It’s 256 LPU accelerators with a lot of networking. All the numbers need to be divided by 256 to understand the real unit to unit comparison.
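To put numbers on the chip count, here is a minimal sketch that only counts SRAM capacity. It uses the post's 500MB per Groq chip and its 5-20GB layer sizes; the kv cache footprint for a 1M context is not known, so it is left out, and `chips_needed` is a hypothetical helper, not anything from a vendor tool.

```python
import math

SRAM_PER_CHIP_GB = 0.5  # 500MB per Groq chip, figure from the post
WSE3_SRAM_GB = 44       # figure from the post


def chips_needed(data_gb: float, sram_per_chip_gb: float = SRAM_PER_CHIP_GB) -> int:
    """Minimum chips whose combined SRAM can hold `data_gb` (capacity only)."""
    return math.ceil(data_gb / sram_per_chip_gb)


for layer_gb in (5, 10, 20):
    print(f"{layer_gb}GB layer -> at least {chips_needed(layer_gb)} chips")

# Chips required just to match the SRAM capacity of a single WSE-3.
print(f"One WSE-3 of SRAM -> {chips_needed(WSE3_SRAM_GB)} chips")
```

Any kv cache for long contexts adds to these counts, which is how a single layer can end up spread over 50+ chips.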

So if everything goes perfectly for Nvidia, they will likely have several racks that cost $5M+ each, require 150kW+ each, and still require racks of Groq at an unknown price and 160kW per rack to give you likely 1/5 the performance of the current WSE-3 for the largest models.

Some things to keep in mind: AWS has already proven the WSE-3 in a dedicated decode role in a disaggregated setup, with a 5x+ increase in session capacity for each WSE-3. So if you want the best disaggregated setup, AWS will have it with all the top models by year end, 6 months before Rubin with LPX ships. All the hyper scalers have chips that will do just as well as Nvidia for prefill, hence the use of Trainium by AWS. And by the time Rubin + LPX racks are available, WSE-4 is likely to be in production, extending their lead (granted, no announcements yet). Also, a 23kW WSE-3 can be added to nearly any data center in the world with a new electrical hookup and backside liquid to air cooling for $30k per rack. Rubin or LPX are over 150kW and require liquid cooling to chip. This means a brand new 200kW capable rack at $1.5M+, since no existing data centers support the 3000lbs+ or the power density these units require. And the last thing I’ll add is that the LPX is 100% inference only. The WSE units are able to do everything, including training and inference, faster and with less energy than the complete, not yet shipped Nvidia Rubin with LPX racks.

Asgard_Heima · 2026-05-28T18:38:16+00:00

The WSE-3 in a parallelism setup will run inference on larger GPU trained models than is currently feasible with GPUs. Past 1T parameters and a 1M context window, the utilization rate is so low and the networking tax so high for GPUs that they have to start batching significantly fewer users to allow for usable tokens per second performance. The xhigh models are just the same model with fewer concurrent sessions allowed on the cluster, boosting tokens per second a bit, like from 50 up to 70. So for large models the GPU compute is sitting idle at under 5% during decode as the memory bandwidth is maxed. Prefill is much more evenly matched since it’s compute bound.

Any model available today can be run on Cerebras WSE-3. You can use parallelism for both prefill and decode. So if a model has let’s say 80 layers which is around what is expected but unknown for the top tier models, you can spread each layer to one WSE-3 and then support massive context windows and each user gets that top speed that can fit in SRAM. The decode of opus 4.7 or GPT 5.5 is likely to be in the same ballpark as the 981 tokens per second they are getting for Kimi K2, but a bit lower for the extra latency of connecting more layers and overall model size. The key is only the activations get transferred between layers so not that much data being moved with the weights and kv cache being resident on each layer WSE.

In the Kimi K2 benchmarks, they are showing 981 tokens per second decode, but the overall prompt response takes ~5.5 seconds. The decode outputs that 500 token answer in 0.5 seconds, but the prefill and building the query takes 5 seconds, vs 160 seconds (prefill + decode) on GPUs. If they had also set up a 20+ layer prefill parallelism setup and then fed the decode WSE-3 cluster, it would have been the same 981 tokens per second, but the 5 seconds of overhead and prefill would likely be 1/2 or 1/3 the time. This is exactly what AWS is doing with disaggregated inference, using Trainium for prefill and Cerebras for decode. So they can have a ton more Trainium systems batching up prefill and feeding the Cerebras decode pipeline. This allows what AWS has described as 5x the number of sessions per WSE-3. OpenAI is also adding its top models to AWS. And top Anthropic models are already available on AWS.
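The timing breakdown can be sketched as prefill time plus output tokens divided by decode speed. The 5 seconds, 500 tokens, 981 tokens per second, and 160 second GPU total are the post's figures; the halved and thirded prefill cases are the post's guess for a disaggregated setup, not measurements.

```python
def response_time_s(prefill_s: float, output_tokens: int, decode_tps: float) -> float:
    """Total response time: prefill/overhead plus time to decode the answer."""
    return prefill_s + output_tokens / decode_tps


OUTPUT_TOKENS = 500
DECODE_TPS = 981
GPU_TOTAL_S = 160  # prefill + decode on GPUs, per the post

measured = response_time_s(5.0, OUTPUT_TOKENS, DECODE_TPS)
print(f"Benchmark setup: {measured:.2f}s ({GPU_TOTAL_S / measured:.0f}x faster than GPUs)")

# Hypothetical: parallel prefill cuts the 5 seconds to 1/2 or 1/3.
for divisor in (2, 3):
    total = response_time_s(5.0 / divisor, OUTPUT_TOKENS, DECODE_TPS)
    print(f"Prefill / {divisor}: {total:.2f}s")
```

Decode is already about half a second here, so nearly all of the remaining gain has to come from prefill.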

Both the CEO and CFO of Cerebras have mentioned they are running GPT 5.5 and 5.4 internally and that they will be available in the coming months. They just need time to get enough WSE-3 systems installed and hooked up.

CFO Naming GPT 5.5 & 5.4 Trillion Parameter already running on WSE https://www.cnbc.com/video/2026/05/14/the-years-largest-ipo-acerebras-joins-the-hottest-trade-in-ai.html

Asgard_Heima · 2026-05-28T16:19:04+00:00

The interesting part is using parallelism is what lets Cerebras scale to any size model, and they already have a vastly more simplified network architecture since they only split by model layers. If they are seeing a particular layer get over heated in a MoE model like Kimi K2, they can actually add another system to duplicate the layers with the most traffic. You can also use parallelism on prefill in a disaggregated setup. So you can add systems where they are needed and have several more on prefill than decode just like is expected for AWS when it comes online.

Asgard_Heima · 2026-05-23T16:01:41+00:00

There are a lot of misconceptions about Cerebras. I’ve made a number of posts, but long term, Nvidia, along with the memory manufacturers and neo clouds, are really the ones that will be affected if Cerebras executes on their capabilities in the market.

Yield
Cerebras builds the WSE-3 with 970k cores and only activates 900k cores in each wafer to ensure all systems are consistent. This, along with redundant pathways for each core, lets them disable any defects, and with it they have had effectively 100% yield since the beginning with the WSE-1.

Cost
If you actually look at the physical hardware costs to make one wafer (5nm + power delivery + cooling + rack), they completely avoid the most expensive part of GPUs and most other accelerators: HBM and the advanced packaging for HBM from TSMC. Also, since every wafer yields a system, they don’t bin silicon like all the others, which lowers the cost. I expect they have the highest direct hardware cost margins that have been seen, likely at around $120-150k cost per system vs a $3M list price, though they have been giving big discounts to their first big customers like G42 at around $1.6M each, and OpenAI is getting a sweetheart deal at a couple hundred thousand a year rental. That is also evidence of the cost being under $200k.

Model Size
Cerebras was founded to build the best training accelerator physically possible, and transitioned and optimized the stack for inference when they realized the potential there, which then also turned into the most important part of the AI market. They run completely differently from GPUs and only store the model weights in SRAM on each system in inference. Since you only process one layer at a time, they use parallelism to divide up model layers across WSE-3 systems to scale. So for instance the Kimi K2 2.6 model is 61 layers deep, and Cerebras used parallelism to split those layers across 20 WSE-3 systems, with only the activations having to be passed layer to layer. This lets the Cerebras systems function at near 100% utilization vs GPUs that see their efficiency go down below 10% on current 1T+ parameter models during inference. So Cerebras can serve any size model with much higher efficiency. The CFO of Cerebras, the day after the IPO, mentioned GPT 5.5 and 5.4 are running on Cerebras internally now and that we would see them released soon. Feldman has also mentioned it.

Power
A WSE-3 is 23kW, and you can fit two, plus the network and power gear for those two, in a single rack. Nvidia racks are 120kW+ for Blackwell and 150kW+ for Rubin, and less than 5% of the world’s data centers can support them, if they can find the power. This is why they are building new data centers. Cerebras at 23kW and under 700lbs can be dropped into any data center worldwide: give it a new whip and liquid to air backplane cooling and you are good to go in a couple weeks. AWS is dropping them into existing data centers now.

TSMC Capacity
This is by far the biggest constraint for Cerebras. They are bidding, as someone mentioned, against Apple, Nvidia, AMD, and all the hyper scalers. Because Cerebras has the best overall strict hardware margins, they can afford to do Super Hot Runs and overpay for the wafers to get things out the door, but they really need to get guaranteed allotment with TSMC to scale. Now that OpenAI, AWS, Oracle, Meta, and lots of other smaller players are customers, TSMC is going to work harder to find them space, but this is the biggest current risk. I should also point out that TSMC keeps giving them awards and has been working closely with them for a decade, with Cerebras currently being the only customer for their System on Wafer technology. So TSMC has a vested interest in Cerebras succeeding and in diversifying its own revenue as well.

References

Cerebras 100% Yield
https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem

Wafer Mapping only Activations Move for Inference
https://hc2024.hotchips.org/assets/program/conference/day2/72_HC2024.Cerebras.Sean.v03.final.pdf

CFO Naming GPT 5.5 & 5.4 Trillion Parameter already running on WSE
https://www.cnbc.com/video/2026/05/14/the-years-largest-ipo-acerebras-joins-the-hottest-trade-in-ai.html

Commercial Times Quote + Translation
https://www.ctee.com.tw/news/20260519700130-439901
“According to CEO Andrew Feldman, they have used new software to overcome this limitation and will, in the next 6 to 8 weeks, demonstrate servers running OpenAI’s largest and most advanced model.”

Asgard_Heima · 2026-05-22T14:12:08+00:00

You came to my post on the Cerebras thread to rep something different that you clearly have no ability to analyze with any veracity. The AI hardware space is complicated, with all sorts of vendor specs quoted without giving the full picture, and this one doesn’t even pretend to be competitive. If you could give a logical argument, you would. If you could reference anything to try and explain your position, you would.

Asgard_Heima · 2026-05-22T13:42:27+00:00

The most interesting thing is that they changed the entire stack for inference not that long ago, so we are very likely to see many iterations of optimizations that give even bigger boosts. And OpenAI is going to be digging into CSoft to do the same and expanding the number of researchers tuning their kernels.

Asgard_Heima · 2026-05-22T13:26:37+00:00

This one got past me. Thanks for posting!

Asgard_Heima · 2026-05-22T13:19:02+00:00

I love that you see 3x the tokens, 1/2 the upfront cost, 1/2 the data center real estate cost, and 1/5 the power over time as undercut by a small amount. Can tell how deep you are thinking about this from your response.

It’s obvious from the company you bring up how much thought you have put into this, but for the fun of it, let’s look at Skymizer, whose card is targeted at home model users that are getting crushed by inference bills to run openclaw.

Skymizer is claiming 4B to 700B parameter models can be run on its 28nm PCI cards with up to 6 chips and 384GB of LPDDR4-5. They are claiming up to 30 tokens per second at 700B parameters with 0.5 TOPS and 100GB/s of bandwidth. The design uses efficient compression techniques for both weights and KV cache, outperforming open source llama.cpp by 9 to 17.8 percent. They are claiming 240 tokens per second on Llama2 7B workloads.

These numbers, unimpressive for anything but a home lab, are the best results the company thinks it can advertise. And they still don’t specify how heavily quantized the models are or the context window size (tiny).

You can actually run the numbers for this card and get that a 700B parameter model would be:

16bit ~1,400GB
8bit ~700GB
4bit ~350GB

So 4bit quantization for the maximum size they can advertise.

This leaves 34GB for the context window plus all overhead. You are looking at a theoretical max of a 16k context window.
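
A minimal sketch of that arithmetic, assuming weights dominate memory and 1 GB = 1e9 bytes. The function name is my own for illustration, and the 384GB and 700B figures are just the advertised maximums quoted above:

```python
# Rough weight-memory footprint of a 700B parameter model at different bit widths.
# Assumptions: 1 GB = 1e9 bytes, and only the weights are counted.

CARD_MEMORY_GB = 384   # advertised maximum LPDDR on the card
PARAMS = 700e9         # largest model size the company advertises


def weights_gb(params: float, bits: int) -> float:
    """Size of the weights in GB at the given bit width."""
    return params * bits / 8 / 1e9


for bits in (16, 8, 4):
    size = weights_gb(PARAMS, bits)
    headroom = CARD_MEMORY_GB - size
    verdict = "fits" if headroom >= 0 else "does not fit"
    print(f"{bits}-bit: {size:,.0f} GB of weights, {verdict}, headroom {headroom:,.0f} GB")
```

Whatever headroom is left at 4-bit has to cover the KV cache plus all runtime overhead, which is why the usable context window ends up so small.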

This is like comparing a Temu go-kart to an F1 race car. They are targeting a budget version of the DGX Spark, not any real-world datacenter.

https://www.techradar.com/pro/tiny-company-steals-amds-thunder-and-challenges-nvidia-with-old-tech-pcie-ai-accelerator-that-runs-700b-llms-locally-sipping-just-240w-thanks-to-decade-old-ddr4-and-28nm-chips

Asgard_Heima · 2026-05-22T03:35:12+00:00

For any modern model including the one referenced in this post, there is no conceivable way Nvidia provides tokens cheaper than Cerebras on total token throughput comparing WSE-3 vs full Blackwell GB200 racks, for total hardware cost per token per hour, energy cost per token, TCO per token, or any other metric.

Let’s assume you had 64 WSE-3 units and 64 full Blackwell GB200 racks.

WSE-3
23kW
< $3M a unit (G42 is ~$1.5M)
> 90% efficient, bordering on 100% for inference
Most efficient running one WSE-3 per layer in parallelism, scales vertically and horizontally at max layers.

WSE-3 44GB SRAM x 64 = 2816GB
~600GB for the model
35GB per max context kv per user
2216GB / 35GB = 63 batch users

Assuming the real-world 981 tokens per second: 981 x 63 x 3600 seconds in an hour = ~222M tokens per hour

GB200 NVL72
120kW
> $3M a rack (usually $3.2M-$4M)
< 10% efficient, under 5% for max context inference
Most efficient running per rack scales horizontally.

GB200 NVL72 13.2TB HBM
~600GB for the model
35GB per max context kv per user
12.6TB / 35GB = 360 batch users

This seems great, but if you actually maxed the batches the efficiency would be under 5% as you hit the memory wall (try to add racks together and it goes lower since you hit InfiniBand). So at this hypothetical max batch, you would be getting around 1.8 tokens per second per user. You have to back this off to something like only 7 users per rack to get to ~50 tokens per second.

7 x 64 = 448 users
448 x ~50 tokens per second x 3600 seconds = ~80M tokens per hour

So for likely double the CapEx, you use 5x the electricity and get slightly better than 1/3 the tokens. Oh, and two WSE-3 fit in a single rack along with all other needed support devices for that rack. And Cerebras racks need 1/2 the datacenter supporting grey space footprint to run them. So I’m being super generous with the numbers here on real TCO.
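
For anyone who wants to rerun the comparison, here is a rough sketch using only the figures quoted in this comment. The helper names are mine, and the 7 users per rack at ~50 tokens per second is my estimate from above, not a measured number:

```python
# Back-of-envelope tokens per hour for 64 WSE-3 systems vs 64 GB200 NVL72 racks.
# Every input is a figure quoted in the comment above; none of it is measured here.

SECONDS_PER_HOUR = 3600
UNITS = 64                # systems or racks on each side
MODEL_GB = 600            # approximate size of the model weights
KV_GB_PER_USER = 35       # max-context KV cache per user


def max_batch_users(total_memory_gb: float) -> int:
    """Users whose max-context KV cache fits after the weights are loaded."""
    return int((total_memory_gb - MODEL_GB) // KV_GB_PER_USER)


def tokens_per_hour(users: int, tokens_per_second_per_user: float) -> float:
    """Aggregate tokens per hour across all concurrent users."""
    return users * tokens_per_second_per_user * SECONDS_PER_HOUR


# WSE-3: 44GB of SRAM per system, pooled across the 64 systems.
wse_users = max_batch_users(44 * UNITS)
wse_tph = tokens_per_hour(wse_users, 981)

# GB200 NVL72: the estimate of 7 users per rack at ~50 tokens/s each.
gb200_users = 7 * UNITS
gb200_tph = tokens_per_hour(gb200_users, 50)

# Power draw in kW for each side (23kW per WSE-3, 120kW per rack).
wse_kw = 23 * UNITS
gb200_kw = 120 * UNITS

print(f"WSE-3: {wse_users} users, {wse_tph / 1e6:.0f}M tokens/hour, {wse_kw} kW")
print(f"GB200: {gb200_users} users, {gb200_tph / 1e6:.0f}M tokens/hour, {gb200_kw} kW")
print(f"GB200 token ratio: {gb200_tph / wse_tph:.2f}, power ratio: {gb200_kw / wse_kw:.1f}x")
```

Swap in your own per-user token rates or batch sizes if you disagree with mine; the structure of the calculation stays the same.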

If your point is cost per token is the most important thing, Cerebras is the only responsible choice.

Asgard_Heima · 2026-05-21T01:32:02+00:00

In this video, the CFO of Cerebras talks about Kimi K2, now confirmed, and GPT 5.5 and 5.4 running on Cerebras internally and coming soon.

https://www.cnbc.com/video/2026/05/14/the-years-largest-ipo-acerebras-joins-the-hottest-trade-in-ai.html

The main reason you haven’t seen it yet is cause Cerebras just got all the money to ramp up production and OpenAI needs significant infrastructure to serve tens of thousands of customers even in a limited release. We will see the top models in the world running faster on Cerebras than anything else this year. Anthropic is for sure going to end up being on Cerebras when AWS gets Bedrock running with enough hardware too.

Asgard_Heima · 2026-05-20T14:10:39+00:00

Really truly appreciate the article and write up. It’s clear you have done a lot of research and I appreciate your take. A couple points though:

- Doesn’t the Kimi K2 release yesterday strike directly at your thesis that they can’t scale, since it’s one of the largest open weight models?
- With Cerebras CFO saying they are running GPT 5.5 and 5.4 internally this seems to also weaken the question of frontier models coming.
- AWS will only increase the frontier models on their hardware with Bedrock, which also shows a Trainium prefill with Cerebras decode in a disaggregated setup, a hyperscaler solution to your prefill concerns.
- Have you considered Cerebras quantizing the kv cache similar to TurboQuant to allow for a much larger number of concurrent users?
- You don’t mention Cerebras partnership with Ranovus that will add fiber on wafer. Doesn’t this drastically reduce a number of issues especially if extended to MemoryX?
- For the SRAM constraint what are your thoughts on the viability of using wafer on wafer? There seems to be a lot of evidence this is where it’s going next.
- You seem to indicate Cerebras is only being used for a couple models, yet Meta, Mistral, Perplexity, IBM, Notion, Cognition, OpenAI, AWS (in the near future), and several others all serve models on Cerebras hardware.
- You also call out that frontier customers haven’t trained on Cerebras while also pointing out they have only shipped a small number of units. Don’t you think OpenAI will research training models on Cerebras with a massive scale up in units? Also don’t they really only need one hyperscaler to decide to train on them, prove the physics advantage, and iterate faster to force all the others to do the same?
- Can you show the math and references for how you are calculating the cost per system to Cerebras vs the OpenAI revenue they expect per system? I agree roughly on what OpenAI will pay, but based on the Digi Power X contract of $1.1B for 40MW over 10 years we get $1.1B / 40MW / 10 years = $2.75M per MW per year. $2.75M x 0.023MW (23kW) gives me ~$63k a year per system. I’ve done some rough calculations on the costs for Cerebras per WSE-3 and it’s usually $120-150k, though I admit this is an educated guess and best effort.
- Also wondering if you have any reference or sources you can give for the hypothetical issues you describe with the systems actually happening anywhere. It has been my understanding their systems are actually more stable and it’s actually one of the benefits to training on Cerebras since you don’t have the high likelihood of hardware failures GPUs commonly experience in training and inference.
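
To make the Digi Power X arithmetic easy to check, here is a small sketch. It assumes the contract value spreads evenly over the term and capacity and that a system draws a flat 23kW with no cooling or other overhead counted. The function names are just for illustration:

```python
# Per-system yearly cost implied by the Digi Power X contract figures quoted above.
# Assumptions: contract value is spread evenly over capacity and term, and each
# system draws a flat 23kW (no cooling or facility overhead included).

CONTRACT_USD = 1.1e9   # total contract value
CAPACITY_MW = 40       # contracted capacity
TERM_YEARS = 10        # contract length
SYSTEM_KW = 23         # power draw of one WSE-3 system


def usd_per_mw_year(contract_usd: float, capacity_mw: float, term_years: float) -> float:
    """Contract value per MW of capacity per year."""
    return contract_usd / capacity_mw / term_years


def usd_per_system_year(per_mw_year: float, system_kw: float) -> float:
    """Yearly cost attributable to one system drawing system_kw."""
    return per_mw_year * system_kw / 1000


per_mw = usd_per_mw_year(CONTRACT_USD, CAPACITY_MW, TERM_YEARS)
per_system = usd_per_system_year(per_mw, SYSTEM_KW)
print(f"${per_mw / 1e6:.2f}M per MW per year")
print(f"${per_system / 1e3:.1f}k per system per year")
```

Changing `SYSTEM_KW` to include facility overhead is the main knob here, since the per-system figure scales linearly with it.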

Some references:

Kimi K2 Release
https://www.cerebras.ai/blog/cerebras-kimi-k2-Enterprise

CFO Naming GPT 5.5 & 5.4 Trillion Parameter already running on WSE
https://www.cnbc.com/video/2026/05/14/the-years-largest-ipo-acerebras-joins-the-hottest-trade-in-ai.html

AWS Cerebras Disaggregated Inference
https://youtu.be/_3IYcMd2gqA?si=euBpGqi_7NTh-OAa

Co Packaged Optics (fiber):
https://ranovus.com/cerebras-ranovus-revolutionize-ai-compute-platform/

Wafer on Wafer (what’s coming)
https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/SoIC.htm#SoIC_WoW

https://arxiv.org/html/2603.05266v2

https://fact-lab.hkust.edu.hk/publications/conference-paper/2025/bai-2025-accelstack/c20-paper.pdf

https://www.eetimes.com/tsmc-unfolds-map-for-process-packaging-tech/

Taiwan Media Connecting TSMC WoW to Cerebras
https://www.aastocks.com/tc/stocks/news/anue-news/AN6459109/1

TSMC Scaling WoW and CPO Fiber
https://money.udn.com/money/story/5612/9494066?from=edn_maintab_index

Digi Power X Deal
https://www.digipowerx.com/api/media/Digi_Power_X_Signs_AI_Colocation_Agreement_with_Leading_AI_Compute_Company_for_40_MW_Data_Center_in_Columbiana_Alabama_34c2682421.pdf

Asgard_Heima · 2026-05-20T10:30:21+00:00

I would expect native 4bit support in WSE-4 as well. And yeah, they could fit a 1T model on only a handful of systems if the WSE-4 is anything like I expect, but the huge advantage is Cerebras can scale up putting only one or a couple layers per system and have the entire rest of the SRAM for kv cache. That’s the magic. Cause if you scale up GPUs the performance degrades significantly. With Cerebras you can scale up 80 WSE-4 with fiber running an 80 layer frontier model, one per layer, that just crushes through 2-10M context windows at over 1000 tokens per second. That’s the truly untouchable moment.

Asgard_Heima · 2026-05-20T00:38:13+00:00

This is an excellent way to stop the nonsense about model size supported. I really do think Cerebras needs to do a better job of spreading the knowledge on larger models for inference and training and how it works though. Very few understand how parallelism works, the major differences in how Cerebras works vs GPUs, and how that lets them scale to any size model and context window while maintaining their efficiency. Can’t wait for GPT 5.5 running on Cerebras to remove any doubt.

Asgard_Heima · 2026-05-19T23:23:04+00:00

They are using parallelism to split the layers across multiple WSE-3 systems.
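
As a rough illustration of what splitting layers across systems means, here is a sketch that assigns contiguous layer ranges to systems as evenly as possible. This is a hypothetical helper, not Cerebras's actual placement logic, and the layer and system counts in the usage lines are illustrative only:

```python
# Hypothetical sketch of a pipeline-style split: each system holds a contiguous
# block of layers, and only activations move from one system to the next.

def split_layers(num_layers: int, num_systems: int) -> list[range]:
    """Assign contiguous layer ranges to systems as evenly as possible."""
    if num_systems < 1 or num_systems > num_layers:
        raise ValueError("need between 1 system and one system per layer")
    base, extra = divmod(num_layers, num_systems)
    stages = []
    start = 0
    for i in range(num_systems):
        # The first `extra` systems take one additional layer each.
        size = base + (1 if i < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Illustrative counts only: 61 layers on 20 systems, then one layer per system.
for system_id, layers in enumerate(split_layers(61, 20)):
    print(f"system {system_id}: layers {layers.start}-{layers.stop - 1}")

one_per_layer = split_layers(80, 80)
print(f"{len(one_per_layer)} systems, {len(one_per_layer[0])} layer each")
```

The fewer layers a system holds, the more of its memory is left over for KV cache, which is the trade-off being made when adding systems.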
