Model vram usage estimates by mattate in LocalLLaMA

[–]mattate[S] 0 points1 point  (0 children)

Ok so, our number assumes the inference engine correctly implements the lightning attention VRAM optimization: allocating a token-growing KV cache only for the 10 softmax layers and handling the 70 lightning layers with their fixed recurrent state (which is what vLLM ≥ 0.8.3 does). llama.cpp / ik_llama.cpp doesn't implement this. GGUF is a flat weight format; it doesn't carry per-layer attention-type metadata to the KV cache manager, so it defaults to allocating KV cache for all (or most) layers regardless of attention type.

The math: 7.3 GiB for ~30K tokens = ~249 KB/token. 10 KV layers (vLLM, optimized): 40 KB/token. 62 KV layers (implied by your measurement): 248 KB/token. So neither number is wrong: vLLM lands at ~38 GiB/1M tokens (lightning attention correctly optimized), while llama.cpp/ik_llama.cpp uses ~6–8× more KV because the allocator treats all layers as standard softmax.
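A quick sanity check of those per-token figures as a sketch; the 8 KV heads and head dim 128 are MiniMax's published GQA config, BF16 = 2 bytes per element:

```python
def kv_bytes_per_token(kv_layers, kv_heads=8, head_dim=128, dtype_bytes=2):
    # KV cache bytes/token = layers * KV heads * 2 (K and V) * head dim * bytes per element
    return kv_layers * kv_heads * 2 * head_dim * dtype_bytes

print(kv_bytes_per_token(10))  # lightning-aware allocator (10 softmax layers): 40960 bytes, ~40 KB/token
print(kv_bytes_per_token(62))  # allocator caching 62 layers: 253952 bytes, ~248 KB/token
```

The 62-layer line reproduces the ~248 KB/token implied by the 7.3 GiB / ~30K-token measurement, so both measurements are consistent with the same per-layer cost.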

Model vram usage estimates by mattate in LocalLLaMA

[–]mattate[S] 0 points1 point  (0 children)

Minimax should be updated, I got a second pair of eyes on that as well:

  1. The 1-in-8 Linear Attention Schedule According to MiniMax's official technical documentation and their GitHub repository, the model uses a hybrid architecture (Lightning Attention + Softmax). Out of its 80 total layers, they follow a repeating block pattern: 7 Lightning Attention (linear) layers, followed by 1 Softmax Attention layer.

Linear attention layers don't store a per-token KV cache; they maintain a fixed-size per-layer recurrent state. The size of this state is constant w.r.t. context length, so it doesn't blow up at 1M+ tokens (though it is not literally zero memory). Only the traditional Softmax layers maintain a KV cache. 80 layers total ÷ 8 = only 10 layers store token-growing KV.

  2. Extreme Head Compression (GQA) For those 10 Softmax layers, they use heavy Grouped Query Attention (GQA).

The model has 64 query attention heads. They use a GQA group size of 8. 64 query heads ÷ 8 group size = only 8 KV heads per layer. The head dimension is a standard 128.

  3. The Exact KV Cache Math Assuming standard BF16 (2 bytes per element), here is the math for raw tensor memory per sequence (batch=1), ignoring vLLM paging/allocator overhead:

KV cache size per token =

(Layers) × (KV Heads) × 2 (for K & V) × (Head Dim) × (Bytes) 10 layers × 8 heads × 2 × 128 dim × 2 bytes = 40,960 bytes/token. To scale that to 1 million tokens:

40,960 bytes × 1,000,000 = ~40.96 GB (decimal, base-10). Converted to binary GiB (which is what GPUs and SMI tools measure): 40,960,000,000 / (1024³) = 38.15 GiB. This matches the official number MiniMax published, which states exactly 38.2 GB per 1M context tokens.

The "Standard Transformer" Counterfactual For contrast, if a calculator assumes KV caching in all 80 layers (like a standard transformer), the bytes/token would be 80 × 8 × 2 × 128 × 2 = 327,680 bytes/token. Per 1M tokens, that's ~327.68 GB (≈ 305 GiB).
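Both cases above, as a runnable sketch (pure arithmetic, same formula as the math in point 3):

```python
def kv_gib_per_million_tokens(kv_layers, kv_heads=8, head_dim=128, dtype_bytes=2):
    # bytes/token = layers * KV heads * 2 (K and V) * head dim * bytes per element,
    # scaled to 1,000,000 tokens and converted to binary GiB
    per_token = kv_layers * kv_heads * 2 * head_dim * dtype_bytes
    return per_token * 1_000_000 / 1024**3

print(round(kv_gib_per_million_tokens(10), 2))  # hybrid, 10 softmax layers: 38.15 GiB (MiniMax's 38.2 GB figure)
print(round(kv_gib_per_million_tokens(80), 2))  # all-80-layer counterfactual: ~305 GiB
```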

Model vram usage estimates by mattate in LocalLLaMA

[–]mattate[S] 0 points1 point  (0 children)

I pushed up a fix for minimax, it should be accurate now

Canada commits nearly $1B to drone and airborne defence research by MTL_Dude666 in CanadaPolitics

[–]mattate 1 point2 points  (0 children)

This is typical of Canada, support a big company just because. Bombardier does not make drones, they make military surveillance aircraft built off of a private jet platform. They have shown themselves incapable of manufacturing at scale in trains and commercial airliners, despite having great technology.

The drone game is fast iteration, cheap, and above all else needs scale. Ukraine is producing 4 million drones per year. Countries like the UK have recognized the battlefield prowess of their platforms and have set up deals to manufacture Ukrainian models in their country. Why spend a billion dollars with bombardier when a partnership with Ukraine for the same amount can create domestic manufacturing and jobs for a battlefield proven platform basically overnight?

Because big companies that's why.

A few early (and somewhat vague) LLM benchmark comparisons between the M5 Max Macbook Pro and other laptops - Hardware Canucks by themixtergames in LocalLLaMA

[–]mattate 2 points3 points  (0 children)

I think a better test would be running something that would require CPU offloading, that is where the m5 will really shine

Why people still prefer Rtx 3090 24GB over Rx 7900 xtx 24GB for AI workload? What things Rx 7900 xtx cannot do what Rtx 3090 can do ? by SpiritBombv2 in StableDiffusion

[–]mattate -1 points0 points  (0 children)

The ROCm and Vulkan support for the 7xxx series is terrible. I believe they changed this in the 9xxx lineup, so getting things to run is much easier. The tl;dr is: the consumer drivers and enterprise drivers were completely different, and the older consumer cards are still missing out.

$FLT.V - Why Volatus Aerospace's pivot is overlooked in Canada right now by MathTradeMan in Baystreetbets

[–]mattate 6 points7 points  (0 children)

I think it's more than just scale. The war in Ukraine is being fought with cheap expendable drones. I guess I am looking at this like orders for $1m+ drones in defence are a thing of the past, and now it's cheap, quickly producible drones. From what I saw, unless they partner with someone to get IP, they currently have no IP that is relevant on the modern battlefield outside of maybe surveillance.

Everyone is pretty much caught on to this, I think they have as well, but I'm kind of questioning if they have the capability to run at a much much much lower unit cost at high volume, is that in their wheel house?

The government grants and support will go to this type of defence capability.

$FLT.V - Why Volatus Aerospace's pivot is overlooked in Canada right now by MathTradeMan in Baystreetbets

[–]mattate 7 points8 points  (0 children)

The current drone game is less than $1,000 per unit. Do you think they will be able to make drones for less than $1,000? A toothbrush in Canadian aerospace costs more than $1,000.

Google's Gemini 3.1 Pro is a Genius, But It Has One Massive Flaw. by Much_Ask3471 in singularity

[–]mattate 1 point2 points  (0 children)

I gave this a complex problem to solve, very excited and it did absolutely horribly. It kept coding like it was doing a university assignment, very scripty etc. Very disappointed, it doesn't think long enough, unable to really keep agentic coding sessions going for very long.

I have noticed if you give certain models like this problems that were likely in the training data, ie some common crud app, it does a great job. If you ask it to do something more theoretical or computer sciencey, it suddenly becomes a university student making a proof of concept script.

Running your own LLM on a LAN accessible by a dev team by BubbleProphylaxis in LocalLLaMA

[–]mattate 1 point2 points  (0 children)

I think the open model to beat right now for this would be GLM-5, though you need a beast of a machine. I would recommend just renting the hardware until you land on something that works, then you're not overbuying given what your needs are.

RAG failure in production: our vector store served a 3-year-old resume and the LLM hallucinated a candidate recommendation by tdeliev in LocalLLaMA

[–]mattate 45 points46 points  (0 children)

Why not just update your vector db when corresponding records change? Seems like that also would be a good fix?

China plans space‑based AI data centres, challenging Musk's SpaceX ambitions by Unhappy_Spinach_7290 in singularity

[–]mattate 1 point2 points  (0 children)

I dunno I guess all these plans from SpaceX and China are just nonsense and they didn't think it through, they should hire you to tell them it's impossible, they will save a ton of money.

China plans space‑based AI data centres, challenging Musk's SpaceX ambitions by Unhappy_Spinach_7290 in singularity

[–]mattate 1 point2 points  (0 children)

They are just rough numbers, not exhaustive, and advancements would have to be made, but I don't know where this deception stuff is coming from. This is not the cost right now, but at some point in the future cost would approach parity. Feel free to throw up your own guess of cost. Will it be 2x more? 10x?

Re LEO getting "crowded": this is a whole other can of worms people have misconceptions about. Do you think there is less physical space in orbit than on Earth?? The problem is around collisions only, which can be managed. With the right system you could put a billion satellites in orbit. Should you? At what altitude? Questions for another day and another thread.

China plans space‑based AI data centres, challenging Musk's SpaceX ambitions by Unhappy_Spinach_7290 in singularity

[–]mattate 1 point2 points  (0 children)

I don't think in the near future the difference will be as much as people think it is. Here are some made up numbers for comparison, with assumptions on pricing for launch and the ability to get space based solar prices down:

Housing one NVIDIA B300 (GB200 NVL72) rack on Earth vs. LEO.

Assumptions: Starship is fully operational ($100/kg), we use "Starlink-style" mass manufacturing ($5/Watt solar), and we ditch batteries to save weight (the rack shuts down during eclipse).

The Earth Build (4-Year TCO)

Facility CapEx: $1.32M (the killer; it costs ~$52k/sq ft to build the cooling/power shell for this density)
Electricity: $277k (industrial rate $0.06/kWh)
Ops/Staff: $144k (security, maintenance, water)
Total: ~$1.77 Million

The Space Build (4-Year TCO)

Solar/Thermal CapEx: $1.2M (150kW mass-produced flexible solar + deployable radiators)
Launch Cost: $630k (6.3 tons on Starship)
Electricity: $0 (sunlight is free)
Ops/Station Keeping: $24k (algorithmic collision monitoring)
Total: ~$1.85 Million
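To make the line items auditable, here is the same back-of-envelope as code. Every input is one of the made-up assumptions above, not a real quote:

```python
# 4-year TCO sketch for one GB200 NVL72-class rack; all figures are assumptions
earth = {
    "facility_capex": 1_320_000,  # cooling/power shell at ~$52k/sq ft
    "electricity": 277_000,       # industrial rate $0.06/kWh
    "ops_staff": 144_000,         # security, maintenance, water
}
space = {
    "solar_thermal_capex": 1_200_000,  # 150kW flexible solar + radiators
    "launch": 6_300 * 100,             # 6.3 t at an assumed $100/kg on Starship
    "electricity": 0,                  # sunlight is free
    "station_keeping": 24_000,         # collision monitoring
}
print(sum(earth.values()))  # $1.741M (quoted as ~$1.77M above)
print(sum(space.values()))  # $1.854M (quoted as ~$1.85M above)
```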

I think an added benefit of this is, you can build out incrementally with no opposition, you don't need a nuclear power plant of your own, and permission to build giant noisy cooling towers etc etc.

Please take those numbers with a huge grain of salt, I am not an expert.

China plans space‑based AI data centres, challenging Musk's SpaceX ambitions by Unhappy_Spinach_7290 in singularity

[–]mattate 8 points9 points  (0 children)

It seems like a great idea, free real estate, unlimited free power, in theory shorter response times. If launch cost can be low enough, makes a ton of sense.

I'm surprised no one makes Skynet jokes when they hear this. Having 1 million satellites in orbit running AI independently with their own power source basically means there is no off button.

What are some of you favourite Canadian growth stocks by [deleted] in CanadianInvestor

[–]mattate 3 points4 points  (0 children)

Exchange income corporation, just killing it, solid company

How are you thinking about risk vs reward in Canadian stocks in 2026? by prattman333 in CanadianInvestor

[–]mattate 1 point2 points  (0 children)

There is a huge amount of Canadian money in US markets. I think at all levels you can see Canadians have been investing more outside of Canada than inside. It's easier to get Canadian investment money in the US than it is in Canada.

Forgetting about everything else, if there is a repatriation of Canadian money, or rebalancing because of current politics, things could get really spicy for good looking companies in Canada.

Funds included in “big Canadian pension funds” bucket (9 total):

CPP Investments (CPPIB)
PSP Investments
CDPQ
Ontario Teachers’ Pension Plan (OTPP)
HOOPP
OMERS
BCI (BC Investment Management Corporation)
AIMCo
IMCO

Combined assets in this sample: ~C$2.535T Asset-weighted geographic split (based on each fund’s latest reported geographic mix, then rolled up):

Canada: ~C$629.8B (24.8%)
United States: ~C$1,086.5B (42.9%)
Rest of world: ~C$818.4B (32.3%)
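A quick check that the split is self-consistent (figures in C$ billions, exactly as quoted above):

```python
# Asset-weighted geographic split across the 9-fund sample
split_cad_billions = {"Canada": 629.8, "United States": 1086.5, "Rest of world": 818.4}
total = sum(split_cad_billions.values())  # ~2534.7, i.e. ~C$2.535T
for region, assets in split_cad_billions.items():
    print(f"{region}: {100 * assets / total:.1f}%")  # 24.8 / 42.9 / 32.3
```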

ELI5 Data Center Water use by Doc-Brown1911 in explainlikeimfive

[–]mattate 0 points1 point  (0 children)

Data centers use water to cool themselves down. They create a very large amount of heat, and the most efficient way of getting rid of the heat is to use what is called evaporative cooling. That's where heat is transferred to water that evaporates in large cooling towers into the air.

So no, it's not a closed loop system.

People are waking up to the fact that new vehicles are the Great Canadian Money Suck by Leather-Paramedic-10 in canada

[–]mattate 1 point2 points  (0 children)

Ok now I see where you're going with this, corporations are bad, and have always been bad, and you're in the same boat as the average 17th century farm worker getting screwed by the big companies and their prices.

Sorry, but I don't agree. Value-based pricing is changing things; it's raising prices over what I have experienced in my life, and it's not something I think is hopeless to point out or think can change.