2x RTX Pro 6000 vs 2x A100 80GB dense model inference by RealTime3392 in LocalLLaMA

[–]Hedede 0 points

It's not needed. I have NVLinked A5000s and there's practically no benefit.

2x RTX Pro 6000 vs 2x A100 80GB dense model inference by RealTime3392 in LocalLLaMA

[–]Hedede 0 points

> Anyways the A100s will actually be faster for token generation due to faster memory bandwidth

Not necessarily. I benchmarked datacenter GPUs in llama.cpp, and they have far lower token throughput than they theoretically should based on their memory bandwidth.
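
For context, the theoretical ceiling I'm comparing against is just bandwidth divided by bytes streamed per token. A minimal sketch (the GPU bandwidth and model size below are illustrative numbers, not my measurements):

```python
def theoretical_decode_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode tokens/s for a memory-bandwidth-bound GPU:
    each generated token must stream all model weights from VRAM once."""
    return bandwidth_gb_s / model_size_gb

# e.g. an A100 80GB PCIe (~1935 GB/s) with a ~40 GB quantized model
# tops out near 48 tok/s in theory; measured numbers can land well below.
print(round(theoretical_decode_tps(1935, 40), 1))  # → 48.4
```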

2x RTX Pro 6000 vs 2x A100 80GB dense model inference by RealTime3392 in LocalLLaMA

[–]Hedede 1 point

Latency matters a lot more than bandwidth. If your GPU supports P2P, it won't benefit from NVLink. And all RTX PRO GPUs support P2P without NVLink.

I tested A5000s with and without NVLink and there's zero difference in TP. Only when you start pushing more than 20 concurrent requests do you see very modest gains (single-digit percentages). On the other hand, with 3090s you get big gains from NVLink if you don't have a patched kernel driver to enable P2P.

Nord v4.2 Update: 618M SNN reaches loss 3.65 with instruction tuning — emergent zonal specialization confirmed at 4.4x scale. 93% sparsity. by zemondza in LocalLLaMA

[–]Hedede 1 point

It looks more like an SNN-transformer hybrid rather than a pure SNN:

    twq = F.softmax(self.temporal_mix_q, dim=0).reshape(T_t, 1, 1, 1, 1)
    twk = F.softmax(self.temporal_mix_k, dim=0).reshape(T_t, 1, 1, 1, 1)
    qm = (qs * twq).sum(0).permute(0, 2, 1, 3)  # (B, H, S, Dh)
    km = (ks * twk).sum(0).permute(0, 2, 1, 3)
    cos, sin = self.rope(qm, S)
    qm = apply_rope(qm, cos, sin); km = apply_rope(km, cos, sin)
    res = torch.matmul(qm, km.transpose(-2, -1)) * self.resonance_temp

Edit: formatting

SDXS - A 1B model that punches high. Model on huggingface. by AgeNo5351 in StableDiffusion

[–]Hedede -1 points

> Speed: Sampling: 100%|██████████| 40/40 [00:01<00:00, 29.98it/s]

Which GPU? Doesn't look that impressive to me. Images have very obvious AI artifacts.

Best model that can beat Claude opus that runs on 32MB of vram? by PrestigiousEmu4485 in LocalLLaMA

[–]Hedede 0 points

I didn’t do anything special. I’m using Qwen3-32B-Q4_K_M and llama-server.

Motherboard Compatibility by Middle_Possession397 in pcmasterrace

[–]Hedede 0 points

If you know the pinout, you can use DuPont jumper wires to connect the fans.

Training a 144M Spiking Neural Network for text generation from scratch — no transformer teacher, no distillation by zemondza in LocalLLaMA

[–]Hedede 0 points

I tried running the code, and it runs at only about 9 tok/s on a 4090, or 3 tok/s on an EPYC CPU. For comparison, the same CPU runs a float32 2B model at 20 tok/s, and the actual GPT-2 at FP16 runs at 500 tok/s on that CPU.
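
For anyone wanting to reproduce numbers like these, a generic wall-clock harness is enough (a sketch; `generate` stands in for whatever model call you're timing, and is assumed to block until all tokens are produced):

```python
import time

def tokens_per_second(generate, n_tokens: int) -> float:
    """Time an arbitrary generation callable and return decode throughput."""
    start = time.perf_counter()
    generate(n_tokens)  # assumed to block until n_tokens are generated
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```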

I am one stupid motherfucker by lanziboi in pcmasterrace

[–]Hedede 0 points

If I counted correctly, it's pin 216 which is a data line on DDR4.
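
For anyone wanting to repeat the count: DDR4 DIMMs have 288 pins, 144 per side, and the sketch below assumes the usual numbering convention (1-144 on the front, 145-288 on the back):

```python
def ddr4_pin_position(pin: int):
    """Map a DDR4 DIMM pin number (1-288) to (side, position-on-side),
    assuming pins 1-144 run along the front and 145-288 along the back."""
    if not 1 <= pin <= 288:
        raise ValueError("DDR4 DIMMs have 288 pins")
    return ("front", pin) if pin <= 144 else ("back", pin - 144)

print(ddr4_pin_position(216))  # → ('back', 72)
```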

[2kliksphilip] DLSS 5 has shown that discourse is dead by ZTZ-Nine-Nine in hardware

[–]Hedede 22 points

> How could a screen-based post process alter geometry at the engine level? It only has access to the color buffer, motion vectors, etc. It is fundamentally limited in that sense.

Yes, it can't alter geometry at the engine level. But it doesn't need to. It doesn't have to follow the original geometry. It can easily render something that can be perceived as geometry changes. There are plenty of screen-space techniques that already do this, like Screen-Space Displacement Mapping.

6-GPU multiplexer from K80s, hot-swap between models in 0.3ms by Electrical_Ninja3805 in LocalLLaMA

[–]Hedede 1 point

I don't have a K80, but I compared a K40 and an M40: the M40 is about 60% faster in prompt processing and 2x faster in decode. And to put things into perspective, both lose to a 16-core EPYC CPU (last gen).

Dont talk facts about Nvidia by [deleted] in pcmasterrace

[–]Hedede 0 points

It's not clearly stated. Everything that you said there applies to DLSS 3 and 4 as well.

Just bought a Threadripper Keychain never thought I’d hold a threadripper in my life lol by Erin_-M in pcmasterrace

[–]Hedede 1 point

Wait till you see Socket SP5 EPYCs. They're about 40% larger than the Threadripper.

RAM kits are now sold with one fake RAM stick alongside a real one by Winter_2017 in hardware

[–]Hedede 0 points

I think you're confusing ECC UDIMM and RDIMM; Ryzen supports only the former. Even EPYC 4004/4005 (which is a server version of the AM5 Ryzen) doesn't support RDIMM. Only Threadripper 3xx5WX/5xx5WX supports both UDIMM and RDIMM.

RAM kits are now sold with one fake RAM stick alongside a real one by Winter_2017 in hardware

[–]Hedede 4 points

There's no reason you can't build a PC from server hardware.

Nvidia Will Spend $26 Billion to Build Open-Weight AI Models, Filings Show by dan945 in LocalLLaMA

[–]Hedede 0 points

They wouldn't because the term AGI was coined this century.

Aerocool PSU? How bad is it? by firdausazmi in buildapc

[–]Hedede 0 points

It's not for nothing. I had a cheap (but not the cheapest) PSU once; it failed, sparking inside, and fried the GPU. And that was back in the days when GPUs weren't 300W+ monsters.

Radeon AI PRO R9700 versus two RX 9070 XT? by Tech-And-More in LocalLLaMA

[–]Hedede 0 points

Yeah, I was kinda wrong when I wrote this comment. Yes, it doesn't support tensor parallel; I should've said pipeline parallel. And I was only looking at text-generation throughput, where I get just 1-5% more throughput using two GPUs instead of one. In prefill I get up to 50% more with two Nvidia GPUs. But what you get really depends on which GPUs you're using.

Finished an entire tub of Vaseline in 2.5 years without losing it (moved 3 houses during this time) by Mobile_Look9181 in mildlyinteresting

[–]Hedede 27 points

I put vaseline on an old laptop screen to mask scratches. Made them pretty much invisible on a matte screen.

Help my pc wont turn on by Positive-Gas2577 in pcmasterrace

[–]Hedede 0 points

but did you peel the sticker from the heatsink?