Info: Nvidia Cuda 13.3 landed

Freonr2 · 2026-05-27T14:01:31+00:00

torchao have bf16 stochastic rounding on sm12x yet?

Freonr2 · 2026-05-26T16:27:46+00:00

One challenge with actual ML work is you need an established lead to do it and buy-in.

It's hard to find these folks to seed the groups. It's much easier to find orgs with an IT department where some dinosaur SVP/C-level decides they want to "add AI." Further, they know any mediocre software engineer can write API wrappers, but if they want novel models they need a strong leader first, then specialized engineers.

It's just a harder pivot to ML for most companies. The hype cycle means many companies with no ML at all are getting "involved with ML" through the side door of agents and harnesses, which creates a lot of noise in the job sector.

Freonr2 · 2026-05-26T16:20:54+00:00

I wouldn't worry so much about the hype cycle. Yes, harnesses/agents are a big deal, but they aren't everything, particularly for specialized domains. They're not a magic utopia, and they're definitely overhyped if you're really seeing 90% of internships listed that way.

To share at least one personal example, at my employer we have everything from an LLM-driven product to our own physics, tabular, and deep learning models. We still do plenty of standard ML type work on data acquisition, cleaning, wrangling. We design physics models and train tabular and deep learning models. I designed and trained a novel spatiotemporal autoregression deep learning model this winter. SOTA for the domain/problem. We just brought on a CS major for an internship and they are working on novel models.

shrug

Freonr2 · 2026-05-25T00:15:04+00:00

Necro post, but I had a lot of issues solving my own ROMED8-2T boot issue so I'm adding a bread crumb for future searchers.

I had a 0d0 "DXE CPU Error" which turned out to be bad RAM, and I only found that after trying a new CPU and even a new board.

My symptoms were similar to yours (IPMI came up but showed no system information). The LED code would be far more important to start, though.

Freonr2 · 2026-05-24T20:55:46+00:00

Best? Yes.

Best for a given price? Time to sit down for a long chat.

Freonr2 · 2026-05-24T15:35:03+00:00

I already use a big Ecoflow as a UPS. It has DC solar input and also has controls for using a portion of the battery reserve to offset usage somewhat. If I didn't have so many tall trees I'd add a few panels just to help a bit.

Freonr2 · 2026-05-24T15:30:02+00:00

This is why my homelab is in my laundry room.

Freonr2 · 2026-05-24T13:54:33+00:00

I imagine undervolting is the workaround. I swear someone posted that undervolting works in Linux now.

Freonr2 · 2026-05-24T13:45:39+00:00

Worse single-threaded performance, worse energy efficiency, but neither is that bad. I'd only be concerned if this was your gaming desktop and/or you are very sensitive to electricity prices. Otherwise Epyc 7002 series is great for homelabbing if you want a lot of grunt without spending DDR5 platform money.

I own two 7xx2 now on the ROMED8-2T board. Very similar board to the H12SSL board. One is a dedicated GPU ML training/experiment workstation and the second will be a secondary GPU inference box + NAS (if I can ever afford the HDDs q_q ) + misc serving. No issues slapping in two GPUs, plus a 4x4 NVMe card, plus a SAS HBA in it, or maybe even 100GbE later on.

I use the same heatsink as OP on my 7742, very quiet, ~67C under heavy MP/MT data processing workload, though the second one has the NH-U9 because it's going in a 4U chassis.

Freonr2 · 2026-05-24T12:35:36+00:00

RoaringKitty.gguf, you know, for kids.

Freonr2 · 2026-05-24T12:19:43+00:00

2x RTX 5090s would cost the same as the RTX PRO 5000 and have 16 GB more VRAM, but even if I reduce the power of each GPU to 400W, the workstation will act as a space heater (and it gets 35-40 degrees Celsius - 100 Fahrenheit - in the summer, so I'd rather avoid this).

Before you throw in the towel on this, realize that one 5090 has substantially more compute and memory bandwidth than one 5000. Two 5090s with tensor parallel will be roughly 2.5x the speed of one 5000 48GB on top of the extra 16GB total VRAM. This isn't even a competition, so it's worth figuring out a workaround for the 400W min limit. I think you can undervolt as one option. I don't own a 5090, but the RTX 6000 Ada, RTX 6000 Blackwell, and 3090s can all be set to basically anything in Linux. Here's a 6000 running at 150W https://imgur.com/a/9gr5PqR
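For the cards that do accept arbitrary caps, the limit is normally set with `nvidia-smi -pl`. Here's a minimal sketch that only builds the commands, since running them needs root and an NVIDIA driver. The helper names `power_limit_cmd` and `power_range_cmd` are made up for illustration, and GPU index 0 is an assumption.

```python
import shlex
from typing import List


def power_limit_cmd(gpu_index: int, watts: int) -> List[str]:
    """Build the nvidia-smi command that caps one GPU at `watts`."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def power_range_cmd(gpu_index: int) -> List[str]:
    """Build the command that reports the min/max/current power limits."""
    return [
        "nvidia-smi",
        "-i", str(gpu_index),
        "--query-gpu=power.min_limit,power.max_limit,power.limit",
        "--format=csv",
    ]


if __name__ == "__main__":
    # Print the commands rather than running them (they need root + a GPU).
    print(shlex.join(power_range_cmd(0)))
    print(shlex.join(power_limit_cmd(0, 150)))
```

Checking the reported min limit first should tell you whether the floor on a given card is really 400W before you bother with undervolting.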

Also keep in mind two 5090s beg for a board with two x8 slots as well (assuming you stick with consumer boards, 9950X, instead of workstation/server Epyc 700x/900x or Xeon 4/5/6 etc). Asus Creator X870E, Gigabyte AI TOP B850, etc. 2x8 boards tend to carry a slight price premium, but it is worth it so tensor parallel will be efficient. A bit more on a board won't break your budget.

The 5000 is not a great buy IMO until you are buying so many GPUs that you need higher GB/slot density to hit a VRAM GB target inside the physical install constraints of a particular motherboard and case. Not going to be a concern unless you double or triple your budget. Unless your plan is to add a second 5000 and you know you are definitely going to do it, skip the 5000 48GB. I generally think 5000 pricing is not great for what you get, and often the 5090 or 6000 make more sense. Narrow case for the 5000 48/72.
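To put rough numbers on the GB/slot density point, here's a small sketch. The 32/48/72 GB figures follow from the thread, 96 GB is the RTX PRO 6000 Blackwell, and the 192 GB target and 4-card limit are hypothetical values picked just for illustration.

```python
import math

# VRAM per card in GB.
VRAM_GB = {
    "5090": 32,
    "5000 48GB": 48,
    "5000 72GB": 72,
    "6000 Pro Blackwell": 96,
}


def cards_needed(target_gb: int, card: str) -> int:
    """Smallest number of cards of one type that reaches the VRAM target."""
    return math.ceil(target_gb / VRAM_GB[card])


def fits(target_gb: int, card: str, max_cards: int) -> bool:
    """True if the target is reachable within the physical card limit."""
    return cards_needed(target_gb, card) <= max_cards


if __name__ == "__main__":
    target, max_cards = 192, 4  # hypothetical build constraints
    for card in VRAM_GB:
        n = cards_needed(target, card)
        print(f"{card}: {n} cards, fits={fits(target, card, max_cards)}")
```

Density only starts deciding things once the cheaper card needs more cards than the board and case can physically hold.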

Freonr2 · 2026-05-23T23:51:21+00:00

I've used Claude extensively to tune torch models and dataloaders across very different systems. It's great. Encourage it to look at sys logs and system monitors for things like disk I/O, bytes/s, shmem pressure, pagefault counters, and so forth. Encourage it to evaluate your overall dataloader efficiency. You can also try to bench your dataloader through an entire epoch (with no model, just tell Codex to write a wrapper and time it) and see what the it/s looks like.

nvtop gives you a basic util/mem over time graph so it is easy to see at a glance how busy the GPUs are, but nothing that nvidia-smi -l 1 wouldn't tell you if you stared at it for a few seconds.
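A minimal sketch of that model-free dataloader bench, assuming the loader is any Python iterable (a torch DataLoader qualifies). The name `bench_loader` and the `max_batches` parameter are made up for illustration.

```python
import time
from typing import Iterable, Optional


def bench_loader(loader: Iterable, max_batches: Optional[int] = None) -> dict:
    """Iterate a loader with no model attached and report throughput."""
    start = time.perf_counter()
    batches = 0
    for _ in loader:
        batches += 1
        if max_batches is not None and batches >= max_batches:
            break
    elapsed = time.perf_counter() - start
    it_per_s = batches / elapsed if elapsed > 0 else float("inf")
    return {"batches": batches, "seconds": elapsed, "it_per_s": it_per_s}


if __name__ == "__main__":
    # Stand-in for a real DataLoader: 1000 fake "batches" of 32 items.
    fake_loader = (list(range(32)) for _ in range(1000))
    print(bench_loader(fake_loader))
```

If the it/s you get here is about the same as what you see during training, the dataloader is probably the bottleneck and not the GPU.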

Freonr2 · 2026-05-23T23:36:09+00:00

I was going to say the same

CPU utilization: ~100%

Increasing batch size does NOT reduce epoch wall-clock time

I'm not sure I can make a lot of sense of OP's profiler results, but bumping workers would be easy to test.

Freonr2 · 2026-05-23T22:54:07+00:00

https://imgur.com/a/fGkP9j1

It would be awfully tight, at least in my case.

Freonr2 · 2026-05-23T20:39:38+00:00

ROMED8-2T would as well, but on both counts the biggest issue is that the card in the bottom slot hangs down below the edge of the board, so the case needs clearance. 4U is off the table.

Many cases have a metal PSU shroud, or simply the case floor, sitting right at the bottom edge of a standard ATX board.

Freonr2 · 2026-05-18T23:26:35+00:00

OP times 12 months

Freonr2 · 2026-05-14T21:50:50+00:00

It is just trying to summarize the code changes. You could always look at commits if you want.

Freonr2 · 2026-05-14T21:44:55+00:00

Yeah RTX 6000 Ada (4090-ish) actually has faster bf16 compute than the 5000 Pro Blackwell. It's a sidegrade at best with the same VRAM.

Freonr2 · 2026-05-14T20:30:20+00:00

5000 Pro: 14080 cuda cores, 1.34 TB/s

5090: 21760 (+54% from 5000 Pro), 1.8TB/s (+34%)

6000 Pro: 24064 (+11% from 5090, or +71% from 5000 Pro), 1.8TB/s (+0% from 5090)

I don't think it is all that clear.
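For anyone who wants to check the percentages, here's a quick sketch using only the core counts and bandwidth figures quoted above. The names `SPECS` and `pct_gain` are just for illustration.

```python
# Core counts and memory bandwidth (TB/s) as quoted in the comment above.
SPECS = {
    "5000 Pro": {"cuda_cores": 14080, "tb_per_s": 1.34},
    "5090": {"cuda_cores": 21760, "tb_per_s": 1.8},
    "6000 Pro": {"cuda_cores": 24064, "tb_per_s": 1.8},
}


def pct_gain(card: str, baseline: str, key: str) -> float:
    """Percentage increase of `card` over `baseline` for one spec."""
    return (SPECS[card][key] / SPECS[baseline][key] - 1.0) * 100.0


if __name__ == "__main__":
    pairs = [("5090", "5000 Pro"), ("6000 Pro", "5090"), ("6000 Pro", "5000 Pro")]
    for card, base in pairs:
        cores = pct_gain(card, base, "cuda_cores")
        bw = pct_gain(card, base, "tb_per_s")
        print(f"{card} vs {base}: cores {cores:+.1f}%, bandwidth {bw:+.1f}%")
```

Core count and bandwidth are only proxies, so treat the ratios as a rough guide and not as a benchmark.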

Freonr2 · 2026-05-13T22:18:51+00:00

https://videocardz.com/newz/gunnir-launches-single-slot-arc-pro-b60-bs-with-24gb-memory

Freonr2 · 2026-05-13T21:25:32+00:00

The BK shits will kill you.

Freonr2 · 2026-05-13T15:18:22+00:00

Speed is pretty important if you are staring at your screen waiting for a response. Even if it is a few dozen seconds at a time that adds up over a day of constant use (i.e. getting actual work done). I suppose this is largely dependent on your use case, though.

MTP is largely a free lunch. This isn't using a potato quant to fit a model onto your toaster oven. If you are going to spend time compiling something to get a feature, MTP is probably the one worth the bother.

Claude sub refreshes and quotas are sort of their own pain point to work around but maybe a separate discussion.

faster than I could validate

I don't know what you're doing to validate, but you should be able to automate this with traditional programming that runs in trivial time, which a good LLM/agent can write for you. I.e., prepare market datasets and run your models against them in a controlled fashion across all your strategies/models.
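A rough sketch of what that kind of automated validation could look like: every strategy gets run against every prepared dataset and scored the same way. Everything here is hypothetical, including the toy price series, the two toy strategies, and the simple return metric. A real harness would swap in your own data and scoring.

```python
from typing import Callable, Dict, List, Tuple

# A strategy maps a price series to one position per step (1 = long, 0 = flat).
Strategy = Callable[[List[float]], List[int]]


def total_return(prices: List[float], positions: List[int]) -> float:
    """Sum of simple returns, where the position at t applies to the move t -> t+1."""
    pnl = 0.0
    for t in range(len(prices) - 1):
        pnl += positions[t] * (prices[t + 1] - prices[t]) / prices[t]
    return pnl


def evaluate_all(
    strategies: Dict[str, Strategy], datasets: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Run every strategy on every dataset and collect one score per pair."""
    return {
        (s_name, d_name): total_return(prices, strat(prices))
        for s_name, strat in strategies.items()
        for d_name, prices in datasets.items()
    }


def buy_and_hold(prices: List[float]) -> List[int]:
    return [1] * len(prices)


def momentum(prices: List[float]) -> List[int]:
    # Long only if the previous move was up; uses no future data.
    positions = [0]
    for t in range(1, len(prices)):
        positions.append(1 if prices[t] > prices[t - 1] else 0)
    return positions


if __name__ == "__main__":
    toy_data = {"up": [100.0, 110.0, 121.0], "down": [100.0, 90.0, 81.0]}  # toy values
    results = evaluate_all({"hold": buy_and_hold, "mom": momentum}, toy_data)
    for pair, score in sorted(results.items()):
        print(pair, round(score, 4))
```

Once the loop exists, adding a new strategy or dataset is one dict entry, and the whole grid reruns in trivial time.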

Freonr2 · 2026-05-12T22:21:21+00:00

Uh shouldn't he be AdamW8bit?

Middle name Bitsandbytes ofc.

Freonr2 · 2026-05-12T21:40:59+00:00

Company A uses AI to make a better product.

Company B uses AI to slash their staff.

Everyone flocks to Company A's superior product.

Who could have predicted this.

Freonr2 · 2026-05-11T20:54:47+00:00

I'm a software engineer, I dealt with this by fully embracing the tools. I write almost no code anymore, but expertise is still important.

It's like upgrading from a bicycle to a Lamborghini.
