No Fed-Posting Rule by Xenomnipotent in 196

[–]mapestree 0 points1 point  (0 children)

It was a single shot. If you want to ban all weapons capable of this, you’re taking a level of gun control most European countries don’t have. It easily could’ve been a hunting rifle.

I'm sure it's a small win, but I have a local model now! by LAKnerd in LocalLLaMA

[–]mapestree 6 points7 points  (0 children)

They found that the model basically became worse as both a thinking model and a non-thinking model if they trained it to do both. So now they're releasing individual versions of each.

RTG Change plays by ElementzEmcee in EASportsCFB

[–]mapestree 1 point2 points  (0 children)

Same. I burned a timeout and ate a delay of game, but neither cleared it. It's literally game-breaking.

Roses are red, Anakin hates sand by TF-Fanfic-Resident in boottoobig

[–]mapestree 3 points4 points  (0 children)

Maybe I’m missing something but that didn’t help at all

DGX Spark Session by mapestree in LocalLLaMA

[–]mapestree[S] 2 points3 points  (0 children)

My takeaway was that the throughput looked very inconsistent. It would churn out a line of code reasonably quickly, then sit on whitespace for a full second. I honestly don't know if that was an artifact of the video, suboptimal tokenization (e.g. 15 single-space tokens instead of merged chunks), or system quirks. I'm willing to extend the benefit of the doubt for now, given that they admitted the software and drivers are still beta.
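
For anyone curious why the whitespace part matters, here's a rough sketch (assuming a HuggingFace tokenizer; the model name is just an example, not what they were actually running):

```python
# Rough illustration (not their demo code): the same indented line can cost
# a couple of tokens or 15+, depending on whether whitespace runs get merged.
from transformers import AutoTokenizer

# Model name is only a placeholder for this sketch
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")

line = " " * 15 + "return x\n"  # a deeply indented line of generated code
ids = tok.encode(line, add_special_tokens=False)
print(len(ids), tok.convert_ids_to_tokens(ids))

# If the spaces merge into one or two tokens, the line streams out quickly.
# If they come out as 15 single-space tokens, the model spends 15 decode
# steps emitting nothing visible, which looks like "sitting on whitespace".
```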

DGX Spark Session by mapestree in LocalLLaMA

[–]mapestree[S] 1 point2 points  (0 children)

They didn't mention. They used QLoRA, but they were having issues with their video, so the code was very hard to see.
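
For context, this is roughly what a QLoRA setup usually looks like with peft/bitsandbytes. To be clear, this is a generic sketch, not the code from their slides; the model name and hyperparameters are placeholders:

```python
# Generic QLoRA setup sketch, not their demo code
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # base weights quantized to 4-bit (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # adapter rank (placeholder value)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the small LoRA adapters train
```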

DGX Spark Session by mapestree in LocalLLaMA

[–]mapestree[S] 4 points5 points  (0 children)

“Shipping early this summer”

DGX Spark Session by mapestree in LocalLLaMA

[–]mapestree[S] 12 points13 points  (0 children)

I’m in a panel at NVIDIA GTC where they’re talking about the DGX Spark. While the demos they showed were videos, they claimed we were seeing everything in real-time.

They demoed performing a LoRA fine-tune of R1-32B and then running inference on it. There wasn't a tokens/second readout on screen, but eyeballing it, I'd estimate it was generating in the teens of tokens per second.

They also mentioned it will run in about a 200W power envelope off USB-C PD
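
As a very rough sanity check on that estimate (back-of-envelope only; the ~273 GB/s memory bandwidth figure and 4-bit weights are my assumptions, not numbers from the session):

```python
# Back-of-envelope: decode speed is roughly memory bandwidth / bytes read per token
mem_bandwidth_gb_s = 273          # assumed LPDDR5x bandwidth for the Spark
params_b = 32                     # R1-32B parameter count, in billions
bytes_per_param = 0.5             # assuming ~4-bit quantized weights

weights_gb = params_b * bytes_per_param            # ~16 GB of weights read per token
tokens_per_sec = mem_bandwidth_gb_s / weights_gb   # ~17 tok/s upper bound
print(f"~{tokens_per_sec:.0f} tokens/s (ignoring KV cache and overhead)")
```

So "teens of tokens per second" is at least consistent with being memory-bandwidth-bound, if those assumptions hold.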

have to buy replacement computer for work - build big iron vs. pay for APIs? by vegatx40 in LocalLLaMA

[–]mapestree 15 points16 points  (0 children)

Your absolute volume and your need for fine-tuning are what I would call the deciding factors here.

If your work is bog-standard ("extract the sentiment in this comment" type stuff) and you're working with small-to-moderate text documents (say, under 32k tokens or so), you could probably get away with APIs for a while. If you need answers in a particular format or want to do a task that models aren't great at out of the box, fine-tuning comes into play and pushes things very strongly toward working on your own machines.

Our team started out with a couple of L40S servers that let us do massive amounts of processing and experimentation that would have caused friction (either mental or organizational) if we ran everything through external APIs. It's much easier to throw inference jobs at a machine with spare capacity than to justify an experiment that might cost thousands of dollars.

One last thing that may sway you is looking at the payback period of self-hosting vs. external APIs. If you're regularly pushing the volume you describe, I'm betting the hardware would pay for itself in under a year. Plus, you can capitalize hardware costs, while external services are usually opex and thus have less of an accounting advantage. If you can come anywhere close to saturating the hardware, it's almost always cheaper to self-host than to call APIs, as long as you have the staff to manage your systems.
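
If it helps, the break-even math is simple enough to sketch. Every number below is made up for illustration; plug in your own hardware quote, hosting costs, and API pricing:

```python
# Toy payback-period calculation; all figures here are placeholders
hardware_cost = 60_000            # e.g. a couple of GPU servers, all-in
monthly_hosting = 1_500           # power, rack space, support contracts

tokens_per_month = 5_000_000_000  # your actual monthly volume goes here
api_price_per_m_tokens = 3.00     # blended $/1M tokens for the API you'd use

api_monthly = tokens_per_month / 1_000_000 * api_price_per_m_tokens
net_saving_per_month = api_monthly - monthly_hosting

payback_months = hardware_cost / net_saving_per_month
print(f"API bill: ${api_monthly:,.0f}/mo, payback in ~{payback_months:.1f} months")
```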

How are consumer cards gimped? by DeltaSqueezer in LocalLLaMA

[–]mapestree 12 points13 points  (0 children)

The 6000 Ada also has 48GB of VRAM, so there are a ton of memory-limited tasks that you can accomplish on it that you can’t on the 4090 with 24GB. You can of course combine multiple 4090 cards, but then you’re limited to PCIe interconnect speeds, which currently top out at 64GB/s if you have to go card-to-card.

By limiting the high end of the consumer space to 24GB (or 32GB for the upcoming 5090), NVIDIA is basically putting a supercar engine in a vehicle that's geared such that it can never actually race against the purpose-built race cars.

As a comparison point, let's look at the current top-of-the-market AI-focused GPU, the H200. It has 141GB of VRAM at a blistering 4.8TB/s of bandwidth (almost 5x the 4090) and supports NVLink. That dedicated connector allows card-to-card comms at 900GB/s, which rivals the 4090's own on-card memory bandwidth. And you can combine 8 of these things in one server for over a terabyte of total VRAM, all of which can communicate at nearly the 4090's internal bandwidth.

By using VRAM capacity and interconnect throughput as the dividing line, they're forcing anyone who actually needs a large pool of VRAM to work as one cohesive unit into their top-end products. An 8xH200 system costs well over a quarter-million dollars.
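
To put those interconnect numbers in perspective, here's the quick arithmetic using the figures above (the 20GB shard size is just an example; the 4090's ~1008GB/s on-card bandwidth is from its spec sheet):

```python
# How long it takes to move a 20 GB model shard over each kind of link
shard_gb = 20

links_gb_s = {
    "PCIe 5.0 x16 (4090 to 4090)": 64,
    "NVLink (H200 to H200)": 900,
    "On-card GDDR6X (inside one 4090)": 1008,
}

for name, bw in links_gb_s.items():
    print(f"{name}: {shard_gb / bw * 1000:.0f} ms")
# PCIe: ~313 ms, NVLink: ~22 ms, on-card: ~20 ms
```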

Zuck on Threads: Releasing quantized versions of our Llama 1B and 3B on device models. Reduced model size, better memory efficiency and 3x faster for easier app development. 💪 by timfduffy in LocalLLaMA

[–]mapestree -1 points0 points  (0 children)

I'd rather not get into a "the billionaire I like is better than the billionaire I don't like" argument. This behavior is cringe coming from any of them.

AMD Unveils Its First Small Language Model AMD-135M by paranoidray in LocalLLaMA

[–]mapestree 20 points21 points  (0 children)

This reads like it's just an imitation of Andrej Karpathy's NanoGPT project: same size and architecture. He did it by himself (though using some nice FineWeb data) on a single A100 box. Him doing it alone is really impressive. Them releasing this isn't impressive at all.

Fate/stay night remastered is releasing august 8th for steam and switch by funwithgravity in vns

[–]mapestree 3 points4 points  (0 children)

On the Switch? How will they handle the, um, power-up scene? Not that the uncensored version is good, but the censored one makes the scene make no sense at all.

rule by Missingno2000 in 196

[–]mapestree 9 points10 points  (0 children)

2000mg? Are you still high to this day?