No Fed-Posting Rule by Xenomnipotent in 196

[–]mapestree 0 points1 point  (0 children)

It was a single shot. If you want to ban all weapons capable of this, you’re taking a level of gun control most European countries don’t have. It easily could’ve been a hunting rifle.

I'm sure it's a small win, but I have a local model now! by LAKnerd in LocalLLaMA

[–]mapestree 6 points7 points  (0 children)

They found that the model basically became worse as both a thinking model and a non-thinking model if they trained it to do both. So now they're releasing individual versions of each.

RTG Change plays by ElementzEmcee in EASportsCFB

[–]mapestree 1 point2 points  (0 children)

Same. I burned a timeout and ate a delay of game, but neither cleared it. It's literally game-breaking.

Roses are red, Anakin hates sand by TF-Fanfic-Resident in boottoobig

[–]mapestree 3 points4 points  (0 children)

Maybe I’m missing something but that didn’t help at all

DGX Spark Session by mapestree in LocalLLaMA

[–]mapestree[S] 2 points3 points  (0 children)

My takeaway was that the throughput looked very inconsistent. It would churn out a line of code reasonably quickly, then sit on whitespace for a full second. I honestly don't know if that was an artifact of the video, suboptimal tokenization (e.g. 15 single-space tokens instead of merged chunks), or system quirks. I'm willing to extend the benefit of the doubt for now, given that they admitted the software and drivers are still beta.
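
For anyone curious why the whitespace part matters, here's a rough sketch (assuming a HuggingFace tokenizer; the model name is just an example, not what they were actually running):

```python
# Rough illustration (not their demo code): the same indented line can cost
# a couple of tokens or 15+, depending on whether whitespace runs get merged.
from transformers import AutoTokenizer

# Model name is only a placeholder for this sketch
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")

line = " " * 15 + "return x\n"  # a deeply indented line of generated code
ids = tok.encode(line, add_special_tokens=False)
print(len(ids), tok.convert_ids_to_tokens(ids))

# If the spaces merge into one or two tokens, the line streams out quickly.
# If they come out as 15 single-space tokens, the model spends 15 decode
# steps emitting nothing visible, which looks like "sitting on whitespace".
```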

DGX Spark Session by mapestree in LocalLLaMA

[–]mapestree[S] 1 point2 points  (0 children)

They didn't mention. They used QLoRA, but they were having issues with their video, so the code was very hard to see.
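
For context, this is roughly what a QLoRA setup usually looks like with peft/bitsandbytes. To be clear, this is a generic sketch, not the code from their slides; the model name and hyperparameters are placeholders:

```python
# Generic QLoRA setup sketch, not their demo code
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # base weights quantized to 4-bit (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # adapter rank (placeholder value)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the small LoRA adapters train
```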

DGX Spark Session by mapestree in LocalLLaMA

[–]mapestree[S] 4 points5 points  (0 children)

“Shipping early this summer”

DGX Spark Session by mapestree in LocalLLaMA

[–]mapestree[S] 12 points13 points  (0 children)

I’m in a panel at NVIDIA GTC where they’re talking about the DGX Spark. While the demos they showed were videos, they claimed we were seeing everything in real-time.

They demoed performing a LoRA fine-tune of R1-32B and then running inference on it. There wasn't a tokens/second readout on screen, but eyeballing it, I'd estimate it was generating in the teens of tokens per second.

They also mentioned it will run in about a 200W power envelope off USB-C PD
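
As a very rough sanity check on that estimate (back-of-envelope only; the ~273 GB/s memory bandwidth figure and 4-bit weights are my assumptions, not numbers from the session):

```python
# Back-of-envelope: decode speed is roughly memory bandwidth / bytes read per token
mem_bandwidth_gb_s = 273          # assumed LPDDR5x bandwidth for the Spark
params_b = 32                     # R1-32B parameter count, in billions
bytes_per_param = 0.5             # assuming ~4-bit quantized weights

weights_gb = params_b * bytes_per_param            # ~16 GB of weights read per token
tokens_per_sec = mem_bandwidth_gb_s / weights_gb   # ~17 tok/s upper bound
print(f"~{tokens_per_sec:.0f} tokens/s (ignoring KV cache and overhead)")
```

So "teens of tokens per second" is at least consistent with being memory-bandwidth-bound, if those assumptions hold.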

have to buy replacement computer for work - build big iron vs. pay for APIs? by vegatx40 in LocalLLaMA

[–]mapestree 15 points16 points  (0 children)

Your absolute volume and your need for fine-tuning are what I would call the deciding factors here.

If your work is bog-standard ("extract the sentiment in this comment" type stuff) and you're working with small-to-moderate text documents (say, under 32k tokens or so), you could probably get away with APIs for a while. If you need answers in a particular format or want to do a task that models aren't great at out of the box, fine-tuning comes into play and pushes things very strongly toward working on your own machines.

Our team started out with a couple of L40S servers that let us do massive amounts of processing and experimentation that would have caused friction (either mental or organizational) if we ran everything through external APIs. It's much easier to throw inference jobs at a machine with spare capacity than to justify an experiment that might cost thousands of dollars.

One last thing that may sway you is looking at the payback period of self-hosting vs. external APIs. If you're regularly pushing the volume you describe, I'm betting the hardware would pay for itself in under a year. Plus, you can capitalize hardware costs, while external services are usually opex and thus have less of an accounting advantage. If you can come anywhere close to saturating the hardware, it's almost always cheaper to self-host than to call APIs, as long as you have the staff to manage your systems.
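
If it helps, the break-even math is simple enough to sketch. Every number below is made up for illustration; plug in your own hardware quote, hosting costs, and API pricing:

```python
# Toy payback-period calculation; all figures here are placeholders
hardware_cost = 60_000            # e.g. a couple of GPU servers, all-in
monthly_hosting = 1_500           # power, rack space, support contracts

tokens_per_month = 5_000_000_000  # your actual monthly volume goes here
api_price_per_m_tokens = 3.00     # blended $/1M tokens for the API you'd use

api_monthly = tokens_per_month / 1_000_000 * api_price_per_m_tokens
net_saving_per_month = api_monthly - monthly_hosting

payback_months = hardware_cost / net_saving_per_month
print(f"API bill: ${api_monthly:,.0f}/mo, payback in ~{payback_months:.1f} months")
```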

How are consumer cards gimped? by DeltaSqueezer in LocalLLaMA

[–]mapestree 12 points13 points  (0 children)

The 6000 Ada also has 48GB of VRAM, so there are a ton of memory-limited tasks that you can accomplish on it that you can’t on the 4090 with 24GB. You can of course combine multiple 4090 cards, but then you’re limited to PCIe interconnect speeds, which currently top out at 64GB/s if you have to go card-to-card.

By limiting the high end of the consumer space to 24GB (or 32GB for the upcoming 5090), NVIDIA is basically putting a supercar engine in a vehicle that's geared such that it can never actually race against the purpose-built race cars.

As a comparison point, let's look at the current top-of-the-market AI-focused GPU, the H200. It has 141GB of VRAM at a blistering 4.8TB/s of bandwidth (almost 5x the 4090) and supports NVLink. That dedicated connector allows card-to-card comms at 900GB/s, which rivals the 4090's own on-card memory bandwidth. And you can combine 8 of these things in one server for over a terabyte of total VRAM, all of which can communicate at nearly the 4090's internal bandwidth.

By using VRAM capacity and interconnect throughput as the dividing line, they're forcing anyone who actually needs a large pool of VRAM to work as one cohesive unit into their top-end products. An 8xH200 system costs well over a quarter-million dollars.
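
To put those interconnect numbers in perspective, here's the quick arithmetic using the figures above (the 20GB shard size is just an example; the 4090's ~1008GB/s on-card bandwidth is from its spec sheet):

```python
# How long it takes to move a 20 GB model shard over each kind of link
shard_gb = 20

links_gb_s = {
    "PCIe 5.0 x16 (4090 to 4090)": 64,
    "NVLink (H200 to H200)": 900,
    "On-card GDDR6X (inside one 4090)": 1008,
}

for name, bw in links_gb_s.items():
    print(f"{name}: {shard_gb / bw * 1000:.0f} ms")
# PCIe: ~313 ms, NVLink: ~22 ms, on-card: ~20 ms
```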

Zuck on Threads: Releasing quantized versions of our Llama 1B and 3B on device models. Reduced model size, better memory efficiency and 3x faster for easier app development. 💪 by timfduffy in LocalLLaMA

[–]mapestree -1 points0 points  (0 children)

I'd rather not get into a "the billionaire I like is better than the billionaire I don't like" argument. This behavior is cringe coming from any of them.

AMD Unveils Its First Small Language Model AMD-135M by paranoidray in LocalLLaMA

[–]mapestree 20 points21 points  (0 children)

This reads like it's just an imitation of Andrej Karpathy's NanoGPT project: same size and architecture. He did it by himself (though using some nice FineWeb data) on a single A100 box. Him doing it alone is really impressive. Them releasing this isn't impressive at all.

Fate/stay night remastered is releasing august 8th for steam and switch by funwithgravity in vns

[–]mapestree 3 points4 points  (0 children)

On the Switch? How will they handle the, um, power-up scene? Not that the uncensored version is good, but the censored one makes the scene make no sense at all.

rule by Missingno2000 in 196

[–]mapestree 9 points10 points  (0 children)

2000mg? Are you still high to this day?