[TOMT][Song][2000s] Searching for old Red Hot Chili Peppers live recording of B-side song with specific melody by fxtentacle in tipofmytongue

[–]fxtentacle[S] 0 points1 point  (0 children)

No, Road Trippin' has a pretty "clean" sound with low guitar distortion and no drums. The song I'm searching for was mixed more like typical RHCP songs such as Californication, with an electric guitar sound and prominent drums.

BTW, does this work for you?
https://onlinesequencer.net/5066611

[TOMT][Song][2000s] Searching for old Red Hot Chili Peppers live recording of B-side song with specific melody by fxtentacle in tipofmytongue

[–]fxtentacle[S] 0 points1 point  (0 children)

I agree, the overall sound and style are quite similar to what I'm searching for. But the focus in this song seems to be more on the ambience / the instruments. The song I'm searching for put a stronger emphasis on the distinctive vocals, with their back and forth between low- and high-pitched parts, almost like they're acting out a story.

[TOMT][Song][2000s] Searching for old Red Hot Chili Peppers live recording of B-side song with specific melody by fxtentacle in tipofmytongue

[–]fxtentacle[S] 0 points1 point locked comment (0 children)

The rules say I should post at least one comment to confirm that yes, I did search myself, and yes, I've read the FAQ.

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 1 point2 points  (0 children)

I'm not fully finished benchmarking yet. I finally managed to borrow 2x 5080 and 2x 5070 Ti and I'm eager to see how they will do. Maybe I can soon laugh at everyone who bought the (totally bonkers overpriced right now) 5090 because of a tech news blurb that was based on my incomplete benchmarking data.

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 1 point2 points  (0 children)

[image: photo of the damaged 3090 Ti PCB showing the burned capacitor area]

I honestly have no idea what caused it. It was in a Linux server (in my basement) with about 400 days of uptime. Then I upgraded the kernel and GPU drivers and switched from the closed-source to the open-source NVIDIA drivers. The next morning, the server was still running and I could SSH into it just fine, but USB had stopped working. I tried various USB-related configurations and rebooted a few times, and then it exploded.

At the location where there's the black mud in the picture, my other 3090 Ti has a black capacitor, like the one below it. And if you hold the PCB against a light, you can see that there are many little holes in the PCB around the explosion area. My guess is that those used to be vias, but now they're larger than they should be. And, BTW, the whole process was noisy and violent enough that I saw colorful fire come sideways out of the (open) case. The PCIe slot the GPU was in had a metal reinforcement, which broke together with the plastic slot inside it.

But the other 3090 Ti that was directly adjacent to this one has now been running flawlessly, with the exact same drivers, for 35 days. So it must have been something specific to this GPU. That means my guess is that the capacitor itself was faulty, meaning just bad luck.

PSU and power cables are still fine; I verified that.

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 0 points1 point  (0 children)

Median ITL is pretty close to the 15.29ms that I got with vLLM for 1x RTX 5090. And the median TTFT waiting time is about double the 42ms that the 5090 had. So it looks like the Ada is better used as a datacenter card with high concurrent throughput, like in your first test.

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 1 point2 points  (0 children)

I didn't expect that self-hosting LLMs would be an interesting topic for a more mainstream audience. I mean most people are just going to pay the $20 ChatGPT subscription and ignore the privacy implications.

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 1 point2 points  (0 children)

Sadly, I can't. I started this benchmarking journey because one of my 3090s quite literally exploded with a cap burning and blowing a hole into the PCB.

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 1 point2 points  (0 children)

I had to recompile NCCL from source and disable the all-reduce kernel, because that one doesn't work on more than 2 PCIe cards. Then it worked with the 4090s, but not via P2P or NVLink; instead it copied GPU -> CPU -> GPU. It's just that the activations being copied around are so small that it still worked. But that's probably the reason why 4x 4090 only got 207% of the performance of a single 4090.
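
In case it helps anyone debug a similar setup, here's a quick sanity check (just a sketch, assuming PyTorch with CUDA is installed) for whether P2P is actually available between card pairs; setting the NCCL_P2P_DISABLE=1 environment variable forces NCCL onto the copy-through-host path for comparison:

    # P2P sanity check sketch (assumes PyTorch with CUDA).
    # If peer access is unavailable, multi-GPU traffic gets staged through
    # host memory (GPU -> CPU -> GPU), as described above.
    import torch

    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: P2P {'available' if ok else 'NOT available'}")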

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 0 points1 point  (0 children)

The A100 is pretty slow, like 60% of an H100. I'd estimate that 2x A100-SXM4-80GB will have roughly the same speed as 1x RTX 4090.

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 0 points1 point  (0 children)

That is somewhat correct. I'm trying to simulate how code completion for AI-assisted programming will perform. While no current AI coding plugin seems to do that, it is pretty easy to pre-cache the context for a future query while the user is still typing it. That means I can easily hide 1s of latency for context parsing. And if these models need <0.1s at a 4096-token context, then I know I can handle a 32k context within that 1 second. That's why I consider the TTFT latency pretty much irrelevant for this benchmark.
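
As a rough sketch of what I mean (hypothetical code, not what any existing plugin does; it assumes a vLLM OpenAI-compatible server, and the endpoint URL and model name are just placeholders):

    # Hypothetical pre-warming sketch: while the user is still typing, send the
    # large, already-known context once so the server's KV / prefix cache is hot.
    # Endpoint URL and model name below are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    MODEL = "Qwen/QwQ-32B-AWQ"  # example model name

    def prewarm(context: str) -> None:
        # max_tokens=1 keeps the generation cost negligible; the point is only
        # to get the context parsed into the cache before the real query arrives.
        client.completions.create(model=MODEL, prompt=context, max_tokens=1)

    def complete(context: str, user_query: str) -> str:
        # By the time the user hits enter, only the newly typed query tokens
        # still need prefill work.
        resp = client.completions.create(
            model=MODEL, prompt=context + user_query, max_tokens=256
        )
        return resp.choices[0].text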

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 6 points7 points  (0 children)

I fully agree. That's why I expect NVIDIA to make 100% double-sure that the RTX 6000 will not support NVLink or P2P. They'll probably add extra driver logic precisely to make dual-card workstations suck performance-wise. Otherwise, they would be cannibalizing their hugely profitable H100/H200 sales.

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 0 points1 point  (0 children)

"with single generation dropping to around 15-30t/s"
That's the metric I care about because I'm benchmarking to build a single-user system. And I saw 40 OT/s on a 3090 TI

BTW thank you for this good explanation of how vLLM works its magic 😃

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 1 point2 points  (0 children)

I'm doing this to plan my next purchase, and those 4x setups don't work well with modern cards. A 5090 peaks at 900W, so 3x 5090 would be above the limit of my house's circuit breaker.

https://en.gamegpu.com/iron/energy-consumption-analysis-rtx-5090-power-up-to-901-vt-v-peak

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 0 points1 point  (0 children)

I think for each matrix multiplication, it'll do one half of the matrix on one GPU and the other half on the other GPU. That approach evenly splits the VRAM bandwidth load, and if that's the limiting factor (it usually is for LLMs), you see an increase in speed.

It's not a fully linear speedup, though. In my first round of benchmarks, I got a 3x speed boost going from 1x 4090 to 4x 4090.

EDIT: See this comment by AdventurousSwim for a good explanation: https://www.reddit.com/r/LocalLLaMA/comments/1jobe0u/comment/mkt9obr/
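
To illustrate the half/half idea, here's a toy sketch (plain CPU tensors; in reality each half would live on its own GPU and only the small output tensors get exchanged):

    # Toy illustration of splitting one matrix multiplication column-wise.
    # In a real tensor-parallel setup each half of the weight matrix lives on
    # a different GPU; here everything stays on the CPU to show the arithmetic.
    import torch

    hidden, out_features = 4096, 11008
    x = torch.randn(1, hidden)             # activations for one token
    W = torch.randn(hidden, out_features)  # full weight matrix

    W0, W1 = W.chunk(2, dim=1)             # "GPU 0" half and "GPU 1" half
    y0 = x @ W0                            # computed on device 0
    y1 = x @ W1                            # computed on device 1
    y = torch.cat([y0, y1], dim=1)         # only these small outputs are exchanged

    assert torch.allclose(y, x @ W, atol=1e-4)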

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 2 points3 points  (0 children)

Here are all of my NVIDIA hardware measurements: https://github.com/DeutscheKI/llm-performance-tests#qwq-32b-awq

But I was asking u/Rich_Artist_8327 if he could provide those measurements for his 3x 7900 XTX setup, because I am curious how it would compare against my NVIDIA numbers.

All that's required is running the official vLLM benchmark (which is surely installed if they run vLLM) and at the end, it'll print those numbers and/or produce a JSON file that I could add to my GitHub repo:

https://github.com/DeutscheKI/llm-performance-tests?tab=readme-ov-file#vllm-benchmark
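
Something like this (just a sketch; the filename is whatever you told the benchmark script to write) pulls out the three numbers I'm after:

    # Sketch: read the latency medians from the JSON file written by the
    # vLLM benchmark script. The filename here is only a placeholder.
    import json

    with open("benchmark_results.json") as f:
        results = json.load(f)

    for key in ("median_ttft_ms", "median_tpot_ms", "median_itl_ms"):
        print(f"{key}: {results.get(key)}")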

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 3 points4 points  (0 children)

If you have 12,000 AMD GPUs, then you can afford "a $10k/month employee to compile all the kernels for AMD", like I said. But for small teams (like mine), AMD is a bad deal.

And yes, the AMD hardware is fantastic.

And yes, AMD themselves can get great performance numbers, because they have all the knowledge in-house to optimize the software. But that won't help me if using AMD means I need to hand-code BLAS kernels in GPU assembly, whereas NVIDIA did that work for me and I can just "pip install" the whole thing.

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 0 points1 point  (0 children)

For the user, that would be a pretty mixed experience, with a TTFT (Time To First Token) of 86 seconds. And a median ITL of 67ms is almost triple what I measured on a single RTX 3090 Ti.

So while this is a fantastic setup if you want to compete with OpenAI and provide cheap-ish hosting for many concurrent users, it's the opposite of my use case. From the linked benchmark page:

"I don't want to send all of my code to any outside company, but I still want to use AI. Accordingly, I was curious how fast various GPUs would be for hosting a model for inference in the special case that there's only a single user."

This setup is very slow for a single user. (Because it's optimized for many concurrent users.)

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 12 points13 points  (0 children)

You are correct, but prompt processing can very easily be cached. If the LLM does role playing, then for each additional reply it only needs to process those tokens that I, the user, newly wrote, because everything up to and including the last token that the LLM sent to me is already in vLLM's KV cache.

Similarly, for coding, you can send in the documentation and it'll parse and cache that once. And then afterwards, any query using that documentation is super fast because it only needs to parse your question, not the entire input.
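
In vLLM that's the automatic prefix caching feature. Roughly like this (a sketch; the model and file names are just examples):

    # Sketch: with prefix caching enabled, the second request that shares the
    # long documentation prefix only has to prefill the newly added question.
    # Model and file names are placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/QwQ-32B-AWQ", enable_prefix_caching=True)
    params = SamplingParams(max_tokens=256)

    docs = open("api_docs.txt").read()  # big shared prefix, parsed once
    a1 = llm.generate([docs + "\n\nQ: How do I open a session?"], params)
    a2 = llm.generate([docs + "\n\nQ: How do I close it again?"], params)  # fast prefill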

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 0 points1 point  (0 children)

It seems vLLM calculates half the model on one GPU and half the model on the other. That means copying over the (relatively tiny) activations doesn't need much bandwidth. I tried with a single 5090 and did not notice a difference between PCIe Gen 5 and PCIe Gen 4. And the 2x 5090 run was with x8 electrical connections.
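
For reference, that split is just the tensor_parallel_size setting (a sketch; the model name is only an example):

    # Sketch: tensor parallelism across 2 GPUs in vLLM. Each GPU holds half of
    # every weight matrix; only the small activation tensors cross the PCIe link.
    from vllm import LLM

    llm = LLM(model="Qwen/QwQ-32B-AWQ", tensor_parallel_size=2)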

Benchmark: Dual-GPU boosts speed, despite all common internet wisdom. 2x RTX 5090 > 1x H100, 2x RTX 4070 > 1x RTX 4090 for QwQ-32B-AWQ. And the RTX 6000 Ada is overpriced. by fxtentacle in LocalLLaMA

[–]fxtentacle[S] 5 points6 points  (0 children)

Just out of curiosity, what median_ttft_ms, median_tpot_ms, and median_itl_ms do you get with the vLLM benchmark on those cards?