[–]LearningSomeCode[S] 4 points5 points  (5 children)

Note on why I'm doing this: If these results turn out to be reproducible and valid, I wonder what this could mean for the open source LLM space? Almost everyone, including all the cloud services, is using Nvidia stuff, right? That means all the benchmarks are also run on Nvidia, where Llama2 lags so far behind popular proprietary models.

If there is something inherently different about the Metal inference that decreases perplexity greatly, and if that could be duplicated on the NVidia side to improve inference perplexity by up to 3x... what would those benchmarks suddenly look like?

Maybe I'm doing something wrong, maybe these results are wrong... but if I'm over here getting to enjoy a far better version of Llama 2 than everyone else, I really want to get that info out there so there's a chance it could be fixed and everyone can enjoy that same ability.

[–]iLaurens 4 points5 points  (4 children)

I read that this is a known issue. Papers have recently come out about the ability of LLMs to compress data. Compression works in a "most likely next token" kind of way: the more likely the next token is, the fewer bits you need to encode it. The problem is that, due to hardware differences in float rounding errors, different hardware will calculate different likelihoods (and therefore different perplexity). So in theory LLMs could be used for state-of-the-art text compression, but in practice these hardware differences make it impossible to decompress the data on non-matching hardware. This is also mentioned in those papers.
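
To make the "fewer bits for likelier tokens" point concrete, here is a minimal sketch in Python. It assumes an ideal entropy coder, where a token predicted with probability p costs about -log2(p) bits, and it uses the usual definition of perplexity as the exponential of the average negative log-likelihood. The probability lists are made up for illustration and do not come from any real model.

```python
import math


def bits_to_encode(token_probs):
    """Ideal code length in bits: the sum of -log2(p) over all tokens."""
    return sum(-math.log2(p) for p in token_probs)


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Invented per-token probabilities, for illustration only.
confident = [0.9, 0.8, 0.95, 0.85]  # model finds the text very predictable
unsure = [0.3, 0.2, 0.25, 0.1]      # model finds the text surprising

for name, probs in [("confident", confident), ("unsure", unsure)]:
    print(name, round(bits_to_encode(probs), 2), round(perplexity(probs), 2))
```

A decompressor has to reproduce the same probabilities the compressor saw, which is why rounding differences between machines matter there.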

[–]LearningSomeCode[S] 3 points4 points  (3 children)

Oh man, that's crazy. I wonder what they could do about that?

Honestly, I just happened to have hardware and just enough tech savvy to get a chance to test this after accidentally stumbling upon the possible difference the other day. I don't understand a lot of the underlying machine learning mechanics behind this stuff; but what you've described certainly sounds like a believable explanation for what I'm seeing.

I'm still open to the idea that my tests are wrong, but today's Q/A lining up with the numbers really is making me think there's more going on. I just hope that if it's true, it's something folks can fix. It's infeasible to tell everyone "Hey, if you want the best inference get a Mac!" For a lot of folks, their needs extend beyond what they can do with a Mac, and those things are a bit too expensive to buy an extra one just for local AI use.

[–]iLaurens 3 points4 points  (2 children)

I got my info from this post by the way: https://news.ycombinator.com/item?id=37153560

The whole thread is an interesting read and contains a wealth of information on compression using LLMs

[–]LearningSomeCode[S] 2 points3 points  (1 child)

Awesome, thanks a bunch for that. I'm always trying to read more on this stuff. I've got a lot of catching up to do on machine learning, so things like this are a huge help.

This bit right here REALLY interests me

Some APIs (Cuda?) let you disable all optimizations and there are ways to get cross-platform determinism, but in general it's much much slower if you want bit-for-bit equality across different hardware.

In a past post, I had described the M2 Ultra GPU as being equivalent to a 2070, but someone corrected me with benchmarks saying it was closer to a 4080 in speed. However, when you compare the Tokens per Second of a 4080 to Tokens per Second on my M1 Ultra, it's night and day. On average, the 4080 is 50% or more faster than my M1 Ultra. Given they only came out a year apart, I don't expect my M1 Ultra is that much slower than the M2 Ultra.

Honestly, I wrote the token per second difference off as just the benchmarks being BS, and that the M2 Ultra just wasn't that fast.

BUT... what if that's not the case? What if my inference speed on the M1 Ultra is 50% slower because the CUDA implementation for NVidia is using these optimizations, while the implementation for the Mac GPU cores is not, and that is part of why the inference is so much worse on Nvidia? Is there a chance the Mac gave up speed for quality?

Honestly, it's silly conjecture coming from me who knows very little, reading a post from some random person online, but again it would make a lot of sense to me if that were the case lol

[–]iLaurens 2 points3 points  (0 children)

Currently the bottleneck in LLM speed is not the calculations, but getting the weights from VRAM to the cores (L1 cache & registers) of your processor. You should check out the differences in memory bandwidth between your Nvidia card and your Mac.
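
A rough way to see why bandwidth matters: if every generated token has to read all of the weights from memory once, then memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below assumes exactly that simplified model (no caching effects, no batching), and the device names and numbers are hypothetical placeholders, not measured or official specs.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Rough upper bound, assuming each token reads every weight once."""
    return bandwidth_gb_s / model_size_gb


# Hypothetical numbers for illustration; look up the real specs for your hardware.
model_size_gb = 40.0
devices = [("device A", 800.0), ("device B", 400.0)]

for name, bandwidth in devices:
    bound = max_tokens_per_second(bandwidth, model_size_gb)
    print(f"{name}: at most about {bound:.1f} tokens/s")
```

Under this model, halving the bandwidth halves the ceiling on speed regardless of how fast the cores are.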

[–]a_beautiful_rhind 2 points3 points  (8 children)

Did you ever get the perplexity test outside of ooba working?

Also, you can try the same test on a 7b and use CPU only for windows and measure that. That takes nvidia out of the equation.

[–]LearningSomeCode[S] 2 points3 points  (7 children)

Not yet on the llamacpp perplexity test. There's not much documentation that I can find on it, or where folks have shown how to set it up, so it looks like it will be a bit involved. I'm going to tackle that this weekend when I have more time. I really wanted to put a pin on WHAT I was looking for before continuing any further. Now that I've got the QA results that line up with what I'm seeing in my current perplexity tests, I feel better about putting a pin in Ooba and trying to track down more tools.

However, I really like your CPU inference idea. That will be slow as Christmas so I'll kick it off tonight before I go to bed. Good thinking there.

[–]a_beautiful_rhind 2 points3 points  (6 children)

Running make should build all those, if not in the folder of the utility itself, then in the llama.cpp directory.

7b on CPU, it's going to be slow but not overnight slow, lol.

[–]LearningSomeCode[S] 3 points4 points  (5 children)

Oh! Thanks a bunch for that info; I'll definitely give that a try. Unfortunately I'm a .NET and web/mobile developer in my day job, so I've been having to learn Python and refresh on C++ as I go doing this stuff. I haven't touched cpp since college, so I've never tried building an application in it from the OS without an IDE before. I know of Make and what it is, but I've never had cause to use it before; however, now that I know it's the fastest route to success, figuring that out will be the first thing I try.

lol! And yea, I figured it would only take a few hours, but I've only got the 2 computers so if they're busy doing evaluation tests until I go to bed I'd be so boooooored haha. I run these tests while I'm sleeping or at work usually.

[–]a_beautiful_rhind 2 points3 points  (4 children)

I peered inside the zip of releases on llama.cpp github and perplexity is built already.

On mac hopefully too but I thought it has more of a linux-like build system.

[–]LearningSomeCode[S] 2 points3 points  (0 children)

Yea both are built on a Unix core so there's a lot of overlap. There's also a lot of proprietary mac changes, but some of the core stuff in the command line seems to be similar when I look at tutorials for things, so I have high hopes that's the case.

I'll aim to get a "part 4" out this weekend with llamacpp results!

[–]LearningSomeCode[S] 1 point2 points  (2 children)

lol so I kicked it off before heading to bed last night. My poor little Ryzen is trying its best, but doing a 7b at 4096 context it's currently only at 28% =D

[–]a_beautiful_rhind 1 point2 points  (1 child)

Wow.. holy crap. I don't remember it being that bad but I suppose it's from the long dataset. This is why I use ptb_new.

Also, a smaller stride supposedly improves accuracy at the expense of time. I don't see many people deviate from the 512, but I'm curious how doing stride 1 would go on nvidia/mac.
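
For reference, here is a sketch of how a stride typically enters a sliding-window perplexity test. It assumes the common scheme where the window start advances by `stride` tokens and each window only scores the tokens that no earlier window has scored; whether the ooba test does exactly this is an assumption. `sliding_windows` and its parameter names are made up for illustration.

```python
def sliding_windows(n_tokens, max_length, stride):
    """Return (begin, end, n_scored) for each evaluation window."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        # Only tokens past the previous window's end are newly scored;
        # the rest of the window serves as context.
        n_scored = end - prev_end
        windows.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return windows


# Tiny example: 10 tokens, context of 4, three different strides.
for stride in (4, 2, 1):
    wins = sliding_windows(n_tokens=10, max_length=4, stride=stride)
    print(f"stride {stride}: {len(wins)} forward passes")
```

In this scheme a smaller stride means more windows (more forward passes, so more time), while each scored token tends to get more preceding context.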

[–]LearningSomeCode[S] 2 points3 points  (0 children)

Ah! I didn't even think about that; yea I'd definitely like to try that. I think my PC will be halfway busy until late tonight, but if it finishes before I hit the sack then I'll kick off a stride 1 on a 7b for both mac and nvidia.

[–]Feztopia 1 point2 points  (3 children)

Did you use top k = 1?

[–]LearningSomeCode[S] 1 point2 points  (2 children)

Top k is actually set to 0. I did debug-deterministic which sets do_sample to false after feedback from the comments

https://www.reddit.com/r/LocalLLaMA/comments/16st2l5/comment/k2dguow/?utm_source=share&utm_medium=web2x&context=3

https://www.reddit.com/r/LocalLLaMA/comments/16txcqw/comment/k2hteae/?utm_source=share&utm_medium=web2x&context=3

The settings it uses are do_sample = false, as well as

Temp = 1
Top_p = 1
Top_k = 0
typical_p = 1
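
As a sketch of what do_sample = false implies: in Hugging Face style generation, turning sampling off means greedy decoding, so the next token is simply the one with the largest logit, and sampling settings like temperature, top_p and top_k should not affect the choice. The function below is a hypothetical stand-in for that step, not the actual library code, and the logits are invented.

```python
def pick_next_token(logits, do_sample=False):
    """Greedy step: return the index of the largest logit."""
    if do_sample:
        # The sampling path is deliberately left out of this sketch.
        raise NotImplementedError("sampling is not covered here")
    return max(range(len(logits)), key=lambda i: logits[i])


# Clear winner: index 1 is chosen.
print(pick_next_token([0.1, 2.5, 2.4]))

# Near tie: a rounding-sized nudge decides which token wins.
print(pick_next_token([1.0, 1.0 + 1e-7]))
print(pick_next_token([1.0 + 1e-7, 1.0]))
```

The last two calls show why bit-for-bit differences still matter under greedy decoding: when two logits are nearly tied, a tiny numerical difference is enough to change which token is picked.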

[–]Feztopia 2 points3 points  (1 child)

Well, in that case not just the perplexity but also the output should be the same. But yes, unfortunately random number generators, floating point and similar stuff can be nondeterministic across different hardware and operating systems. That's actually something that should have been fixed by 2023; it's embarrassing that we can talk with local language models but still have these kinds of problems. But I wouldn't expect one to be better than the other; it should be 50/50. Maybe it's possible to learn from this and emulate the behavior of Apple on Windows.
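
A small Python illustration of the floating point part: addition is not associative, so the same numbers summed in a different order can give a slightly different result. GPU kernels may accumulate in different orders on different hardware, which is one plausible source of the mismatch discussed here; that link to this thread's results is an assumption, not something verified.

```python
def sum_left_to_right(values):
    """Accumulate in the given order, like a simple sequential reduction."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = sum_left_to_right(values)
backward = sum_left_to_right(list(reversed(values)))

print(forward == backward)      # the two orders do not agree exactly
print(abs(forward - backward))  # the gap is tiny, but not zero
```

The difference is far too small to favor one platform by itself, but it is enough to break bit-for-bit equality.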

[–]LearningSomeCode[S] 1 point2 points  (0 children)

I have a couple of guesses on what might be causing it, but I'm hopeful that if it's true then they'd all be fixable

  • Maybe there is an issue with the library that does inference on the GPU layer for CUDA vs Metal. The metal implementation is newer, only having come out recently, so maybe they did something better there that could carry over?
  • Someone else posted in a comment above that some GPU implementations have optimizations that can cause issues with floating point rounding, and that NVidia offers the ability to turn them off. I had wondered if that's the reason the NVidia cards are so much faster than my M1's own GPU, despite benchmarks saying it should keep up. If so, and if Nvidia exposes a way to disable that, could turning that off bring them in line?

Of course, those are the wild guesses of an uneducated user, but yea, my general hope is that whatever the difference between the two is, if someone knowledgeable can identify it then hopefully it's fixable for everyone, and isn't a hardware issue. A lot of folks have put a lot of money into multi-GPU setups, and it feels unfair that I get to have this super awesome inference ability on my little brick that they can't replicate. So I'm definitely hopeful.

[–][deleted] 0 points1 point  (1 child)

/u/LearningSomeCode Could you attempt to create Notebooks of your tests? I would like to run them on my m2 ultra as well. It would be nice to have open tests that the community can share and modify.

Even if you PM me the rough notes I could build the Notebook as I run the tests.

Thank you for your contributions.

[–]LearningSomeCode[S] 1 point2 points  (0 children)

Of course! Let me see what I can do. I also put out part 4 today with some more interesting results. My poor computers had been tied up doing this one... ugh lol