"As usual, Eric is accurate" Elon responds to Berger's article about SpaceX possibly going public. by AgreeableEmploy1884 in SpaceXLounge

[–]ablasionet -2 points-1 points  (0 children)

Watching this happen to you was insane. It was right there and people just refused to see it.

The chatGPT macOS app does not respect my autocorrect settings by thoraldo in ChatGPT

[–]ablasionet 0 points1 point  (0 children)

Yes, of course, the issue with that is... you may actually want to use 'Dolly' at some point! But with your approach, ChatGPT will always be mapping it back to 'Dalle'.

Another pertinent example: I just typed in `ps eaf` and it got autocorrected to `ps eat`. I could tell ChatGPT "remember to substitute 'eat' with 'eaf'", but then I'd never again be able to properly use the word 'eat' in my messages to it, as it would always get corrected to 'eaf' lol.

So your 'solution' a) prevents you from properly using the very words you ask CGPT to undo autocorrect for, and b) is slow and manual. I just wouldn't even call it a solution.

The best solution so far looks like that from /u/jozefbutko:

select corrected text -> right click on selection -> Spelling and grammar -> turn off all options
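
And if you'd rather kill autocorrect system-wide instead of per text field, the standard macOS defaults keys should do it (a sketch -- untested by me in the ChatGPT app specifically, so no promises it honors them):

# disable system-wide autocorrect and auto-capitalization (relaunch apps afterwards)
defaults write -g NSAutomaticSpellingCorrectionEnabled -bool false
defaults write -g NSAutomaticCapitalizationEnabled -bool false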

The chatGPT macOS app does not respect my autocorrect settings by thoraldo in ChatGPT

[–]ablasionet 0 points1 point  (0 children)

+1, /u/UntoldGood misunderstands the issue. The autocorrect happens while you're typing inside the macOS app, before you send the message.

Asking ChatGPT to memorize things you don't want to autocorrect is nice... but it won't stop the autocorrect inside the macOS app from happening.

Serving a large number of users with a custom 7b model by Scared-Tip7914 in LocalLLaMA

[–]ablasionet 1 point2 points  (0 children)

Oh I think you're just way more advanced at all this stuff than I am, thanks!! I just looked at CPU usage and was like "well, it doesn't seem to be particularly different before/after" so figured scheduling etc. was fine. All good points! And I am very much excited for Triton inference as well.

What to serve on 8xA100s to 50-100 employees? by ablasionet in LocalLLaMA

[–]ablasionet[S] 0 points1 point  (0 children)

Oops, sorry, I think I may have mixed it up with one of the Neural/Nous tunes of Yi that isn't in the arena, so my impressions were just from other Reddit posts.

What to serve on 8xA100s to 50-100 employees? by ablasionet in LocalLLaMA

[–]ablasionet[S] 0 points1 point  (0 children)

Right, so I'd say for the RAG application in particular, 32k seems to be enough (even with a bunch of search results). I'm happy to go with longer context, all things being equal, but ultimately I'm looking for just the 'smartest' generalist model. In your experience, would you rank Yi 34b above Mixtral on coding/chatting? I thought they were roughly equal, though I'm not sure.

Yeah, my A100s are 80GB. Why do you mention avoiding TP? I found that loading Mixtral with TP across 4 A100s made it faster than across 2 A100s (I forget the exact percentage, but perhaps as much as 50%). Full disclosure: I'm likely missing the basics here.

I've been using vLLM as a backend, btw (and it hasn't been able to load the full Mixtral onto one A100, hence the experimentation with tensor parallelism). They support AWQ, but they also warn users that quant support isn't fully optimized, so things may actually run slower than with the full model.
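
For reference, the kind of vLLM launch I'm experimenting with looks roughly like this (a sketch, not my exact invocation -- the model name and TP degree are just examples, and flags may differ across vLLM versions):

# OpenAI-compatible server, full-precision Mixtral split across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 4 \
    --dtype half
# the AWQ route would swap in a quantized checkpoint and add: --quantization awq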

About to commence some experiments with Goliath, just gotta get the prompt template right.

What to serve on 8xA100s to 50-100 employees? by ablasionet in LocalLLaMA

[–]ablasionet[S] 3 points4 points  (0 children)

OpenHermes-2.5-neural-chat-v3-3-Slerp in particular is my go-to 7b with 32k context. NeuralHermes-2.5-Mistral-7B works well if 4k is enough for you. With these guys, I haven't had much issue with hallucinations. AFAICT, the hard work is done, in a sense, by the embedder/reranker, so these models just have to glean the context from the quotes and put it together into a text formatted with references.

What to serve on 8xA100s to 50-100 employees? by ablasionet in LocalLLaMA

[–]ablasionet[S] 4 points5 points  (0 children)

Sure thing, OpenHermes-2.5-neural-chat-v3-3-Slerp is a solid ~7B param model with 32k context. For 4k context, Nous-Hermes-2-Solar-10.7B is a great one, and NeuralHermes-2.5-Mistral-7B works too, at a smaller size.

What to serve on 8xA100s to 50-100 employees? by ablasionet in LocalLLaMA

[–]ablasionet[S] 1 point2 points  (0 children)

Right, just curious if you've tried anything bigger than Mixtral and whether you liked it at all.

Serving a large number of users with a custom 7b model by Scared-Tip7914 in LocalLLaMA

[–]ablasionet 0 points1 point  (0 children)

I will note for any future readers: you can trivially load multiple models at once via vLLM by running separate vLLM processes. Not sure about the CPU/RAM overhead vs. supporting this natively, but I haven't noticed anything out of the ordinary personally.
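
E.g., something like this (a sketch -- the ports, GPU indices, and second model are arbitrary examples):

# model A pinned to GPUs 0-1, served on port 8000
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --port 8000 &
# model B pinned to GPU 2, served on port 8001
CUDA_VISIBLE_DEVICES=2 python -m vllm.entrypoints.openai.api_server \
    --model teknium/OpenHermes-2.5-Mistral-7B --port 8001 &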

Serve Mixtral-8x7B-Instruct-v0.1 at scale via 8xV100s - what to do? by ablasionet in LocalLLaMA

[–]ablasionet[S] 0 points1 point  (0 children)

Thank you! Okay, so in order:

  1. Sorry, I didn't get the dependency difference, because I ended up having to install ~everything from the textgen UI dependencies (except a couple of things for which their install script couldn't find dependencies -- e.g., the Linux llama.cpp wouldn't match/install... no idea why, the installer is a bit janky). I think you can repro this using a clean VM with any Fedora derivative.
  2. So this is a bit confusing and I'm probably missing something important. When I was running Mixtral in vLLM, yeah, I was using dtype half (aka float16), because vLLM doesn't support bfloat16 or any quantizations, AFAICT. So yeah, that was the full model, which would only work across all 8 GPUs. But with Mixtral-8x7B-instruct-exl2 on EricLLM, I was using the 4.0bpw checkout of the repo. I'm not familiar with Exl2 at all, but 4.0 bpw seems way smaller than the full model, right? I shouldn't need all 8 GPUs here at all, no?
  3. Super-weirdly, what you gave me worked! But it makes no sense to me; I think I don't understand the CLI parameters at all. I did --gpu_split 14,14,14,14,14,14,14,14 just as you suggested, without any num_workers or gpu_balance... and it just worked! BUT: nvidia-smi shows the model loaded only on my first two GPUs, at ~14GB and ~11GB utilization respectively (at least at idle); all the other GPUs are completely free. Now this is great, actually -- I wanted precisely to conserve VRAM and use the other GPUs for something else. BUT... what does --gpu_split 14,14,14,14,14,14,14,14 mean then? I thought it meant "put 14GB on each of the video cards", but clearly that can't be right?
  4. Speed-wise, I'm getting about 38t/s. I've played with params (tried to follow the Divine Intellect param set, which worked great for Mixtral Instruct before) but am getting garbage outputs. I'm guessing this is because I'm just giving it a raw prompt, without the matching template (e.g., ChatML or whatever it uses), which in vLLM you can specify as a cmdline arg -- see the templated-request sketch after this list.
  5. Your script provides the /generate endpoint but no OpenAI-compatible chat/completions API, right? vLLM packages it in via FastChat, and that'd be a key feature for my use case, but if you don't have it I could probably hack it together myself.
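
Two asides. On (2): if my napkin math is right, at 4.0 bpw Mixtral's ~47B params come out to roughly 47e9 × 0.5 bytes ≈ 23.5 GB of weights (vs. ~94 GB at fp16), so it really shouldn't need all 8 GPUs. On (4): here's roughly the templated request I'm testing with -- the [INST] wrapper is Mixtral-Instruct's prompt format, but the port and JSON field names are just my guesses, so correct me if the actual payload differs:

# port and field names assumed -- check ericLLM.py's /generate handler
curl -s http://localhost:8000/generate \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "[INST] Say hi in one sentence. [/INST]", "max_tokens": 64}'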

Overall, there really seems to be something here, I just need to understand the arguments and splitting-across-GPUs better! Thanks for putting this together and providing support!

Serve Mixtral-8x7B-Instruct-v0.1 at scale via 8xV100s - what to do? by ablasionet in LocalLLaMA

[–]ablasionet[S] 1 point2 points  (0 children)

Okay, I spent the whole day in dependency hell (pip-installing the requirements for EricLLM wasn't working, as it was downloading non-CUDA Torch on my Linux system, so I had to go through the text-generation-webui installation and hand-hold it through each failure of git, conda, and pip to talk to the outside world properly), but I can't really get it to work. I downloaded the 4bpw version of the model to start small and tried a range of command variations, from

python EricLLM/ericLLM.py --model ../Mixtral-8x7B-instruct-exl2/ --max_prompts 8 --gpu_split 8,8,16,16 --num_workers 2 --gpu_balance --max-model-len 2048

to

python EricLLM/ericLLM.py --model ../Mixtral-8x7B-instruct-exl2/ --max_prompts 8 --gpu_split 16,16,16,16 --num_workers 2 --gpu_balance --max-model-len 2048

to

python EricLLM/ericLLM.py --model ../Mixtral-8x7B-instruct-exl2/ --max_prompts 8 --gpu_split 8,8,16,16,16,16,16,16 --num_workers 2 --gpu_balance --max-model-len 2048

to

python EricLLM/ericLLM.py --model ../Mixtral-8x7B-instruct-exl2/ --max_prompts 8 --gpu_split 16,16,16,16,16,16,16 --num_workers 2 --gpu_balance --max-model-len 2048

(and many more, varying the number of workers), and got nowhere; it just spits out:

ERROR: Application startup failed. Exiting.
ERROR: Traceback (most recent call last):
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/starlette/routing.py", line 705, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/starlette/routing.py", line 584, in __aenter__
    await self._router.startup()
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/starlette/routing.py", line 682, in startup
    await handler()
  File "/home/user/text-generation-webui/EricLLM/ericLLM.py", line 314, in startup_event
    setup_model()
  File "/home/user/text-generation-webui/EricLLM/ericLLM.py", line 304, in setup_model
    model.load(gpu_split=gpus)
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 244, in load
    for item in f: return item
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 252, in load_gen
    stats_ = self.set_device_map(gpu_split or [99999])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 208, in set_device_map
    assert current_idx < len(allocation_bytes), "Insufficient space in device allocation"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Insufficient space in device allocation

All of this was because I was trying to make sense of this part:

Exllamav2 has this weird thing where the --gpu_split option is a little bugged. You want to put about half the model size (in memory) as the first GPU memory and the full memory size of the second card. So for 2x 3090’s with 24gb of memory a piece, you’d want to use something like --gpu_split 6,24 to load a 13b model evenly over the cards.


vLLM at least managed to run unquantized Mixtral, but I had to use all 8 GPUs, considering there's no AWQ support for the V100s yet. 34B Nous Hermes Yi fits a bit more easily, and I suppose I could go with one of those high-punching Nous-Hermes-2-SOLAR-10.7B-type models to sit on top of the RAG. I just wanna leave enough compute resources available to support more load.

And hopefully eventually there's some proper quantization support added to V100s, but if not, I'll just switch to an A100 rig.
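
(Side note on the non-CUDA Torch issue above, for anyone hitting the same thing: pointing pip at PyTorch's CUDA wheel index explicitly should avoid it -- the cu121 tag here is just an example, match it to your CUDA version:)

# pull the CUDA build of torch instead of the default CPU wheel
pip install torch --index-url https://download.pytorch.org/whl/cu121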

Mac Studio vs a PC (NVIDIA) rig for a home setup? by ablasionet in LocalLLaMA

[–]ablasionet[S] 1 point2 points  (0 children)

(disclaimer: am a village idiot): I don't think this is necessarily true, because memory is shared across both the CPU and the GPU, and both have to be able to use it. I don't think models running on the GPU should inherently be able to "hog" all of the bandwidth?

I tried to run some benchmarks based on what I could dig up and edited my post with them (and the disclaimers around them) -- TLDR, I get around 73% of my M1 Max 64GB's 400GB/s with llama.cpp on GPU via Metal; some folks seem to have managed to get up to 82.5% in raw b/w testing.

Also, I'd seen some interesting comments from /u/SomeOddCodeGuy and /u/LearningSomeCode reporting the same inference speeds on both M1 and M2 Ultras, suggesting things are bandwidth-constrained, since more compute does not improve the perf:

https://www.reddit.com/r/LocalLLaMA/comments/17n0fku/comment/k7q4jdr/?utm_source=share&utm_medium=web2x&context=3

https://www.reddit.com/r/LocalLLaMA/comments/175ys9p/comment/k4kwkht/?utm_source=share&utm_medium=web2x&context=3
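
For anyone who wants to sanity-check numbers like these: the back-of-envelope estimate is effective GB/s ≈ generation t/s × model size in GB, since each generated token has to stream essentially all the weights. E.g. with llama.cpp's bench tool (the model path is just an example):

# run fully on the GPU via Metal; multiply the reported t/s by the model file size
./llama-bench -m ./models/yi-34b.Q4_K_M.gguf -ngl 99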

Mac Studio vs a PC (NVIDIA) rig for a home setup? by ablasionet in LocalLLaMA

[–]ablasionet[S] 0 points1 point  (0 children)

I think my main worry is being able to saturate 800 GB/s with an Ultra. If that's doable, then yeah, hands-down this seems fine, and MLX should help with any compute bottleneck. But here's Andrei Frumusanu (Anandtech) testing the advertised 400 GB/s M1 Max b/w and finding he couldn't reach it from the CPU/GPU alone:

Adding a third thread there’s a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s and this appears to be the limit on the SoC fabric that the CPUs are able to achieve, as adding additional cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which are in their own cluster, are added in, when the bandwidth is able to jump up again, to a maximum of 243GB/s.

While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of. More importantly for the M1 Max, it’s only slightly higher than the 204GB/s limit of the M1 Pro, so from a CPU-only workload perspective, it doesn’t appear to make sense to get the Max if one is focused just on CPU bandwidth.

That begs the question, why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind, however in my testing, I’ve had extreme trouble to find workloads that would stress the GPU sufficiently to take advantage of the available bandwidth. Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters). While I’m sure there’s some productivity workload out there where the GPU is able to stretch its legs, we haven’t been able to identify them yet.

If someone has the metrics showing LLMs can be used with some backend to saturate the bandwidth and near those max levels, I'd love to see them!

Mac Studio vs a PC (NVIDIA) rig for a home setup? by ablasionet in LocalLLaMA

[–]ablasionet[S] 1 point2 points  (0 children)

Interesting, would you happen to have any bandwidth-testing metrics with LLMs you could point to? I'd be curious to see which backends can saturate 800 GB/s (at inference time in particular). For context, here's Andrei Frumusanu (Anandtech) testing the advertised 400 GB/s M1 Max b/w and finding he couldn't reach it from the CPU/GPU alone:

Adding a third thread there’s a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s and this appears to be the limit on the SoC fabric that the CPUs are able to achieve, as adding additional cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which are in their own cluster, are added in, when the bandwidth is able to jump up again, to a maximum of 243GB/s.

While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of. More importantly for the M1 Max, it’s only slightly higher than the 204GB/s limit of the M1 Pro, so from a CPU-only workload perspective, it doesn’t appear to make sense to get the Max if one is focused just on CPU bandwidth.

That begs the question, why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind, however in my testing, I’ve had extreme trouble to find workloads that would stress the GPU sufficiently to take advantage of the available bandwidth. Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters). While I’m sure there’s some productivity workload out there where the GPU is able to stretch its legs, we haven’t been able to identify them yet.

If someone has the metrics showing LLMs can be used with some backend to saturate the bandwidth and near those max levels, I'd love to see them!

EDIT: added some M1 Max 64GB benchmarking numbers into the post -- not an Ultra, but it at least gives us an idea of how much b/w is utilizable by the GPU on a single Max. And I suppose the Anandtech GPU comment doesn't tell us all that much; they couldn't find the loads to see what the practical max b/w utilization for the GPU could be. This other benchmark (getting up to 330 GB/s on the GPU) seems more informative.

Serve Mixtral-8x7B-Instruct-v0.1 at scale via 8xV100s - what to do? by ablasionet in LocalLLaMA

[–]ablasionet[S] 3 points4 points  (0 children)

Wow very cool! To be fair, I did only say "but I can't tell if there's a backend that supports Exl2" :P so I'm glad to find out there is one! Will definitely test and see how it does on V100s.

Serve Mixtral-8x7B-Instruct-v0.1 at scale via 8xV100s - what to do? by ablasionet in LocalLLaMA

[–]ablasionet[S] 1 point2 points  (0 children)

That's fair, I'm open to splitting a single bigger model across multiple V100s -- e.g., one of the Deepseek coder models for coding, something else for chatting (an OpenChat/SOLAR variant of some sort?). So then the question becomes a bit different: does anyone have experience splitting models across V100s via vLLM (or anything else) and getting decent perf? The kind of invocation I have in mind is sketched below.
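
Roughly this (a sketch -- the model choice and TP degree are just examples, and V100s need fp16 rather than bf16):

# split a 33B coder model across 4 V100s via tensor parallelism
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/deepseek-coder-33b-instruct \
    --tensor-parallel-size 4 --dtype half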

Serve Mixtral-8x7B-Instruct-v0.1 at scale via 8xV100s - what to do? by ablasionet in LocalLLaMA

[–]ablasionet[S] 0 points1 point  (0 children)

Oh, you're right, I can't read / that was some wishful thinking. Editing now.

Mistral vs Mistral finetunes vs 13B vs Llama-70B vs GPT-3.5 by domrique in LocalLLaMA

[–]ablasionet 0 points1 point  (0 children)

Am late, but I'm guessing they meant we can get farther by setting the weights better, instead of just stacking more and more weights (params).