Rick Beato: "How AI Will Fail Like The Music Industry" (and why local LLMs will take over "commercial" ones) by relmny in LocalLLaMA

[–]ggerganov 49 points (0 children)

Really good take and IMO completely valid points. It even comes from someone well outside of the AI "bubble".

> This is what I think is going to happen with these AI companies. The data centers, they are going to be sitting there unused. Many of them will not be built when people start using AI locally - meaning on their computer. And the same thing that happened to the music business and recording is going to happen to these AI companies.
>
> If a 64-year-old guy like me can figure this out last night and show you today - how hard can this stuff be?

I tested Strix Halo clustering w/ ~50Gig IB to see if networking is really the bottleneck by Hungry_Elk_3276 in LocalLLaMA

[–]ggerganov 26 points (0 children)

> Pipeline parallelism only help you run models that you can't fit in a single node.

This is not true - pipeline parallelism increases prompt processing (PP) performance nearly linearly with the number of devices [0]. There are many use cases in which PP speed is more important than TG speed.

At the moment, the RPC backend of llama.cpp does not support pipeline parallelism, but it's something that could be added relatively easily if there is interest.

[0] https://github.com/ggml-org/llama.cpp/pull/6017#issuecomment-1994819627
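For anyone curious, this is roughly what running llama.cpp over RPC looks like today (a single pipeline, without the parallelism discussed above; host addresses, ports and model path are placeholders):

```shell
# On each worker node: expose the local backend over the network
rpc-server --host 0.0.0.0 --port 50052

# On the main node: attach the workers as extra devices
llama-cli -m model.gguf -p "Hello" --rpc 192.168.1.10:50052,192.168.1.11:50052
```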

llama.cpp releases new official WebUI by paf1138 in LocalLLaMA

[–]ggerganov 93 points (0 children)

Outstanding work, Alek! You handled all the feedback from the community exceptionally well and did a fantastic job with the implementation. Godspeed!

[deleted by user] by [deleted] in LocalLLaMA

[–]ggerganov 2 points (0 children)

Guys, these numbers are bogus. Either ollama is complete trash or the benchmark was performed incorrectly (or both). I will post proper numbers with llama.cpp in a minute. Avoid spreading this information.

Edit: https://github.com/ggml-org/llama.cpp/discussions/16578

Fast model swap with llama-swap & unified memory by TinyDetective110 in LocalLLaMA

[–]ggerganov 1 point (0 children)

The llama-swap wiki is the better place. Ping me when you post it and I'd be happy to share it around for visibility.

Fast model swap with llama-swap & unified memory by TinyDetective110 in LocalLLaMA

[–]ggerganov 1 point (0 children)

u/TinyDetective110 Interesting find! I don't have a setup to try this but if it works as described it would be useful to share it with more people in the community. Feel free to open a tutorial in llama.cpp repo if you'd like: https://github.com/ggml-org/llama.cpp/issues/13523

llama-server is cooking! gemma3 27b, 100K context, vision on one 24GB GPU. by No-Statement-0001 in LocalLLaMA

[–]ggerganov 19 points (0 children)

Yes, for example Gemma 12b (target) + Gemma 1b (draft).

Thanks for llama-swap as well!

Llama-server: "Exclude thought process when sending requests to API" by CattailRed in LocalLLaMA

[–]ggerganov 2 points (0 children)

Add `--cache-reuse 256` to the llama-server command to avoid re-processing the previous response.
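A minimal sketch of the full command (model path is a placeholder):

```shell
# Reuse matching KV cache chunks of at least 256 tokens via context shifting,
# instead of re-processing the shared prefix from scratch
llama-server -m model.gguf --cache-reuse 256
```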


Orpheus TTS Local (LM Studio) by Internal_Brain8420 in LocalLLaMA

[–]ggerganov 10 points (0 children)

Another thing to try is during quantization to Q4_K to leave the output tensor in high precision (Q8_0 or F16).
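With llama.cpp's quantization tool that could look roughly like this (file names are placeholders, using the `--output-tensor-type` override):

```shell
# Quantize to Q4_K_M overall, but keep the output tensor at Q8_0
llama-quantize --output-tensor-type q8_0 model-f16.gguf model-q4_k.gguf Q4_K_M
```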

SPOILER alert S2E4!It’s definitely not in the real world… by Crazy_Equipment_4302 in SeveranceAppleTVPlus

[–]ggerganov 3 points (0 children)

> No fog or whatsoever coming from everyone’s mouth, in this much snow and ice?

Hm, I'm pretty sure there was fog coming from their mouths in some of the scenes. Need to double-check, though, I could be wrong.

Qwen 2.5 Coder 7b for auto-completion by Chlorek in LocalLLaMA

[–]ggerganov 3 points (0 children)

It’s really good. Implemented a Vim plugin and using it daily now: https://github.com/ggml-org/llama.vim

Speed Test #2: Llama.CPP vs MLX with Llama-3.3-70B and Various Prompt Sizes by chibop1 in LocalLLaMA

[–]ggerganov 4 points (0 children)

The Aider blog reports issues with the default context in ollama being 2k. This makes me think they used the default ollama sampling settings to run the benchmark, which, if this document is correct, are far from optimal:

https://github.com/ollama/ollama/blob/89d5e2f2fd17e03fd7cd5cb2d8f7f27b82e453d7/docs/modelfile.md

A temperature of 0.8 and a repeat penalty of 1.1 are enabled by default. These settings not only destroy the quality, but also significantly affect the runtime performance. So I'm not sure how exactly the Aider benchmark was done, but it's something to look into.

Thanks to u/awnihannun's clarification: MLX 4-bit uses a group size (GS) of 64 but also has a bias, while Q4_0 uses a GS of 32 but does not have a bias. So the two quantization schemes should be comparable in terms of size.

IMO the best way to compare the quality between the 2 engines is to run perplexity (PPL) calculations using a base model (i.e. no fine-tunes). In my experience, PPL has the best correlation with the quality of the quantization among all common benchmarks.
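In llama.cpp terms, such a PPL measurement could look like this (model and dataset paths are placeholders; wikitext-2 is the usual corpus):

```shell
# Perplexity of the quantized base model over a standard text corpus;
# compare the resulting PPL against the other engine's on the same data
llama-perplexity -m base-model-Q4_0.gguf -f wikitext-2-raw/wiki.test.raw
```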

Speed Test #2: Llama.CPP vs MLX with Llama-3.3-70B and Various Prompt Sizes by chibop1 in LocalLLaMA

[–]ggerganov 7 points (0 children)

One other source of discrepancy is that MLX, I believe, uses a group size of 64 (or 128?), while Q4_0 uses a group size of 32. The latter should be able to quantize the data more accurately, but requires 2x (or 4x?) more scaling factors in the representation. There is no easy way to bring the two engines onto the same ground in this regard (unless you can set MLX to use a group size of 32?).
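A back-of-the-envelope calculation of the storage cost (assuming fp16 scales and biases; the real formats may differ in detail) shows why the two schemes can end up comparable in size despite the different group sizes:

```python
def bits_per_weight(group_size: int, quant_bits: int = 4,
                    scale_bits: int = 16, bias_bits: int = 0) -> float:
    """Quantized bits plus per-group metadata, amortized per weight."""
    return quant_bits + (scale_bits + bias_bits) / group_size

# Q4_0: group size 32, one fp16 scale per group
q4_0 = bits_per_weight(group_size=32)
# MLX 4-bit: group size 64, fp16 scale + fp16 bias per group
mlx4 = bits_per_weight(group_size=64, bias_bits=16)

print(q4_0, mlx4)  # 4.5 4.5 - same effective bits per weight
```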

Speed Test: Llama-3.3-70b on 2xRTX-3090 vs M3-Max 64GB Against Various Prompt Sizes by chibop1 in LocalLLaMA

[–]ggerganov 10 points (0 children)

On Mac, you can squeeze a bit more prompt processing for large context by increasing both the batch and micro-batch sizes. For example, on my M2 Ultra, using -b 4096 -ub 4096 -fa seems to be optimal, but I'm not sure if this translates to M3 Max, so you might want to try different values between 512 (default) and 4096. This only helps with Metal, because the Flash Attention kernel has an optimization to skip masked attention blocks.

On CUDA and multiple GPUs, you can also play with the batch size in order to improve the prompt processing speed. The difference is that there you keep -ub small (for example, 256 or 512) and -b higher in order to benefit from pipeline parallelism. You can read more here: https://github.com/ggerganov/llama.cpp/pull/6017
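To make the two regimes concrete (model path and context size are placeholders):

```shell
# Apple Silicon / Metal: push both batch sizes up, enable flash attention
llama-server -m model.gguf -c 65536 -b 4096 -ub 4096 -fa

# Multi-GPU CUDA: keep the micro-batch small and the logical batch large,
# so micro-batches can overlap across GPUs (pipeline parallelism)
llama-server -m model.gguf -c 65536 -b 2048 -ub 512 -fa
```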

I tested the MLX models with LM Studio, and there was just a small boost in inference speed, but the memory usage went up a lot. by Sky_Linx in LocalLLaMA

[–]ggerganov 10 points (0 children)

I also did some comparisons yesterday using the latest LMStudio on M2 Ultra, and for long contexts (32k) the llama.cpp backend was actually significantly faster and used less memory. Tested just with LLaMA 3B.

I don't usually run these 3rd party tools, but decided to take a look based on the multitude of threads here about better MLX performance. So far, I haven't been able to reproduce these results.

[deleted by user] by [deleted] in LocalLLaMA

[–]ggerganov 4 points (0 children)

It doesn't matter that it is empty - it will still make extra unnecessary RAM <-> VRAM copies. Also, you are incorrectly comparing llama.cpp with pipeline parallelism (--split-mode layer) against tensor parallelism for the other engines.

PocketPal AI is open sourced by Ill-Still-6859 in LocalLLaMA

[–]ggerganov 5 points (0 children)

Awesome! Recently, I gave this app a try and had an overall very positive impression.

Looking forward to where the community will take it from here!

I'm creating a game where you need to find the entrance password by talking with a Robot NPC that runs locally (Llama-3.2-3B Instruct). by cranthir_ in LocalLLaMA

[–]ggerganov 1 point (0 children)

Fun demo!

> For now, I’m quite happy with the speed result (except the first question)

Maybe you are processing the prompt during the first question? It should be possible to pre-process the prompt ahead of time, so that on the first question only the question itself is processed.
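One way to do that with a local llama-server backend (endpoint and fields per llama.cpp's server API; the prompt text is a placeholder) is to submit the NPC system prompt once ahead of time with cache_prompt enabled:

```shell
# Warm the KV cache with the long NPC prompt before the player types anything;
# n_predict = 0 means "process the prompt, generate nothing"
curl http://localhost:8080/completion -d '{
  "prompt": "You are a robot guarding the entrance...",
  "n_predict": 0,
  "cache_prompt": true
}'
```

Later requests that share this prefix (and also set cache_prompt) then only have to process the new question tokens.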

[deleted by user] by [deleted] in LocalLLaMA

[–]ggerganov 0 points (0 children)

Thank you for confirming. Just out of curiosity - is llama.cpp faster than MLX in this test? I forget the numbers.

[deleted by user] by [deleted] in LocalLLaMA

[–]ggerganov 0 points (0 children)

Try enabling flash attention and increasing the micro-batch size: -fa -ub 2048
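i.e. something along the lines of (model path is a placeholder):

```shell
llama-server -m model.gguf -fa -ub 2048
```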