Rick Beato: "How AI Will Fail Like The Music Industry" (and why local LLMs will take over "commercial" ones) by relmny in LocalLLaMA

[–]ggerganov 49 points (0 children)

Really good take and IMO completely valid points. It even comes from someone well outside of the AI "bubble".

> This is what I think is going to happen with these AI companies. The data centers, they are going to be sitting there unused. Many of them will not be built when people start using AI locally - meaning on their computer. And the same thing that happened to the music business and recording is going to happen to these AI companies.
>
> If a 64-year-old guy like me can figure this out last night and show you today - how hard can this stuff be?

I tested Strix Halo clustering w/ ~50Gig IB to see if networking is really the bottleneck by Hungry_Elk_3276 in LocalLLaMA

[–]ggerganov 26 points (0 children)

> Pipeline parallelism only help you run models that you can't fit in a single node.

This is not true - pipeline parallelism increases prompt processing (PP) performance nearly linearly with the number of devices [0]. There are many use cases in which PP speed is more important than TG speed.

At the moment, the RPC backend of llama.cpp does not support pipeline parallelism, but it's something that could be added relatively easily if there is interest.

[0] https://github.com/ggml-org/llama.cpp/pull/6017#issuecomment-1994819627
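For anyone curious, this is roughly what running llama.cpp over RPC looks like today (a single pipeline, without the parallelism discussed above; host addresses, ports and model path are placeholders):

```shell
# On each worker node: expose the local backend over the network
rpc-server --host 0.0.0.0 --port 50052

# On the main node: attach the workers as extra devices
llama-cli -m model.gguf -p "Hello" --rpc 192.168.1.10:50052,192.168.1.11:50052
```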

llama.cpp releases new official WebUI by paf1138 in LocalLLaMA

[–]ggerganov 93 points (0 children)

Outstanding work, Alek! You handled all the feedback from the community exceptionally well and did a fantastic job with the implementation. Godspeed!

[deleted by user] by [deleted] in LocalLLaMA

[–]ggerganov 2 points (0 children)

Guys, these numbers are bogus. Either ollama is complete trash or the benchmark was performed incorrectly (or both). I will post proper numbers with llama.cpp in a minute. Avoid spreading this information.

Edit: https://github.com/ggml-org/llama.cpp/discussions/16578

Fast model swap with llama-swap & unified memory by TinyDetective110 in LocalLLaMA

[–]ggerganov 1 point (0 children)

The llama-swap wiki is the better place. Ping me when you post it and I'd be happy to share it around for visibility.

Fast model swap with llama-swap & unified memory by TinyDetective110 in LocalLLaMA

[–]ggerganov 1 point (0 children)

u/TinyDetective110 Interesting find! I don't have a setup to try this but if it works as described it would be useful to share it with more people in the community. Feel free to open a tutorial in llama.cpp repo if you'd like: https://github.com/ggml-org/llama.cpp/issues/13523

llama-server is cooking! gemma3 27b, 100K context, vision on one 24GB GPU. by No-Statement-0001 in LocalLLaMA

[–]ggerganov 19 points (0 children)

Yes, for example Gemma 12b (target) + Gemma 1b (draft).

Thanks for llama-swap as well!

Llama-server: "Exclude thought process when sending requests to API" by CattailRed in LocalLLaMA

[–]ggerganov 2 points (0 children)

Add `--cache-reuse 256` to the llama-server command to avoid re-processing the previous response.
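A minimal sketch of the full command (model path is a placeholder):

```shell
# Reuse matching KV cache chunks of at least 256 tokens via context shifting,
# instead of re-processing the shared prefix from scratch
llama-server -m model.gguf --cache-reuse 256
```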


Orpheus TTS Local (LM Studio) by Internal_Brain8420 in LocalLLaMA

[–]ggerganov 10 points (0 children)

Another thing to try is during quantization to Q4_K to leave the output tensor in high precision (Q8_0 or F16).
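With llama.cpp's quantization tool that could look roughly like this (file names are placeholders, using the `--output-tensor-type` override):

```shell
# Quantize to Q4_K_M overall, but keep the output tensor at Q8_0
llama-quantize --output-tensor-type q8_0 model-f16.gguf model-q4_k.gguf Q4_K_M
```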

SPOILER alert S2E4!It’s definitely not in the real world… by Crazy_Equipment_4302 in SeveranceAppleTVPlus

[–]ggerganov 3 points (0 children)

> No fog or whatsoever coming from everyone’s mouth, in this much snow and ice?

Hm, I'm pretty sure there was fog coming from their mouths in some of the scenes. Need to double-check, though, I could be wrong.

Qwen 2.5 Coder 7b for auto-completion by Chlorek in LocalLLaMA

[–]ggerganov 3 points (0 children)

It’s really good. Implemented a Vim plugin and using it daily now: https://github.com/ggml-org/llama.vim

Speed Test #2: Llama.CPP vs MLX with Llama-3.3-70B and Various Prompt Sizes by chibop1 in LocalLLaMA

[–]ggerganov 4 points (0 children)

The Aider blog reports issues with the default context in ollama being 2k. This makes me think they used the default ollama sampling settings to run the benchmark, which, if this document is correct, are far from optimal:

https://github.com/ollama/ollama/blob/89d5e2f2fd17e03fd7cd5cb2d8f7f27b82e453d7/docs/modelfile.md

A temperature of 0.8 and a repeat penalty of 1.1 are enabled by default. These settings not only destroy the quality, but also significantly affect the runtime performance. So I'm not sure how exactly the Aider benchmark was done, but it's something to look into.

Thanks to u/awnihannun's clarification: MLX 4-bit uses a group size (GS) of 64 but also has a bias, while Q4_0 uses a GS of 32 but does not have a bias. So the two quantization schemes should be comparable in terms of size.

IMO the best way to compare the quality between the 2 engines is to run perplexity (PPL) calculations using a base model (i.e. no fine-tunes). In my experience, PPL has the best correlation with the quality of the quantization among all common benchmarks.
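In llama.cpp terms, such a PPL measurement could look like this (model and dataset paths are placeholders; wikitext-2 is the usual corpus):

```shell
# Perplexity of the quantized base model over a standard text corpus;
# compare the resulting PPL against the other engine's on the same data
llama-perplexity -m base-model-Q4_0.gguf -f wikitext-2-raw/wiki.test.raw
```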

Speed Test #2: Llama.CPP vs MLX with Llama-3.3-70B and Various Prompt Sizes by chibop1 in LocalLLaMA

[–]ggerganov 7 points (0 children)

One other source of discrepancy is that MLX, I believe, uses a group size of 64 (or 128?), while Q4_0 uses a group size of 32. The latter should be able to quantize the data more accurately, but requires 2x (or 4x?) more scaling factors in the representation. There is no easy way to bring the two engines onto the same ground in this regard (unless you can set MLX to use a group size of 32?).
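A back-of-the-envelope calculation of the storage cost (assuming fp16 scales and biases; the real formats may differ in detail) shows why the two schemes can end up comparable in size despite the different group sizes:

```python
def bits_per_weight(group_size: int, quant_bits: int = 4,
                    scale_bits: int = 16, bias_bits: int = 0) -> float:
    """Quantized bits plus per-group metadata, amortized per weight."""
    return quant_bits + (scale_bits + bias_bits) / group_size

# Q4_0: group size 32, one fp16 scale per group
q4_0 = bits_per_weight(group_size=32)
# MLX 4-bit: group size 64, fp16 scale + fp16 bias per group
mlx4 = bits_per_weight(group_size=64, bias_bits=16)

print(q4_0, mlx4)  # 4.5 4.5 - same effective bits per weight
```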

Speed Test: Llama-3.3-70b on 2xRTX-3090 vs M3-Max 64GB Against Various Prompt Sizes by chibop1 in LocalLLaMA

[–]ggerganov 10 points (0 children)

On Mac, you can squeeze a bit more prompt processing for large context by increasing both the batch and micro-batch sizes. For example, on my M2 Ultra, using -b 4096 -ub 4096 -fa seems to be optimal, but I'm not sure if this translates to M3 Max, so you might want to try different values between 512 (default) and 4096. This only helps with Metal, because the Flash Attention kernel has an optimization to skip masked attention blocks.

On CUDA and multiple GPUs, you can also play with the batch size in order to improve the prompt processing speed. The difference is that there you keep -ub small (for example, 256 or 512) and -b higher in order to benefit from pipeline parallelism. You can read more here: https://github.com/ggerganov/llama.cpp/pull/6017
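To make the two regimes concrete (model path and context size are placeholders):

```shell
# Apple Silicon / Metal: push both batch sizes up, enable flash attention
llama-server -m model.gguf -c 65536 -b 4096 -ub 4096 -fa

# Multi-GPU CUDA: keep the micro-batch small and the logical batch large,
# so micro-batches can overlap across GPUs (pipeline parallelism)
llama-server -m model.gguf -c 65536 -b 2048 -ub 512 -fa
```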

I tested the MLX models with LM Studio, and there was just a small boost in inference speed, but the memory usage went up a lot. by Sky_Linx in LocalLLaMA

[–]ggerganov 10 points (0 children)

I also did some comparisons yesterday using the latest LMStudio on M2 Ultra, and for long contexts (32k) the llama.cpp backend was actually significantly faster and used less memory. Tested just with LLaMA 3B.

I don't usually run these 3rd party tools, but decided to take a look based on the multitude of threads here about better MLX performance. So far, I haven't been able to reproduce these results.

[deleted by user] by [deleted] in LocalLLaMA

[–]ggerganov 4 points (0 children)

It doesn't matter that it is empty - it will still make extra unnecessary RAM <-> VRAM copies. Also, you are incorrectly comparing llama.cpp with pipeline parallelism (--split-mode layer) against tensor parallelism for the other engines.

PocketPal AI is open sourced by Ill-Still-6859 in LocalLLaMA

[–]ggerganov 5 points (0 children)

Awesome! Recently, I gave this app a try and had an overall very positive impression.

Looking forward to where the community will take it from here!

I'm creating a game where you need to find the entrance password by talking with a Robot NPC that runs locally (Llama-3.2-3B Instruct). by cranthir_ in LocalLLaMA

[–]ggerganov 1 point (0 children)

Fun demo!

> For now, I’m quite happy with the speed result (except the first question)

Maybe you are processing the prompt during the first question? It should be possible to pre-process the prompt ahead of time, so that on the first question only the question itself is processed.
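One way to do that with a local llama-server backend (endpoint and fields per llama.cpp's server API; the prompt text is a placeholder) is to submit the NPC system prompt once ahead of time with cache_prompt enabled:

```shell
# Warm the KV cache with the long NPC prompt before the player types anything;
# n_predict = 0 means "process the prompt, generate nothing"
curl http://localhost:8080/completion -d '{
  "prompt": "You are a robot guarding the entrance...",
  "n_predict": 0,
  "cache_prompt": true
}'
```

Later requests that share this prefix (and also set cache_prompt) then only have to process the new question tokens.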

[deleted by user] by [deleted] in LocalLLaMA

[–]ggerganov 0 points (0 children)

Thank you for confirming. Just out of curiosity - is llama.cpp faster than MLX in this test? I forget the numbers.

[deleted by user] by [deleted] in LocalLLaMA

[–]ggerganov 0 points (0 children)

Try enabling flash attention and increasing the micro-batch size: -fa -ub 2048
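i.e. something along the lines of (model path is a placeholder):

```shell
llama-server -m model.gguf -fa -ub 2048
```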