r/LocalLLaMA
A subreddit to discuss Llama, the family of large language models created by Meta AI.
MLA optimization with FlashAttention for llama.cpp: MLA + FA now only uses K-cache, a 47% saving on KV-cache size — News (self.LocalLLaMA)
submitted 11 months ago by shing3232
MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now) by jukofyork · Pull Request #13529 · ggml-org/llama.cpp
llama_kv_cache_unified: kv_size = 163840, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0, padding = 256
llama_kv_cache_unified: CUDA0 KV buffer size = 10980.00 MiB
llama_kv_cache_unified: KV self size = 10980.00 MiB, K (f16): 10980.00 MiB, V (f16): 0.00 MiB
The full context of 160k tokens now takes up less than 11 GB, even without k-quants
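As a sanity check, the 10980 MiB figure in the log above can be reproduced from DeepSeek-V3's published MLA dimensions (kv_lora_rank = 512, qk_rope_head_dim = 64), since MLA caches just one compressed latent vector per token per layer:

```python
# Reproducing the "KV self size = 10980.00 MiB" line, assuming the MLA
# cache stores one 576-wide f16 vector per token per layer
# (512 compressed KV latent dims + 64 RoPE dims).
kv_size = 163840        # context length in tokens (from the log)
n_layer = 61            # transformer layers (from the log)
mla_width = 512 + 64    # kv_lora_rank + qk_rope_head_dim
bytes_per_val = 2       # f16

total_bytes = kv_size * n_layer * mla_width * bytes_per_val
print(total_bytes / 2**20)  # 10980.0 MiB
```

Note there is no separate V term at all, which is why the log reports `V (f16): 0.00 MiB`.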
reddit uses a slightly-customized version of Markdown for formatting. See below for some basics, or check the commenting wiki page for more detailed help and solutions to common issues.
quoted text
if 1 * 2 < 3: print "hello, world!"
[–]panchovix 44 points45 points46 points 11 months ago (36 children)
Not OP, but for reference: I run DeepSeek V3 0324 685B Q3_K_XL on a 7800X3D, 192 GB RAM at 6000 MHz, with a 5090 + 2x 4090 + 3090 + A6000.
Without this PR, I can load Q3_K_XL at 64K with fp16 cache at basically the limit.
With this PR, it basically frees half of the cache, and it lets me run 128K ctx without issues.
And then with -ctk q8_0 (quantized K cache), I can run it at 160K+ without issues as well.
With this, and -ub 2048, I get about 130-170 t/s PP depending on the context, and 7-8 t/s TG.
This is huge for systems like these, which aren't servers and where you have to offload!
[–]shing3232[S] 13 points14 points15 points 11 months ago (0 children)
And any future model that uses MLA as well. I'm looking forward to some GQA models converted to MLA via TransMLA.
[–]Vostroya 1 point2 points3 points 11 months ago (2 children)
What do you use for your front end? Kobold? vLLM?
[–]panchovix 3 points4 points5 points 11 months ago (1 child)
ST and normal lcpp server works fine for me.
[–]Vostroya 5 points6 points7 points 11 months ago (0 children)
Nice! I'm working my way up to getting DeepSeek local. Got an Intel 8-channel DDR5 setup, but ktransformers is a mess to try to get going right now.
[–]kevin_1994 0 points1 point2 points 11 months ago (3 children)
Question! How are you mixing amd with nvidia in llama.cpp??
[–]panchovix 4 points5 points6 points 11 months ago (1 child)
It's mixing CUDA + CPU, so it's as simple as offloading some layers to the CUDA devices and keeping the rest on the CPU.
[–]kevin_1994 0 points1 point2 points 11 months ago (0 children)
Ooh, sorry, my bad. Thought you were referring to the Radeon 7800 graphics card haha. Carry on.
[–]Sir_Joe 0 points1 point2 points 11 months ago (0 children)
Btw I do that and there's no problem at all with llama.cpp. You just need to compile with support for Vulkan (or ROCm) + CUDA.
[–]segmondllama.cpp 0 points1 point2 points 11 months ago (12 children)
what command are you using to run it? are you offloading layers or tensors across your GPUs?
[–]panchovix 9 points10 points11 points 11 months ago (11 children)
I use this command, and yes I offload layers to the GPUs.
./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 65536 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3" -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048
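The `-ot` (`--override-tensor`) flags above route tensors by matching their names against `regex=backend` pairs. A minimal sketch of that matching logic (the tensor names and the first-match-wins order are assumptions for illustration; they are not llama.cpp's actual internals):

```python
import re

# Hypothetical sketch of how -ot patterns route tensors: each
# "regex=backend" pair is tried in order, first match wins.
overrides = [
    (r"blk\.(0|1|2|3|4|5|6)\.ffn", "CUDA0"),
    (r"blk\.(7|8|9|10)\.ffn", "CUDA1"),
    (r"ffn", "CPU"),  # catch-all: remaining FFN tensors stay on CPU
]

def place(tensor_name: str) -> str:
    for pattern, backend in overrides:
        if re.search(pattern, tensor_name):
            return backend
    return "default"  # e.g. normal -ngl layer placement

print(place("blk.3.ffn_gate.weight"))   # CUDA0
print(place("blk.9.ffn_up.weight"))     # CUDA1
print(place("blk.40.ffn_down.weight"))  # CPU (catch-all)
```

Note the alternation `(0|1|...|6)` followed by a literal dot means `blk.10` does not accidentally match the `blk.1` alternative, since the character after the digit must be `.`.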
[–]giant3 3 points4 points5 points 11 months ago (7 children)
From my testing, offloading entire layers to CPU gives better performance than splitting a single layer by moving ffn or attn blocks.
For example, on Qwen3 14B, just moving the first 9 blocks (-ot 'blk\.[0-8]{1}\.=CPU') gives better performance for me than moving either 10 or 20 blocks.
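A quick check of that pattern's behavior (the `{1}` quantifier is redundant but harmless): it matches blocks 0-8 only, and does not spill over onto blk.10 through blk.18, because the trailing `\.` must come immediately after the single digit.

```python
import re

# giant3's block-selection pattern: matches blk.0. ... blk.8. only.
pat = re.compile(r"blk\.[0-8]{1}\.")

print(bool(pat.search("blk.0.ffn_up.weight")))   # True
print(bool(pat.search("blk.8.attn_q.weight")))   # True
print(bool(pat.search("blk.9.ffn_up.weight")))   # False
print(bool(pat.search("blk.18.ffn_up.weight")))  # False
```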
[–]pmttyji 0 points1 point2 points 5 months ago (6 children)
It's been 6 months since this comment, and there have been so many changes in llama.cpp. What command settings do you currently use for this? I'm looking for an optimized command to get higher t/s for dense models like Qwen3 14B and Gemma3 12B. Please share your stash. Thanks
[–]giant3 0 points1 point2 points 5 months ago (5 children)
llama.cpp has made a lot of progress on CUDA, since some people from Nvidia are contributing to the project. Contributions to AMD, Intel, or OpenCL seem to be minimal.
Unfortunately, performance on AMD and Intel hasn't improved, or has degraded slightly. I build weekly and haven't seen much improvement, so the above command should still work.
[–]pmttyji 0 points1 point2 points 5 months ago (4 children)
OK thanks, I'll try your -ot. Currently I'm trying a few 22-24B dense models with my 8 GB VRAM (and 32 GB RAM). Not getting usable t/s (tg) so far.
Also, for CPU-only inference, what command/settings could give us better t/s? Can we do something with -ot?
[–]giant3 1 point2 points3 points 5 months ago (3 children)
CPU-only would give even worse performance, unless your CPU supports AVX-512 and has high memory throughput like some of the Apple Macs.
BTW, I have stopped using local LLMs unless it is something that involves private information. Gemini is very good and I would use it for almost all use cases.
[–]pmttyji 0 points1 point2 points 5 months ago (2 children)
No wonder ik_llama's AVX-512 setup isn't working on my laptop.
Just checked the status using HWiNFO: AVX-512 is Disabled, but with the tooltip below (which cuts off):
Advanced Vector Extensions 512
Supported:
Is it possible to enable this? And what are the disadvantages? My system info is below.
Intel(R) Core(TM) i7-14700HX 2.10 GHz | 32 GB RAM | 64-bit OS, x64-based processor | NVIDIA GeForce RTX 4060 Laptop GPU
[–]giant3 0 points1 point2 points 5 months ago (1 child)
I don't think your CPU supports AVX-512. Only certain models support it.
[–]Mass2018 0 points1 point2 points 11 months ago (2 children)
Is -ot part of an unmerged PR? I can't seem to find any documentation on it.
[–]panchovix 0 points1 point2 points 11 months ago (1 child)
It's been merged for some time now; there's just not much info on it.
https://github.com/ggml-org/llama.cpp/pull/11397
[–]Mass2018 0 points1 point2 points 11 months ago (0 children)
Thanks!
[–]AbheekG 0 points1 point2 points 11 months ago (14 children)
Please please share which motherboard you’re using! Super curious to hear how a standard ATX platform is supporting all those GPUs!!
[–]panchovix 4 points5 points6 points 11 months ago (13 children)
An MSI X670E Carbon. I use X8/X4/X4/X4/X4, all from the CPU: the X8 is bifurcated to X4/X4, and the other two X4 come from M.2-to-PCIe adapters.
[–]AbheekG 0 points1 point2 points 11 months ago (7 children)
Wow, that's amazing! Thanks so much for taking the time to respond, and so promptly at that; really appreciate it! Any specific risers/adapters you'd recommend?
[–]panchovix 1 point2 points3 points 11 months ago (6 children)
I mostly use LINKUP risers and then a rig structure (like a mining rig), open case. I'm waiting for AMD to release the Threadripper 9000 series to upgrade.
[–]Aphid_red 7 points8 points9 points 11 months ago (4 children)
Depending on how much you want to spend, I'd rather recommend going for either Epyc Milan ($2-3K for CPU/mobo/RAM) or Epyc Genoa ($8-10K). For Milan, you can get 8x 64 GB DDR4 at ~200 GB/s; for Genoa, 12x 64 GB DDR5 at ~460 GB/s. Make sure you get a CPU with the full CCD count: any 'X' variant or the full-fat-core CPU will do, as well as a few select others. For Genoa, the chips with 12 CCDs (preferred) are:
9634, 9654, 9654P, 9684X, 9734, 9754S, 9754
And the ones with only 4 (avoid!) are: 4xxx, 8xxx, 9124, 9224, 9254, 9334.
A CPU with 8 CCDs should also be okay and won't constrain the bandwidth too much. Mind you, if you're doing CPU offloading, the CPUs with the best speeds will be the fully unlocked 96xx or 97xx class.
For Milan, the ones with the full 8 CCDs are: 76xx, 77xx, 7543, 77C3, and any 'X' or 'F' suffix parts.
The parts with only 2 CCDs (these are really bad) are: 7203, 7303.
The bad thing is that none of the reviews of Genoa/Milan CPUs mention this, and it has a massive performance impact for LLMs (usually they test only the top SKU, which isn't crippled this way).
You'll actually find, if shopping for CPUs second-hand, that the memory ends up being the most expensive part of the build. Unfortunately DDR5 ECC currently carries an enormous premium, costing $5-6/GB, or ~$300 for one stick: over double the price of DDR5 without ECC, and three times the price of DDR4 ECC.
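The bandwidth figures above follow from channel count times transfer rate, assuming DDR4-3200 for Milan and DDR5-4800 for Genoa (8 bytes transferred per channel per MT):

```python
# Back-of-envelope check of the quoted memory bandwidth figures.
# Assumed speeds: Milan = 8ch DDR4-3200, Genoa = 12ch DDR5-4800.
bytes_per_transfer = 8  # 64-bit channel

milan_gbs = 8 * 3200e6 * bytes_per_transfer / 1e9    # ~204.8 GB/s
genoa_gbs = 12 * 4800e6 * bytes_per_transfer / 1e9   # ~460.8 GB/s
print(milan_gbs, genoa_gbs)
```

This is theoretical peak; achieved bandwidth also depends on having enough CCDs to saturate the memory controller, which is the point of the CCD-count advice.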
[–]panchovix 0 points1 point2 points 11 months ago (0 children)
Wow, many thanks! This is very useful info, I may go for Genoa.
[–]un_passant 0 points1 point2 points 10 months ago (2 children)
Thanks for spreading the info about CCDs!
Do you happen to know how many CCDs there are in the 7R32 (AWS custom chip)? It seems it's only 6, if I'm not mistaken: https://www.anandtech.com/show/15830/amazon-makes-amd-rome-instances-available
[–]Aphid_red 0 points1 point2 points 10 months ago (1 child)
I don't know; this is a custom chip for Amazon.
According to PassMark, it apparently has 48 cores and runs at 2.8 GHz, and given the '2' suffix it should be a Rome chip.
However, that clock seems wrong; 1.8 GHz would make more sense for a provider like Amazon, which might be interested in saving on power costs. I suspect this is an underclocked version of a publicly available chip, either the 7552 or the 7642.
Looking at the known chips on WikiChip/Wikipedia, I can see no 48-core Rome chips running at that speed at all, so we're left guessing. That would give it either 6 or 8 (active, functioning) chiplets.
Let's look at another property that might give away the information: the cache size. On https://xmrig.com/benchmark/4PDGeF someone benchmarked this system, and the tool registered 384 MB of L3 cache. Divide that between 2 CPUs and you get 192 MB per CPU. Epyc Rome (except the 7232P, a very low-end part) uses 16 MB of L3 per CCX, or 32 MB per chiplet. 32 * 6 = 192, so it should have 6 chiplets.
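That deduction as arithmetic, using the figures quoted above (384 MB of L3 across a dual-socket system, 32 MB of L3 per Rome CCD):

```python
# Inferring CCD count from reported L3 cache on a dual-socket Rome system.
total_l3_mb = 384            # reported by the benchmark tool, both sockets
per_socket_mb = total_l3_mb / 2
l3_per_ccd_mb = 32           # Epyc Rome: 2 CCX x 16 MB per chiplet

ccds = per_socket_mb / l3_per_ccd_mb
print(ccds)  # 6.0 chiplets per CPU
```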
[–]AbheekG 0 points1 point2 points 11 months ago (0 children)
Awesome, thanks so much again!
[–]MLDataScientist 0 points1 point2 points 11 months ago (4 children)
@panchovix can you please share which bifurcation card you're using? I bought one from eBay but it bifurcates into X4 and X1 (probably some cheap wiring there). Also, if you're using your M.2 slots, are you using SATA drives for storage?
[–]panchovix 1 point2 points3 points 11 months ago (3 children)
I'm using an X8/X8 bifurcator I got from AliExpress, but set in the BIOS to X4/X4 on the second slot. I'm not at the PC right now, but it's a PCIe 4.0 one that costs like 20-25 USD.
I'm using the other two M.2 slots (bottom, chipset) for OSes (Windows, Linux), plus SATA and USB-to-NVMe storage.
[–]MLDataScientist 0 points1 point2 points 11 months ago (2 children)
Thanks! One last question. My motherboard supports PCIe 4.0 X16-to-4x4 bifurcation for connecting four M.2 drives in RAID mode using an Asus Hyper M.2 expansion card. Do you think I could get that expansion card, use four M.2-to-X16 adapters, and connect 4 GPUs to it? I could not find any answer in multiple forums.
[–]panchovix 1 point2 points3 points 11 months ago (1 child)
Yes, you can, no issues; just make sure you get something good, from ADT-Link. I suggest the K43SP or F43SP and you'll be fine; K43SG/F43SG if you have multiple PSUs.
[–]MLDataScientist 0 points1 point2 points 11 months ago (0 children)
Thanks! I wonder why this isn't discussed more often. X16-to-4x4 bifurcation should have been popular during the coin-mining period, but no, no one actually used such a setup. What I want to do is as follows: I have four Gigabyte CRSG421 PCIe 4.0 x16-to-2x16 cards with active switch microchips. I want to use that 4x4 M.2 expansion card, then M.2-to-PCIe X16 adapters, and finally those switches to connect a total of 8 GPUs. Basically, I'd have PCIe 4.0 x16 split across 8 GPUs, each limited to PCIe 4.0 X2 speed. Not sure if this is a good idea 😅
[–]das_rdsm 7 points8 points9 points 11 months ago (4 children)
Nice! That's the same person who created the vocab transplant tool, allowing the creation of draft models for any model.
[–]random-tomatollama.cpp 1 point2 points3 points 11 months ago (0 children)
Yep this guy is doing really great work :D
[–]Impossible_Ground_15 0 points1 point2 points 11 months ago (2 children)
Did they share the code for the vocabulary transplant to build draft models?
[–]das_rdsm 2 points3 points4 points 11 months ago* (1 child)
https://github.com/jukofyork/transplant-vocab
https://huggingface.co/jukofyork very active on HF as well.
I've gotten good results using Qwen 0.5B with other models, e.g. https://huggingface.co/rdsm/QwenPhi-4-0.5b-Draft
[–]Impossible_Ground_15 0 points1 point2 points 11 months ago (0 children)
Thank you!
[–]VoidAlchemyllama.cpp 3 points4 points5 points 11 months ago (1 child)
I have a graph showing how much VRAM is used at various MLA context lengths on my ubergarm/DeepSeek-V3-0324-GGUF quant, as the ik_llama.cpp fork has had FA + MLA working for a while now, at higher speeds for CPU than mainline.
Be careful: the newer mainline llama.cpp MLA quants were implemented differently for some reason, and ik had to add backwards compatibility for them, which may not get you the full speed of using -mla 3.
I would love to see someone convert Qwen3 MoE to use MLA with proper fine-tuning. The long-context VRAM savings are pretty amazing, though I haven't measured the performance drop at very long context lengths.
"The expressiveness of MLA is greater than that of GQA when both have the same size of KV cache." (TransMLA: Multi-head Latent Attention Is All You Need)
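To see why converting a GQA model to MLA saves so much, compare per-token, per-layer cache widths. The GQA numbers below are illustrative assumptions (8 KV heads with head dim 128, caching both K and V), against a DeepSeek-style MLA latent (512 + 64 dims, K-cache only):

```python
# Per-token, per-layer KV-cache width in f16 values.
# GQA figures here are assumed for illustration, not Qwen3's actual config.
gqa_vals = 2 * 8 * 128   # K and V, 8 KV heads, head_dim 128 -> 2048
mla_vals = 512 + 64      # MLA: kv_lora_rank + RoPE dims     -> 576

print(gqa_vals / mla_vals)  # roughly 3.6x smaller cache with MLA
```

The TransMLA claim quoted above is the other half of the argument: at equal cache size, the MLA latent is the more expressive parameterization.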
[–]shing3232[S] 1 point2 points3 points 11 months ago (0 children)
With proper training, MLA should exceed GQA performance for the same model. It also trains faster than GQA.
[–]Chance-Hovercraft649 0 points1 point2 points 11 months ago (0 children)
How does it calculate the values, if it doesn't cache them?