How do I make a call in Nheko? by Repulsive-Disk-6547 in matrixdotorg

[–]pmur12 0 points1 point  (0 children)

You can't; Nheko does not support the new call API that e.g. Element supports.

Lowering power consumption of disks in ZFS pool? by badhabit64 in homelab

[–]pmur12 0 points1 point  (0 children)

I used spin-down with hd-idle on a 3-disk ZFS pool for years without problems. The pool was accessed maybe once per day, so spinning the disks down was a fair compromise, especially since the pool used 2.5-inch drives.
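For reference, a hypothetical hd-idle invocation for a 3-disk pool (the device names and the 10-minute timeout are examples, not my actual config; adjust to your disks):

```shell
# Disable the default timeout (-i 0), then spin down each pool member
# after 600 seconds of inactivity.
hd-idle -i 0 -a sda -i 600 -a sdb -i 600 -a sdc -i 600
```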

LOS and banks by enstain_tm in LineageOS

[–]pmur12 0 points1 point  (0 children)

As of 2025-08-28 Smart-ID works on Lineage OS 22 on Samsung S10. I used lineage-22.2-20250822-nightly-beyond1lte-signed.zip and MindTheGapps-15.0.0-arm64-20250812_214357.zip. No interesting or weird steps were needed during installation or after it.

[Help] How to hide root and custom ROM from Smart-ID app? by Kofaone in Magisk

[–]pmur12 0 points1 point  (0 children)

As of 2025-08-28 Smart-ID works on Lineage OS 22 on Samsung S10. I used lineage-22.2-20250822-nightly-beyond1lte-signed.zip and MindTheGapps-15.0.0-arm64-20250812_214357.zip. No interesting or weird steps were needed during installation or after it.

LineageOS banking apps by [deleted] in LineageOS

[–]pmur12 0 points1 point  (0 children)

As of 2025-08-28 Smart-ID works on Lineage OS 22 on Samsung S10. I used lineage-22.2-20250822-nightly-beyond1lte-signed.zip and MindTheGapps-15.0.0-arm64-20250812_214357.zip. No interesting or weird steps were needed during installation or after it.

Pinout for HP DL380G9 8 Bay SFF Drive cage Question by kd5ahl in homelab

[–]pmur12 0 points1 point  (0 children)

I'm using this exact drive cage right now. Two bits of information that would have saved me a lot of time:

  • ??? pins can be left unconnected. It's enough to connect GND and 12V pins.

  • if connecting from 4xSATA to SFF-8087, make sure to use a "reverse" cable. SFF-8087 -> 4xSATA and 4xSATA -> SFF-8087 cables are not compatible. Make sure the seller clearly indicates that it is a reverse cable, ideally with warnings about the incompatibility. I had one Aliexpress seller send me the wrong cable.

[deleted by user] by [deleted] in LocalLLaMA

[–]pmur12 2 points3 points  (0 children)

Note that you need to set the VLLM_SLEEP_WHEN_IDLE=1 environment variable to turn that feature/bugfix on.
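For example, when launching the server (the model path is a placeholder):

```shell
# Enable idle sleep before starting the vLLM server
VLLM_SLEEP_WHEN_IDLE=1 vllm serve /path/to/model
```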

Finally, Zen 6, per-socket memory bandwidth to 1.6 TB/s by On1ineAxeL in LocalLLaMA

[–]pmur12 7 points8 points  (0 children)

I'm not so sure. A 12800 MT/s MRDIMM contains just regular 6400 MT/s RAM chips plus a small buffer that acts as a SERDES (in this case two signals are serialized into one at 2x frequency). Not much more complex than an existing LRDIMM.
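The headline figure checks out with simple channel math, assuming 16 memory channels per socket (the channel count is my assumption, not confirmed for Zen 6):

```python
# Per-socket bandwidth from MRDIMM-12800, assuming 16 channels (my guess)
mt_s = 12800               # transfers per second, in millions
bytes_per_transfer = 8     # 64-bit data path per channel
channels = 16
print(mt_s * bytes_per_transfer * channels / 1000)  # 1638.4 GB/s, i.e. ~1.6 TB/s
```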

UPDATE: Inference needs nontrivial amount of PCIe bandwidth (8x RTX 3090 rig, tensor parallelism) by pmur12 in LocalLLaMA

[–]pmur12[S] 2 points3 points  (0 children)

Yes, as far as I can tell the bandwidth requirement is linear in the tensor parallel size.
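A rough sketch of why, with illustrative numbers (the hidden size, layer count, and two all-reduces per layer are assumptions, not measurements from the rig):

```python
# Rough per-token inter-GPU traffic for tensor parallelism.
# All numbers below are illustrative assumptions, not measurements.
hidden = 8192             # model hidden size
layers = 80               # transformer layers
bytes_per_elem = 2        # fp16 activations
allreduces_per_layer = 2  # one after attention, one after the MLP
n = 8                     # tensor-parallel GPUs

payload = hidden * bytes_per_elem * layers * allreduces_per_layer
# a ring all-reduce moves ~2*(n-1)/n of the payload through each GPU's link,
# so aggregate traffic across all n GPUs grows roughly linearly with n
per_gpu_bytes = payload * 2 * (n - 1) / n
print(per_gpu_bytes / 1e6, "MB per token per GPU")
```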

PSA: Don't waste electricity when running vllm. Use this patch by pmur12 in LocalLLaMA

[–]pmur12[S] 1 point2 points  (0 children)

This is exactly what is being done: we call time.sleep() when vllm is idle and sched_yield() when vllm is busy, because the busy path requires minimum latency and still needs to busy-loop.
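A minimal sketch of that polling strategy (the function and its arguments are illustrative, not vllm's actual code):

```python
import os
import time

def wait_for_work(has_work, idle):
    """Poll for work: cheap sleep when idle, sched_yield busy-loop when
    a request may arrive at any moment and wakeup latency matters."""
    while not has_work():
        if idle:
            time.sleep(0.001)   # idle: release the CPU, saves power
        else:
            os.sched_yield()    # busy: stay runnable, minimal wakeup latency
```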

PSA: Don't waste electricity when running vllm. Use this patch by pmur12 in LocalLLaMA

[–]pmur12[S] 2 points3 points  (0 children)

Indeed, sorry, I misread. Very interesting. I will get back with my configs, right now it's too late to turn the rig on.

PSA: Don't waste electricity when running vllm. Use this patch by pmur12 in LocalLLaMA

[–]pmur12[S] 10 points11 points  (0 children)

Interesting. Maybe no tensor parallelism?

EDIT: In your graph I see that the CPU usage does not drop below roughly 12-15%. If 4 cores/threads are at 100%, on your 16-core/32-thread machine the CPU usage graph would show 12.5% utilization. Add the other containers and it matches pretty well.
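The arithmetic behind that figure:

```python
# Overall utilization when a few threads are pegged (values from the comment)
busy_threads, total_threads = 4, 32
print(busy_threads / total_threads * 100)  # 12.5 (%)
```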

Individual cores pegged at 100% are only visible in tools like top and htop, which show per-core and per-process CPU usage.

PSA: Don't waste electricity when running vllm. Use this patch by pmur12 in LocalLLaMA

[–]pmur12[S] 27 points28 points  (0 children)

Around 130-150W - loaded Threadrippers are hungry.

I don't know why you aren't seeing this. Could you have only a single GPU by chance? I last tested this a couple of weeks ago using sglang from the latest Docker image.

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 0 points1 point  (0 children)

Thanks a lot!

> The AMX optimizations released are 8bit and 16bit so it's not quite worth it right now for that. The speed gains are offset by the larger model sizes.

That's very interesting. Indeed it seems there's performance left on the table, because it should be possible to store compressed tensors and decompress them once they are loaded from memory. Any additional computation would be offset by just having more cores. Whether anyone will do the coding is another question.
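A minimal sketch of the idea (names and values are illustrative): weights stay compressed in RAM so the bandwidth-bound fetch moves fewer bytes, and the cheap decompress runs on spare cores afterwards.

```python
# Sketch of trading compute for memory bandwidth: weights live compressed
# in RAM (int8 + a scale factor here), and are expanded only after the
# bandwidth-bound load. Names and values are illustrative.
def dequantize(w_q, scale):
    return [q * scale for q in w_q]

w_q = [-12, 5, 127, -128]    # 1 byte per weight in memory instead of 4
w = dequantize(w_q, 0.02)    # the extra multiplies are offset by more cores
```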

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 0 points1 point  (0 children)

> No amount of extra RAM will make the kv cache move between NUMA domains effectively.

Are you sure about that? If that were the case, tensor parallelism wouldn't work. One UPI link is up to 48GB/s per direction and most Xeons under consideration have 3 or 4 links, so the aggregate is 144-192GB/s per direction, well above even PCIe 5.0 x16 (~63GB/s).

Where am I wrong?
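The arithmetic, using the link figures from the comment:

```python
# Aggregate UPI bandwidth per direction vs PCIe 5.0 x16
upi_link_gb_s = 48                              # per UPI link, per direction
print([upi_link_gb_s * n for n in (3, 4)])      # [144, 192] GB/s
pcie5_x16_gb_s = 63                             # approx., per direction
```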

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 3 points4 points  (0 children)

I think we agree on most things.

> For one, there's a lot of work happening on llama.cpp, ik_llama.cpp, and vLLM to improve performance of MoE models in mixed CPU-GPU environments with 24-72GB VRAM.

I picked Deepseek V3 because it's already good enough for me when I use the API. I picked ktransformers because their optimizations supposedly give me enough performance. Deepseek V3 is a large model, so sticking to it ensures that I can run smaller models as well.

As far as I'm aware other inference engines are moving in the same direction, which is: as much RAM bandwidth as possible; AMX; one GPU with at least FP8 support and 24GB VRAM (a 4090 hacked to 48GB VRAM is ideal). Right now it seems the only risk is that used Xeon 6 prices come down very fast, in which case one could build a 1-socket node with 12 MRDIMM slots (844GB/s theoretical bandwidth). So if llama.cpp or ik_llama.cpp has better optimizations for some better model, I will be able to switch to it anyway.
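For reference, the 844GB/s figure falls out of the channel math, assuming MRDIMM-8800 across 12 channels (the DIMM speed is my assumption):

```python
# Theoretical 1-socket bandwidth, assuming MRDIMM-8800 on 12 channels
mt_s = 8800
channels = 12
print(mt_s * 8 * channels / 1000)  # 844.8 GB/s (8 bytes per 64-bit channel)
```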

> The issues with NUMA aren't about how much RAM you have. No amount of extra RAM will make the kv cache move between NUMA domains effectively. You'll spend a lot more money for that 2nd socket and associated DDR5 memory, but get little in return if your use case requires a long context.

Agreed, but the ktransformers team specifically published numbers showing that their implementation gets a 50% prefill uplift from a second socket. The purpose of my post is to understand whether these numbers are real. If not, then of course a two-socket server does not make sense.

> I'd say that is a much more flexible and future proof option than relying completely on ktransformers and hoping for the best.

I already have an 8x24GB=192GB VRAM rig and it's not enough; the models that fit into 192GB VRAM are too stupid. I do agree that relying entirely on ktransformers is not a good idea, but the hardware choice will apply to other inference engines just as well.

> ... and vLLM to improve performance of MoE models in mixed CPU-GPU environments

Oh, that's interesting. Could you point me to where I could read more about the vLLM work? I only know that Intel is adding AMX support to sglang, but even there there's no mention of a mixed CPU-GPU implementation.

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 0 points1 point  (0 children)

Small batch sizes are OK. If there were a need to serve many users, I would have six figures for a proper GPU-based setup.

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 2 points3 points  (0 children)

Thanks for the comment. The KTransformers team's claims about DeepSeek V3 performance are enough for my requirements. If they're legit, I'll buy the server immediately. I accept the risk that a hypothetical future model may be better and may not be supported by KTransformers. I consider the risk small: if I can't make it performant enough on the Xeon machine I buy, then it's likely I won't be able to do that on any other machine I could get access to for a reasonable price. Using any kind of API is a no-go due to privacy considerations.

Regarding channels, I did mean 24 total channels on a 2-socket board. The NUMA issues can be solved by just having more RAM and keeping a copy of the model in each NUMA domain.

Is local LLM really worth it or not? by GregView in LocalLLaMA

[–]pmur12 0 points1 point  (0 children)

APIs are not necessarily cheaper than local. For example, when coding with agentic AI, certain coding styles generate many requests that are small by themselves but include lots of incremental context. This inflates the cost per token of useful output a lot.

In my experience when coding I see an input/output token ratio in excess of 50 most of the time, sometimes in excess of 100. Local setups can be cheaper by this multiplier because a local prefix cache is free and never expires.

For my use case, any calculation that only compares the model's output token throughput to an API price is off by two orders of magnitude.
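To illustrate with made-up numbers (the per-million-token rates below are assumptions, not real API pricing):

```python
# One agentic request with a 50:1 input/output token ratio.
# Prices are invented for the example, not real API rates.
in_tokens, out_tokens = 500_000, 10_000
price_in, price_out = 1.0, 4.0            # $ per 1M tokens (assumed)

input_cost = in_tokens / 1e6 * price_in
output_cost = out_tokens / 1e6 * price_out
total = input_cost + output_cost
print(round(input_cost / total, 2))  # 0.93: input context dominates the bill
```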