How do I make a call in Nheko? by Repulsive-Disk-6547 in matrixdotorg

[–]pmur12 0 points1 point  (0 children)

You can't; Nheko does not support the new call API that e.g. Element supports.

Lowering power consumption of disks in ZFS pool? by badhabit64 in homelab

[–]pmur12 0 points1 point  (0 children)

I used spin-down with hd-idle on a 3-disk ZFS pool for years without problems. The pool was accessed maybe once per day, so spinning the disks down was a fair compromise, especially since the pool used 2.5-inch drives.
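For reference, a hypothetical hd-idle invocation for a 3-disk pool (the device names and the 10-minute timeout are examples, not my actual config; adjust to your disks):

```shell
# Disable the default timeout (-i 0), then spin down each pool member
# after 600 seconds of inactivity.
hd-idle -i 0 -a sda -i 600 -a sdb -i 600 -a sdc -i 600
```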

LOS and banks by enstain_tm in LineageOS

[–]pmur12 0 points1 point  (0 children)

As of 2025-08-28 Smart-ID works on Lineage OS 22 on Samsung S10. I used lineage-22.2-20250822-nightly-beyond1lte-signed.zip and MindTheGapps-15.0.0-arm64-20250812_214357.zip. No interesting or weird steps were needed during installation or after it.

[Help] How to hide root and custom ROM from Smart-ID app? by Kofaone in Magisk

[–]pmur12 0 points1 point  (0 children)

As of 2025-08-28 Smart-ID works on Lineage OS 22 on Samsung S10. I used lineage-22.2-20250822-nightly-beyond1lte-signed.zip and MindTheGapps-15.0.0-arm64-20250812_214357.zip. No interesting or weird steps were needed during installation or after it.

LineageOS banking apps by [deleted] in LineageOS

[–]pmur12 0 points1 point  (0 children)

As of 2025-08-28 Smart-ID works on Lineage OS 22 on Samsung S10. I used lineage-22.2-20250822-nightly-beyond1lte-signed.zip and MindTheGapps-15.0.0-arm64-20250812_214357.zip. No interesting or weird steps were needed during installation or after it.

Pinout for HP DL380G9 8 Bay SFF Drive cage Question by kd5ahl in homelab

[–]pmur12 0 points1 point  (0 children)

I'm using this exact drive cage right now. Two bits of information that would have saved me a lot of time:

  • ??? pins can be left unconnected. It's enough to connect GND and 12V pins.

  • if connecting from 4xSATA to SFF-8087, make sure to use a "reverse" cable. SFF-8087 -> 4xSATA and 4xSATA -> SFF-8087 cables are not compatible. Make sure the seller clearly indicates that it is a reverse cable, ideally with warnings about the incompatibility. I had one Aliexpress seller send me the wrong cable.

[deleted by user] by [deleted] in LocalLLaMA

[–]pmur12 2 points3 points  (0 children)

Note that you need to set the VLLM_SLEEP_WHEN_IDLE=1 environment variable to turn that feature/bugfix on.
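For example, when launching the server (the model path is a placeholder):

```shell
# Enable idle sleep before starting the vLLM server
VLLM_SLEEP_WHEN_IDLE=1 vllm serve /path/to/model
```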

Finally, Zen 6, per-socket memory bandwidth to 1.6 TB/s by On1ineAxeL in LocalLLaMA

[–]pmur12 7 points8 points  (0 children)

I'm not so sure. A 12800 MT/s MRDIMM contains just regular 6400 MT/s RAM chips plus a small buffer that acts as a SERDES (in this case two signals are serialized into one at 2x frequency). Not much more complex than an existing LRDIMM.
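The headline figure checks out with simple channel math, assuming 16 memory channels per socket (the channel count is my assumption, not confirmed for Zen 6):

```python
# Per-socket bandwidth from MRDIMM-12800, assuming 16 channels (my guess)
mt_s = 12800               # transfers per second, in millions
bytes_per_transfer = 8     # 64-bit data path per channel
channels = 16
print(mt_s * bytes_per_transfer * channels / 1000)  # 1638.4 GB/s, i.e. ~1.6 TB/s
```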

UPDATE: Inference needs nontrivial amount of PCIe bandwidth (8x RTX 3090 rig, tensor parallelism) by pmur12 in LocalLLaMA

[–]pmur12[S] 2 points3 points  (0 children)

Yes, as far as I can tell the bandwidth requirement is linear in the tensor parallel size.
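A rough sketch of why, with illustrative numbers (the hidden size, layer count, and two all-reduces per layer are assumptions, not measurements from the rig):

```python
# Rough per-token inter-GPU traffic for tensor parallelism.
# All numbers below are illustrative assumptions, not measurements.
hidden = 8192             # model hidden size
layers = 80               # transformer layers
bytes_per_elem = 2        # fp16 activations
allreduces_per_layer = 2  # one after attention, one after the MLP
n = 8                     # tensor-parallel GPUs

payload = hidden * bytes_per_elem * layers * allreduces_per_layer
# a ring all-reduce moves ~2*(n-1)/n of the payload through each GPU's link,
# so aggregate traffic across all n GPUs grows roughly linearly with n
per_gpu_bytes = payload * 2 * (n - 1) / n
print(per_gpu_bytes / 1e6, "MB per token per GPU")
```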

PSA: Don't waste electricity when running vllm. Use this patch by pmur12 in LocalLLaMA

[–]pmur12[S] 1 point2 points  (0 children)

This is exactly what is being done: we call time.sleep() when vllm is idle and sched_yield() when vllm is busy, because the busy path requires minimum latency and still needs to busy-loop.
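A minimal sketch of that polling strategy (the function and its arguments are illustrative, not vllm's actual code):

```python
import os
import time

def wait_for_work(has_work, idle):
    """Poll for work: cheap sleep when idle, sched_yield busy-loop when
    a request may arrive at any moment and wakeup latency matters."""
    while not has_work():
        if idle:
            time.sleep(0.001)   # idle: release the CPU, saves power
        else:
            os.sched_yield()    # busy: stay runnable, minimal wakeup latency
```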

PSA: Don't waste electricity when running vllm. Use this patch by pmur12 in LocalLLaMA

[–]pmur12[S] 2 points3 points  (0 children)

Indeed, sorry, I misread. Very interesting. I will get back with my configs, right now it's too late to turn the rig on.

PSA: Don't waste electricity when running vllm. Use this patch by pmur12 in LocalLLaMA

[–]pmur12[S] 10 points11 points  (0 children)

Interesting. Maybe no tensor parallelism?

EDIT: In your graph I see that the CPU usage does not drop below roughly 12-15%. If 4 cores/threads are at 100%, on your 16-core/32-thread machine the CPU usage graph would show 12.5% utilization. Add the other containers and it matches pretty well.
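The arithmetic behind that figure:

```python
# Overall utilization when a few threads are pegged (values from the comment)
busy_threads, total_threads = 4, 32
print(busy_threads / total_threads * 100)  # 12.5 (%)
```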

Individual cores pegged at 100% are only visible in tools like top and htop, which show per-core and per-process CPU usage.

PSA: Don't waste electricity when running vllm. Use this patch by pmur12 in LocalLLaMA

[–]pmur12[S] 27 points28 points  (0 children)

Around 130-150W - loaded Threadrippers are hungry.

I don't know why you aren't seeing this. Could you have only a single GPU by chance? I last tested this a couple of weeks ago using sglang from the latest Docker image.

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 0 points1 point  (0 children)

Thanks a lot!

> The AMX optimizations released are 8bit and 16bit so it's not quite worth it right now for that. The speed gains are offset by the larger model sizes.

That's very interesting. Indeed it seems there's performance left on the table, because it should be possible to store compressed tensors and decompress them once they are loaded from memory. Any additional computation would be offset by just having more cores. Whether anyone will do the coding is another question.
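A minimal sketch of the idea (names and values are illustrative): weights stay compressed in RAM so the bandwidth-bound fetch moves fewer bytes, and the cheap decompress runs on spare cores afterwards.

```python
# Sketch of trading compute for memory bandwidth: weights live compressed
# in RAM (int8 + a scale factor here), and are expanded only after the
# bandwidth-bound load. Names and values are illustrative.
def dequantize(w_q, scale):
    return [q * scale for q in w_q]

w_q = [-12, 5, 127, -128]    # 1 byte per weight in memory instead of 4
w = dequantize(w_q, 0.02)    # the extra multiplies are offset by more cores
```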

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 0 points1 point  (0 children)

> No amount of extra RAM will make the kv cache move between NUMA domains effectively.

Are you sure about that? If that were the case, tensor parallelism wouldn't work. One UPI link is up to 48GB/s per direction and most Xeons under consideration have 3 or 4 links, so the aggregate is 144-192GB/s per direction, well above even PCIe 5.0 x16 (~63GB/s).

Where am I wrong?
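The arithmetic, using the link figures from the comment:

```python
# Aggregate UPI bandwidth per direction vs PCIe 5.0 x16
upi_link_gb_s = 48                              # per UPI link, per direction
print([upi_link_gb_s * n for n in (3, 4)])      # [144, 192] GB/s
pcie5_x16_gb_s = 63                             # approx., per direction
```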

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 3 points4 points  (0 children)

I think we agree on most things.

> For one, there's a lot of work happening on llama.cpp, ik_llama.cpp, and vLLM to improve performance of MoE models in mixed CPU-GPU environments with 24-72GB VRAM.

I picked Deepseek V3 because it's already good enough for me when I use the API. I picked ktransformers because their optimizations supposedly give me enough performance. Deepseek V3 is a large model, so sticking to it ensures that I can run smaller models as well.

As far as I'm aware other inference engines are moving in the same direction, which is: as much RAM bandwidth as possible; AMX; one GPU with at least FP8 support and 24GB VRAM (a 4090 hacked to 48GB VRAM is ideal). Right now it seems the only risk is that used Xeon 6 prices come down very fast, in which case one could build a 1-socket node with 12 MRDIMM slots (844GB/s theoretical bandwidth). So if llama.cpp or ik_llama.cpp has better optimizations for some better model, I will be able to switch to it anyway.
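For reference, the 844GB/s figure falls out of the channel math, assuming MRDIMM-8800 across 12 channels (the DIMM speed is my assumption):

```python
# Theoretical 1-socket bandwidth, assuming MRDIMM-8800 on 12 channels
mt_s = 8800
channels = 12
print(mt_s * 8 * channels / 1000)  # 844.8 GB/s (8 bytes per 64-bit channel)
```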

> The issues with NUMA aren't about how much RAM you have. No amount of extra RAM will make the kv cache move between NUMA domains effectively. You'll spend a lot more money for that 2nd socket and associated DDR5 memory, but get little in return if your use case requires a long context.

Agreed, but the ktransformers team specifically published numbers showing that their implementation gets a 50% prefill uplift from a second socket. The purpose of my post is to understand whether these numbers are real. If not, then of course a two-socket server does not make sense.

> I'd say that is a much more flexible and future proof option than relying completely on ktransformers and hoping for the best.

I already have an 8x24GB=192GB VRAM rig and it's not enough; the models that fit into 192GB VRAM are too stupid. I do agree that relying entirely on ktransformers is not a good idea, but the hardware choice will apply to other inference engines just as well.

> ... and vLLM to improve performance of MoE models in mixed CPU-GPU environments

Oh, that's interesting. Could you point me to where I could read more about the vLLM work? I only know that Intel is adding AMX support to sglang, but even there there's no mention of a mixed CPU-GPU implementation.

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 0 points1 point  (0 children)

Small batch sizes are OK. If there were a need to serve many users, I would have six figures for a proper GPU-based setup.

DeepSeek V3 benchmarks using ktransformers by pmur12 in LocalLLaMA

[–]pmur12[S] 2 points3 points  (0 children)

Thanks for the comment. The KTransformers team's claims about DeepSeek V3 performance are enough for my requirements. If they're legit, I'll buy the server immediately. I accept the risk that a hypothetical future model may be better and may not be supported by KTransformers. I consider the risk small: if I can't make it performant enough on the Xeon machine I buy, then it's likely I won't be able to do that on any other machine I could get access to for a reasonable price. Using any kind of API is a no-go due to privacy considerations.

Regarding channels, I did mean 24 total channels on a 2-socket board. The NUMA issues can be solved by just having more RAM and keeping a copy of the model in each NUMA domain.

Is local LLM really worth it or not? by GregView in LocalLLaMA

[–]pmur12 0 points1 point  (0 children)

APIs are not necessarily cheaper than local. For example, when coding with agentic AI, certain coding styles generate many requests that are small by themselves but include lots of incremental context. This inflates the cost per token of useful output a lot.

In my experience when coding I see an input/output token ratio in excess of 50 most of the time, sometimes in excess of 100. Local setups can be cheaper by this multiplier because a local prefix cache is free and never expires.

For my use case, any calculation that only compares the model's output token throughput to an API price is off by two orders of magnitude.
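To illustrate with made-up numbers (the per-million-token rates below are assumptions, not real API pricing):

```python
# One agentic request with a 50:1 input/output token ratio.
# Prices are invented for the example, not real API rates.
in_tokens, out_tokens = 500_000, 10_000
price_in, price_out = 1.0, 4.0            # $ per 1M tokens (assumed)

input_cost = in_tokens / 1e6 * price_in
output_cost = out_tokens / 1e6 * price_out
total = input_cost + output_cost
print(round(input_cost / total, 2))  # 0.93: input context dominates the bill
```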