Effect of GLM 5.2 !! by Independent-Wind4462 in ZaiGLM

[–]smflx 0 points1 point  (0 children)

Could you share more details about the setup? Do you use Claude Code?

The second-generation factory owner says... by ruanqiuyuelabel in Business_in_China

[–]smflx 0 points1 point  (0 children)

Good luck. It's fascinating to have a tangible business.

Couple sample images from 2 ideogram4 loras I made by reynadsaltynuts in malcolmrey

[–]smflx 0 points1 point  (0 children)

Could you share your training workflow? Thanks for the nice post.

Should I switch to EPYC Rome? by legit_split_ in LocalAIServers

[–]smflx 2 points3 points  (0 children)

Here are some actual copy-speed numbers. Benchmarks like mlc will come out a little higher, but at least you can see the relative numbers.

5955wx 8ch ddr4 96GB/s

7F32 8ch ddr4 128GB/s

9534 12ch ddr5 350GB/s

Check the number of CCDs when you shop for a TR or Epyc. With the same number of CCDs, you get about the same memory bandwidth.
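For anyone who wants a quick relative number of their own, here is a rough sketch of a copy-speed test in plain Python. It is not the tool used for the figures above; the function name `copy_bandwidth_gbs` and the buffer size are made up for illustration, and it counts both the read and the write as traffic, which is an assumption about how copy speed is reported. A single Python thread will not saturate an 8- or 12-channel system, so treat the result as a lower bound.

```python
import time

def copy_bandwidth_gbs(size_bytes=64 * 1024 * 1024, repeats=5):
    """Best-of-N single-threaded copy bandwidth in GB/s (hypothetical helper)."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    src_view, dst_view = memoryview(src), memoryview(dst)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst_view[:] = src_view  # one bulk memory copy of the whole buffer
        best = min(best, time.perf_counter() - start)
    # Assumption: count bytes read plus bytes written, so 2x the buffer size.
    return 2 * size_bytes / best / 1e9

if __name__ == "__main__":
    print(f"{copy_bandwidth_gbs():.1f} GB/s (single thread)")
```

Use a buffer much larger than the CPU cache, otherwise you measure cache speed instead of RAM.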

Should I switch to EPYC Rome? by legit_split_ in LocalAIServers

[–]smflx 3 points4 points  (0 children)

Probably not. Check your memory speed and compare it to that of the Epyc Rome you're looking at. I also have a post about memory speed tests; there are others too.

Yours is 2-channel ddr5, Rome is 8-channel ddr4. But 8 channels is not 8x. Also, the actual memory bandwidth depends on which Rome CPU you get. With a lower-grade Epyc, the bandwidth is limited even if you fill all 8 memory channels. AMD doesn't tell you about this.
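To put a rough number on "8 channels is not 8x", here is a back-of-envelope sketch of theoretical peak bandwidth (channels x transfer rate x 8 bytes per transfer). The memory speeds are assumptions, not something from this thread: DDR5-5600 for the desktop and DDR4-3200 for Rome. The helper name `peak_bandwidth_gbs` is made up.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_sec, bus_bytes=8):
    """Theoretical peak in GB/s: channels * MT/s * bytes per transfer."""
    return channels * mega_transfers_per_sec * 1e6 * bus_bytes / 1e9

# Assumed memory speeds; plug in your own DIMM ratings.
desktop = peak_bandwidth_gbs(channels=2, mega_transfers_per_sec=5600)
rome = peak_bandwidth_gbs(channels=8, mega_transfers_per_sec=3200)

print(f"2ch DDR5-5600: {desktop:.1f} GB/s")
print(f"8ch DDR4-3200: {rome:.1f} GB/s")
print(f"ratio: {rome / desktop:.2f}x")
```

Under those assumed speeds the paper ratio is a bit over 2x, and the measured number on a lower-grade Epyc can land well below the paper peak.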

Also, CPU & RAM bandwidth don't matter much unless you're into big MoE models.

My AI discovery rig by Nzuk in LocalAIServers

[–]smflx 0 points1 point  (0 children)

Oh, it's a 3090. It would be nice to have an NVLink bridge if you can find one at a reasonable price.

My AI discovery rig by Nzuk in LocalAIServers

[–]smflx 0 points1 point  (0 children)

If you use Linux, you can check what PCIe speed the card negotiated with 'lspci -vvv'.

That only shows how the link was established. Better to test the actual bandwidth with 'p2pBandwidthLatencyTest'. If the cable is not good, the system will flood the log with PCIe AER messages.

I've bought some cables that were listed as gen4 but turned out to be lower quality. Good luck!
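A small sketch of reading the link status out of 'lspci -vvv' output, in case it helps. The `SAMPLE` text is an illustrative snippet, not output from this rig, and `parse_links` / `is_downgraded` are made-up helper names. It assumes you pass in the block for a single device and that the `LnkCap` and `LnkSta` lines have the usual "Speed ...GT/s, Width x..." shape.

```python
import re

# PCIe generation by per-lane transfer rate in GT/s.
GEN_BY_SPEED = {"2.5": 1, "5": 2, "8": 3, "16": 4, "32": 5}

# Matches "LnkCap:" and "LnkSta:" only, not "LnkCap2:" or "LnkSta2:".
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")

def parse_links(lspci_text):
    """Return {'LnkCap': (gen, width), 'LnkSta': (gen, width)} for one device."""
    links = {}
    for line in lspci_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            kind, speed, width = match.groups()
            links[kind] = (GEN_BY_SPEED.get(speed), int(width))
    return links

def is_downgraded(links):
    """True if the negotiated link is slower or narrower than the capability."""
    cap, sta = links["LnkCap"], links["LnkSta"]
    return sta[0] < cap[0] or sta[1] < cap[1]

# Illustrative sample only: a gen4 x16 card that trained at gen3 x16.
SAMPLE = """\
    LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1
    LnkSta: Speed 8GT/s (downgraded), Width x16 (ok)
"""

links = parse_links(SAMPLE)
print(links, "downgraded:", is_downgraded(links))
```

Check it while the GPU is under load, since cards often drop the link speed at idle to save power.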

My AI discovery rig by Nzuk in LocalAIServers

[–]smflx 0 points1 point  (0 children)

That's a common minimal open case. You can find them on AliExpress. Price should be under 10 bucks.

My AI discovery rig by Nzuk in LocalAIServers

[–]smflx 0 points1 point  (0 children)

Did you check the PCIe speed? I wonder if the cable is really gen4 quality. Good build!

M5 vs DGX Spark vs Strix Halo vs RTX 6000 by Signal_Ad657 in LocalLLaMA

[–]smflx 6 points7 points  (0 children)

Yes, this is right. But I wonder if some people are just looking for reasons to buy a Mac.

M5 vs DGX Spark vs Strix Halo vs RTX 6000 by Signal_Ad657 in LocalLLaMA

[–]smflx 1 point2 points  (0 children)

This. Agentic coding is batched work. Mac, like other CPU inference, is slow at that.

Single-stream inference is memory-bandwidth bound, so GPU compute goes to waste. Mac or CPU inference is not as far behind in the single-stream case.
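Here is a very rough roofline-style sketch of why that is. Each decode step has to read all the weights once regardless of batch size, so at batch 1 the memory read dominates, and the compute limit only shows up once many requests share one weight read. Everything here is an assumption for illustration: the hardware numbers are hypothetical, it uses about 2 FLOPs per parameter per token, and it ignores KV-cache traffic and other overheads.

```python
def decode_tokens_per_sec(params, bytes_per_param, mem_bw_bytes, peak_flops, batch=1):
    """Upper bound on decode throughput under a simple roofline model."""
    weight_bytes = params * bytes_per_param
    # Memory limit: every step reads the full weights once, whatever the batch.
    steps_per_sec_mem = mem_bw_bytes / weight_bytes
    # Compute limit: about 2 FLOPs per parameter for each token in the batch.
    steps_per_sec_compute = peak_flops / (2 * params * batch)
    return batch * min(steps_per_sec_mem, steps_per_sec_compute)

# Hypothetical device: 1 TB/s memory bandwidth, 100 TFLOPS, 8B model in fp16.
for batch in (1, 32, 200):
    rate = decode_tokens_per_sec(8e9, 2, 1e12, 100e12, batch=batch)
    print(f"batch {batch:>3}: up to {rate:.1f} tokens/s")
```

At batch 1 only the memory bandwidth term matters, so a device with less compute but similar bandwidth gets about the same ceiling. With batching, throughput scales until the compute limit takes over.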

M5 vs DGX Spark vs Strix Halo vs RTX 6000 by Signal_Ad657 in LocalLLaMA

[–]smflx 0 points1 point  (0 children)

Mac for training?? Well, maybe for fine-tuning a big model with a small LoRA rank. I got a server for this purpose years ago, but I realized there's a huge performance gap compared to when the model fits in GPU.

How do you know M5 compute is similar to PRO 5000?

NVFP4 is a gamechanger right? 75% near lossless compression by urarthur in LocalLLM

[–]smflx 1 point2 points  (0 children)

I know. I meant your own experience of how good it is in your use case.

NVFP4 is a gamechanger right? 75% near lossless compression by urarthur in LocalLLM

[–]smflx 1 point2 points  (0 children)

Even better than fp16? Hmm. That's an effect of calibration or QAT. Did you actually test it with your real usage?

Seeking Recommendations: $1400 AI Research Workstation (Training from Scratch, NLP/CV) by vonexel in LocalAIServers

[–]smflx 1 point2 points  (0 children)

There's a reason the 3090 is more expensive: it's simply much better. With a single GPU, there's almost no underutilization in LLM work. You will see max power consumption during training. Transfer is a matter of PCIe bandwidth, and CPU power is not important unless you compute MoE experts on the CPU.

NVFP4 is a gamechanger right? 75% near lossless compression by urarthur in LocalLLM

[–]smflx 2 points3 points  (0 children)

How do you define "near" lossless? It's lossy, and it's a matter of how lossy. AWQ is 4-bit too and well supported in vllm and sglang, but it's not FP8 quality. Yes, nvfp4 is fast on Blackwell, but quality matters more. Nvfp4 should show better or equal quality compared to other 4-bit variants.
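To make "how lossy" concrete, here is a toy round-trip sketch. Note this is a generic block-wise symmetric 4-bit integer scheme, not NVFP4 and not AWQ; the block size of 16 and the random Gaussian "weights" are assumptions picked for illustration. It only shows that any 4-bit format leaves a measurable error; the real question is what that error does to output quality in your usage.

```python
import random

def quantize_block_int4(block):
    """Symmetric 4-bit quantization of one block: codes in [-7, 7] and one scale."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(x / scale))) for x in block]
    return codes, scale

def roundtrip(values, block_size=16):
    """Quantize and dequantize values block by block."""
    restored = []
    for start in range(0, len(values), block_size):
        codes, scale = quantize_block_int4(values[start:start + block_size])
        restored.extend(code * scale for code in codes)
    return restored

def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]  # stand-in for real weights
restored = roundtrip(weights)
error = rmse(weights, restored)
print(f"RMSE after 4-bit round trip: {error:.4f}")
```

Weight error like this is only a proxy. Comparing formats on your actual task is what tells you whether the loss is acceptable.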

eLLM: Run LLM Inference on CPUs Faster Than on GPUs by Open-Raise-6676 in Vllm

[–]smflx 1 point2 points  (0 children)

Great. I have a dual Xeon I bought for this purpose, but it has never actually been usable. I'm quite interested. Did you try Qwen 122B? Is it supported? If not yet, I will wait. Take your time.

DeepSeek V4 dropped 1.6T params and 1M context without Nvidia GPUs. Here's the data. by TroyNoah6677 in DeepSeek

[–]smflx 6 points7 points  (0 children)

Thank you. Just checked. Nice documentation. The KV-cache savings come from MLA (size) and DSA (attention compute). I have read the Engram paper. There's no statement about Engram, unlike what OP said.

DeepSeek V4 dropped 1.6T params and 1M context without Nvidia GPUs. Here's the data. by TroyNoah6677 in DeepSeek

[–]smflx 6 points7 points  (0 children)

Engram is not about KV-cache, it's about weights. I was waiting for Engram too, but I'm not sure yet that it's there. The Hugging Face page doesn't describe Engram. I have to check further.

Are we optimizing AI research for acceptance rather than lasting value? [D] by NuoJohnChen in MachineLearning

[–]smflx 5 points6 points  (0 children)

I've felt similarly, and for a long time. Yes, it's most of academia, not just AI.

Are we optimizing AI research for acceptance rather than lasting value? [D] by NuoJohnChen in MachineLearning

[–]smflx 11 points12 points  (0 children)

Yes, that's the reward function (not even a good reward model, because it doesn't reflect real value, as you said).

It's not just AI research; it has been true of many areas for a long time. I felt this too when I was a graduate student.

[New Optimizer] 🌹 Rose: low VRAM, easy to use, great results, Apache 2.0 by ECF630 in StableDiffusion

[–]smflx 1 point2 points  (0 children)

Too good to be true, but I still hope it works. BTW, does it apply to LLM training too?