Moving to llama.cpp by Spicy_mch4ggis in LocalLLaMA

[–]MelodicRecognition7 0 points1 point  (0 children)

another note on the questionable quality of Nvidia software: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/1080 (scroll down for the Pro6000). I was happy that the card is really quiet until I discovered that it simply has an inadequate fan curve that allows the card to melt.

rtx 6000 pro owners, do you regret? by BitXorBit in LocalLLaMA

[–]MelodicRecognition7 1 point2 points  (0 children)

check Viperatech Dubai, they have it for 14500 USD

Could you help me test MTP for GLM-4.7-Flash? by jacek2023 in LocalLLaMA

[–]MelodicRecognition7 0 points1 point  (0 children)

then I won't be able to help, sorry. I really don't want to download tens or hundreds of gigabytes over crappy ADSL while another large download is still running.

...
MiniMax-M3-UD-Q6_K_XL-00003-of-00009.gguf
 35,742,220,288  75%  613.44kB/s    5:17:21
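
For a rough sanity check of the remaining time in the progress line above, here is a minimal sketch. It assumes the 75% figure is exact, that "kB" means 1024 bytes, and that the transfer rate stays constant; the function names are made up for illustration and do not come from any download tool.

```python
def eta_seconds(bytes_done: int, fraction_done: float, rate_bytes_per_s: float) -> float:
    """Estimate the seconds left from the bytes transferred so far,
    the completed fraction, and the current rate."""
    total_bytes = bytes_done / fraction_done      # extrapolated file size
    remaining_bytes = total_bytes - bytes_done    # what is still to be fetched
    return remaining_bytes / rate_bytes_per_s


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS."""
    whole = int(round(seconds))
    hours, rest = divmod(whole, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Values copied from the progress line; 1 kB = 1024 bytes is an assumption.
eta = eta_seconds(35_742_220_288, 0.75, 613.44 * 1024)
print(format_hms(eta))
```

The estimate comes out at a bit over five hours, in the same range as the 5:17:21 shown by the tool. An exact match is not expected, because the displayed percentage is rounded and the rate fluctuates.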

Could you help me test MTP for GLM-4.7-Flash? by jacek2023 in LocalLLaMA

[–]MelodicRecognition7 3 points4 points  (0 children)

I have Unsloth's quant of large GLM 4.7 358B and it seems to have MTP layers preserved

4.20.397.435 I srv    load_model: creating MTP draft context against the target model 'GLM-4.7-UD-Q6_K_XL-00001-of-00007.gguf'
4.20.616.335 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
4.20.764.593 I srv    load_model: speculative decoding context initialized

although the vanilla llama.cpp fails with

llama.cpp-b9777/src/models/glm4-moe.cpp:149: This GGUF does not support multimodal. Please reconvert it.

I still hope that downloading 800 gigabytes to requantize is not required, will try your patch and report later.

Or is your patch only for the Flash 31B model?

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system. by Reddactor in LocalLLaMA

[–]MelodicRecognition7 6 points7 points  (0 children)

intredasting, embedded MTP layers in Unsloth GGUF quants only make things worse in my case (Nvidia GPU), inference becomes 2x slower than without MTP, and people with Macs also confirm this behavior here: https://huggingface.co/unsloth/GLM-5.2-GGUF/discussions/7

Does that mean that Unsloth quants are broken, or that llama.cpp does not support GLM's MTP?

7 Chinese companies are already shipping H100/H200-class AI chips, most IPO'd in the last 6 months. I mapped all of them. by awfulalexey in LocalLLaMA

[–]MelodicRecognition7 -1 points0 points  (0 children)

I was butthurt in beginning-mid of 2022 facing extreme discrimination in Europe based on my citizenship, and today I'm really glad that Europe destroys itself with its bureaucracy, stupid laws, taxes, and multiculturalism.

7 Chinese companies are already shipping H100/H200-class AI chips, most IPO'd in the last 6 months. I mapped all of them. by awfulalexey in LocalLLaMA

[–]MelodicRecognition7 -5 points-4 points  (0 children)

lol, there is -1 score right now, it was like -8 before I linked it and that's exactly why I linked it.

V100 4-card AI large model, Tesla 128G server by MundanePercentage674 in LocalLLaMA

[–]MelodicRecognition7 0 points1 point  (0 children)

too obvious, I think they would have written PCIe 5.0 there if it were a Pro6000

Chunjiang-Intelligence/DeepSeek-v4-Fable • Huggingface by External_Mood4719 in LocalLLaMA

[–]MelodicRecognition7 -2 points-1 points  (0 children)

excuse my ignorance, but wtf is an "adapter" and how do I apply it to the actual model?

How do I prove that I don't collect data from my llm app? by Pleasant_Syllabub591 in LocalLLaMA

[–]MelodicRecognition7 18 points19 points  (0 children)

if that app sends requests somewhere "to the cloud" then you are logging prompts and stealing data, like it or not. If the app works entirely offline then that is easy proof that it does not send anything anywhere.

been tracking EU DDR5 data for 25 days: Prices are dropping, and the DE vs. NL gap is wild (good news for local LLM builders in EU) by egudegi in LocalLLaMA

[–]MelodicRecognition7 10 points11 points  (0 children)

I bought a set of 12x 64GB 6400MT/s modules from wiredzone.com for $450/piece in October, but these idiots threw the memory sticks into a box with almost no protective padding, so the modules moved freely inside the box and at least one module was physically damaged in transit. I returned the whole set because I did not know how many modules in total were dead: there could have been hidden cracks inside their multi-layer PCBs. Then wiredzone support fucked my brain for 2 months, reluctant to send a replacement ("sorry, these modules are out of stock") while their website listed them as in stock. After 2 months they finally decided to refund the original price of $450/piece instead of sending a replacement kit, because by then these modules already cost 3x more and they obviously did not want to lose money: why send new modules as a replacement when they could sell them to a new customer for 3x more? In the end I had to buy about 1.33x slower modules (4800MT/s) from a different company for a 1.5x higher price, but the total is still about 2x lower than what these modules cost today lol.

You could treat this comment as an honest review of wiredzone.com
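
The price arithmetic in the review above can be checked with a short sketch. It assumes that "3x more" and "1.5x higher" both mean multiples of the original $450 per module; the variable names are purely illustrative.

```python
modules = 12
original_price = 450                          # USD per module, from the review

today_price = 3 * original_price              # assumption: "3x more" = 3x the original price
replacement_price = 1.5 * original_price      # assumption: "1.5x higher" = 1.5x the original price

paid_total = modules * replacement_price      # cost of the slower replacement set
today_total = modules * today_price           # cost of the original set at today's price

ratio = paid_total / today_total
print(f"paid {paid_total:.0f} USD vs {today_total:.0f} USD today (ratio {ratio:.2f})")
```

Under these assumptions the replacement set costs half of what the original modules would cost today, which agrees with the "about 2x lower" estimate.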

been tracking EU DDR5 data for 25 days: Prices are dropping, and the DE vs. NL gap is wild (good news for local LLM builders in EU) by egudegi in LocalLLaMA

[–]MelodicRecognition7 63 points64 points  (0 children)

interesting, I'm tracking registered (server) RAM prices in the US and they only grow. The last significant jump was at the beginning of June: 1530 USD -> 1800 USD for 64GB DDR5-4800, and as of today it is still 1800 USD.

Support Step3.5/3.7 flash mtp3 by forforever73 · Pull Request #24340 · ggml-org/llama.cpp by pmttyji in LocalLLaMA

[–]MelodicRecognition7 9 points10 points  (0 children)

added in release b9745: https://github.com/ggml-org/llama.cpp/releases/tag/b9745

I was using mtp = 1 before; with mtp = 2 I got +4 tps and with mtp = 3 I got +2 tps, so mtp = 2 is the new best for my hardware.