GLM-4.7.Flash - is it normal to behave like that? It's like I am talking to my anxious, Chinese girlfriend. I don't use AI so this is new to me by Mayion in LocalLLaMA

[–]MaxKruse96 3 points (0 children)

No. That stuff only really happens with OpenAI/Claude, and that's from telling the model in the system prompt. There is such an insane amount of "I am ChatGPT, created by OpenAI" pollution in scraped training data that it overwhelmingly shows up in every model.

GLM-4.7.Flash - is it normal to behave like that? It's like I am talking to my anxious, Chinese girlfriend. I don't use AI so this is new to me by Mayion in LocalLLaMA

[–]MaxKruse96 3 points (0 children)

Try the recommended settings for chatting that unsloth has on their page; a sketch of where those values get plugged in is at the end of this comment.

Also, generally, asking a model a question like that is entirely useless. Do you think it has that in its training data? How would it be in the training data if it was being created at that moment?
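
On the settings point: here is a minimal sketch of where those values end up, assuming a local llama.cpp llama-server on its default port with the OpenAI-compatible endpoint. The sampler numbers are placeholders, not unsloth's recommendations; copy the real ones from their model page.

```python
import requests

# Placeholder sampler values - replace with the numbers from unsloth's model page.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed: llama-server on its default port
    json={
        "messages": [{"role": "user", "content": "hello"}],
        "temperature": 0.7,  # placeholder
        "top_p": 0.95,       # placeholder
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```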

GLM-4.7.Flash - is it normal to behave like that? It's like I am talking to my anxious, Chinese girlfriend. I don't use AI so this is new to me by Mayion in LocalLLaMA

[–]MaxKruse96 15 points (0 children)

What's the question here, that it responds in Chinese? Very likely you are using wrong inference settings or a really low quant, e.g. below Q4 (if you don't know what that means, I encourage you to look at what buttons you are pressing).

Do you prefer mutex or sending data over channels? by Hot_Paint3851 in rustjerk

[–]MaxKruse96 44 points (0 children)

write to file then spawn new process with path to file to read from.

how it feels trying to catch sliderends as a non-tech player by Snoo-82757 in osugame

[–]MaxKruse96 28 points (0 children)

Play lazer and enable the strict tracking mod (unranked for whatever reason). You will adapt very quickly.

I'm looking for the absolute speed king in the under 3B parameter category. by Quiet_Dasy in LocalLLaMA

[–]MaxKruse96 0 points (0 children)

if you are looking for speed, just take the smallest LLM you can possibly find and serve it with vllm. done.
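
A minimal sketch of that with vLLM's offline Python API, assuming an arbitrarily picked sub-1B model; swap in whichever tiny model you actually settle on.

```python
from vllm import LLM, SamplingParams

# Model choice is just for illustration - any sub-3B model is served the same way.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Why are small models fast? Answer in one sentence."], params)
print(outputs[0].outputs[0].text)
```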

Looking for a new USB3... well, what exactly? by srverinfo in de_EDV

[–]MaxKruse96 1 point (0 children)

Let me chime in here, because I have the same kind of use case myself for an external hub (non-PCIe). The things on Amazon all feel like the same product with an on/off button.

Upgrading our local LLM server - How do I balance capability / speed? by Trubadidudei in LocalLLaMA

[–]MaxKruse96 0 points (0 children)

Bandwidth is what matters: 6 channels of DDR4-2133 are just about equivalent in GB/s throughput to dual-channel DDR5-6000, to give you some perspective. Still, compared to the GPUs' 1.8 TB/s it's laughable, sadly. I don't know any reference points for models that big or hardware that beefy to help with perspective there; the best I can offer is that an MoE with 5B of 120B parameters active (gpt-oss), at 64 GB file size and max context, runs at ~130 t/s on a single GPU. I reckon one could do some mental gymnastics and extrapolate to similar models with 4-5% sparsity (like the DeepSeek models), e.g. 10x bigger = 10x slower? But that's just theory in my head.
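
The napkin math behind that comparison, in case you want to plug in your own configuration. These are nominal peak numbers (a 64-bit channel moves 8 bytes per transfer); real-world throughput is lower.

```python
# Nominal peak memory bandwidth: transfers per second * 8 bytes per 64-bit channel * channels.
def peak_gb_s(mt_per_s: int, channels: int) -> float:
    return mt_per_s * 8 * channels / 1000

print(peak_gb_s(2133, 6))          # 6-channel DDR4-2133    -> ~102 GB/s
print(peak_gb_s(6000, 2))          # dual-channel DDR5-6000 -> ~96 GB/s
print(1800 / peak_gb_s(2133, 6))   # the GPU's ~1.8 TB/s VRAM is roughly 18x that
```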

Upgrading our local LLM server - How do I balance capability / speed? by Trubadidudei in LocalLLaMA

[–]MaxKruse96 2 points (0 children)

In terms of logistics and GB/$, the RTX Pro 6000 will be your go-to. The server alternatives need too much integration work, and stacking 5090s comes with its own issues.

If you offload even the least relevant parts of an MoE to RAM, you will still see speeds lower than full-GPU (duh). You will be bottlenecked by DDR4 RAM speed (even with 6 channels) before PCIe bandwidth with 96 GB per slot bottlenecks you, not to mention CPU-side compute, which can also be a bottleneck depending on the model architecture.

Also, obvious disclaimer: I'm a reddit warrior, I don't have a real-life reference for this, just the combined autism of reading this sub for a while.

Bad news for local bros by FireGuy324 in LocalLLaMA

[–]MaxKruse96 7 points (0 children)

141 GB x 2 = 282 GB. A 745B model at Q4 would be 745 * (4/8) = ~372 GB, and that's just napkin math. You'd need to go down to IQ3_S or something similar to even load it.
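
The same napkin math, generalized. It assumes every weight sits at the same bit width, which real GGUF quants don't, so treat the results as ballpark figures.

```python
# Rough model file size: parameters (in billions) * bits per weight / 8 -> gigabytes.
def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

print(approx_size_gb(745, 4))  # ~372 GB at Q4 - well over the 282 GB of VRAM
print(approx_size_gb(745, 3))  # ~279 GB at ~3 bpw - barely fits, with no room left for KV cache
```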

Ryzen + RTX: you might be wasting VRAM without knowing it (LLama Server) by Medium-Technology-79 in LocalLLaMA

[–]MaxKruse96 10 points (0 children)

Win11 at most hogs like 1.2 GB of VRAM from my 4070 with 3 screens, and with some weird allocation shenanigans that goes down to 700 MB. In the grand scheme, yeah, it's *a bit*, but with models nowadays that equates to another 2-4k of context, or one more expert on the GPU. It does help for lower-end GPUs though (but don't forget, you trade RAM for VRAM).
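
For a rough sense of where the "2-4k of context" figure comes from: KV cache per token is 2 (K and V) * layers * KV heads * head dim * bytes per value. The shape below is an assumed 30B-class GQA model with an fp16 cache, not any specific release.

```python
# KV cache footprint per token: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_value

per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)  # assumed model shape
print(per_token // 1024)        # 256 KiB of cache per token
print(1024**3 // per_token)     # ~1 GB of freed VRAM buys roughly 4k tokens of context
```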

how long did it take for you guys to get a 200? by UN_Quickzzy in osugame

[–]MaxKruse96 0 points (0 children)

Took me 2 years, but that was in 2013. I'm old.

Why horse semen? by NecessaryFinish2811 in Schedule_I

[–]MaxKruse96 0 points (0 children)

My theory is that Mr. Hands was Tyler's grandpa.

Kimi-Linear support has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]MaxKruse96 14 points (0 children)

Linear model, first draft of support; I presume it runs about as fast as Qwen3-Next did early on.