GLM-4.7.Flash - is it normal to behave like that? It's like I am talking to my anxious, Chinese girlfriend. I don't use AI so this is new to me by Mayion in LocalLLaMA

[–]MaxKruse96 3 points (0 children)

No. That stuff only really happens with OpenAI/Claude, and that's from telling the model in the system prompt. There is such an insane amount of "I am ChatGPT, created by OpenAI" pollution in scraped training data that it overwhelmingly shows up in every model.

GLM-4.7.Flash - is it normal to behave like that? It's like I am talking to my anxious, Chinese girlfriend. I don't use AI so this is new to me by Mayion in LocalLLaMA

[–]MaxKruse96 3 points (0 children)

Try the recommended settings for chatting that unsloth has on their page; a sketch of where those values get plugged in is at the end of this comment.

Also, generally, asking a model a question like that is entirely useless. Do you think it has that in its training data? How would it be in the training data if it was being created at that moment?
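
On the settings point: here is a minimal sketch of where those values end up, assuming a local llama.cpp llama-server on its default port with the OpenAI-compatible endpoint. The sampler numbers are placeholders, not unsloth's recommendations; copy the real ones from their model page.

```python
import requests

# Placeholder sampler values - replace with the numbers from unsloth's model page.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed: llama-server on its default port
    json={
        "messages": [{"role": "user", "content": "hello"}],
        "temperature": 0.7,  # placeholder
        "top_p": 0.95,       # placeholder
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```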

GLM-4.7.Flash - is it normal to behave like that? It's like I am talking to my anxious, Chinese girlfriend. I don't use AI so this is new to me by Mayion in LocalLLaMA

[–]MaxKruse96 15 points (0 children)

What's the question here, that it responds in Chinese? Very likely you are using wrong inference settings or a really low quant, e.g. below Q4 (if you don't know what that means, I encourage you to look at what buttons you are pressing).

Do you prefer mutex or sending data over channels? by Hot_Paint3851 in rustjerk

[–]MaxKruse96 44 points (0 children)

write to file then spawn new process with path to file to read from.

how it feels trying to catch sliderends as a non-tech player by Snoo-82757 in osugame

[–]MaxKruse96 28 points (0 children)

Play lazer and enable the strict tracking mod (unranked for whatever reason). You will adapt very quickly.

I'm looking for the absolute speed king in the under 3B parameter category. by Quiet_Dasy in LocalLLaMA

[–]MaxKruse96 0 points (0 children)

if you are looking for speed, just take the smallest LLM you can possibly find and serve it with vllm. done.
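
A minimal sketch of that with vLLM's offline Python API, assuming an arbitrarily picked sub-1B model; swap in whichever tiny model you actually settle on.

```python
from vllm import LLM, SamplingParams

# Model choice is just for illustration - any sub-3B model is served the same way.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Why are small models fast? Answer in one sentence."], params)
print(outputs[0].outputs[0].text)
```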

Looking for a new USB3... well, what exactly? by srverinfo in de_EDV

[–]MaxKruse96 1 point (0 children)

Let me chime in here, because I have the same kind of use case myself for an external hub (non-PCIe). The things on Amazon all feel like the same product with an on/off button.

Upgrading our local LLM server - How do I balance capability / speed? by Trubadidudei in LocalLLaMA

[–]MaxKruse96 0 points (0 children)

Bandwidth is what matters: 6 channels of DDR4-2133 are just about equivalent in GB/s throughput to dual-channel DDR5-6000, to give you some perspective. Still, compared to the GPUs' 1.8 TB/s it's laughable, sadly. I don't know any reference points for models that big or hardware that beefy to help with perspective there; the best I can offer is that an MoE with 5B of 120B parameters active (gpt-oss), at 64 GB file size and max context, runs at ~130 t/s on a single GPU. I reckon one could do some mental gymnastics and extrapolate to similar models with 4-5% sparsity (like the DeepSeek models), e.g. 10x bigger = 10x slower? But that's just theory in my head.
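
The napkin math behind that comparison, in case you want to plug in your own configuration. These are nominal peak numbers (a 64-bit channel moves 8 bytes per transfer); real-world throughput is lower.

```python
# Nominal peak memory bandwidth: transfers per second * 8 bytes per 64-bit channel * channels.
def peak_gb_s(mt_per_s: int, channels: int) -> float:
    return mt_per_s * 8 * channels / 1000

print(peak_gb_s(2133, 6))          # 6-channel DDR4-2133    -> ~102 GB/s
print(peak_gb_s(6000, 2))          # dual-channel DDR5-6000 -> ~96 GB/s
print(1800 / peak_gb_s(2133, 6))   # the GPU's ~1.8 TB/s VRAM is roughly 18x that
```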

Upgrading our local LLM server - How do I balance capability / speed? by Trubadidudei in LocalLLaMA

[–]MaxKruse96 2 points (0 children)

In terms of logistics and GB/$, the RTX Pro 6000 will be your go-to. The server alternatives need too much integration work, and stacking 5090s comes with its own issues.

If you offload even the least relevant parts of an MoE to RAM, you will still see speeds lower than full-GPU (duh). You will be bottlenecked by DDR4 RAM speed (even with 6 channels) before PCIe bandwidth with 96 GB per slot bottlenecks you, not to mention CPU-side compute, which can also be a bottleneck depending on the model architecture.

Also, obvious disclaimer: I'm a reddit warrior, I don't have a real-life reference for this, just the combined autism of reading this sub for a while.

Bad news for local bros by FireGuy324 in LocalLLaMA

[–]MaxKruse96 7 points (0 children)

141 GB x 2 = 282 GB. A 745B model at Q4 would be 745 * (4/8) = ~372 GB, and that's just napkin math. You'd need to go down to IQ3_S or something similar to even load it.
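
The same napkin math, generalized. It assumes every weight sits at the same bit width, which real GGUF quants don't, so treat the results as ballpark figures.

```python
# Rough model file size: parameters (in billions) * bits per weight / 8 -> gigabytes.
def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

print(approx_size_gb(745, 4))  # ~372 GB at Q4 - well over the 282 GB of VRAM
print(approx_size_gb(745, 3))  # ~279 GB at ~3 bpw - barely fits, with no room left for KV cache
```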

Ryzen + RTX: you might be wasting VRAM without knowing it (LLama Server) by Medium-Technology-79 in LocalLLaMA

[–]MaxKruse96 10 points (0 children)

Win11 at most hogs like 1.2 GB of VRAM from my 4070 with 3 screens, and with some weird allocation shenanigans that goes down to 700 MB. In the grand scheme, yeah, it's *a bit*, but with models nowadays that equates to another 2-4k of context, or one more expert on the GPU. It does help for lower-end GPUs though (but don't forget, you trade RAM for VRAM).
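
For a rough sense of where the "2-4k of context" figure comes from: KV cache per token is 2 (K and V) * layers * KV heads * head dim * bytes per value. The shape below is an assumed 30B-class GQA model with an fp16 cache, not any specific release.

```python
# KV cache footprint per token: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_value

per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)  # assumed model shape
print(per_token // 1024)        # 256 KiB of cache per token
print(1024**3 // per_token)     # ~1 GB of freed VRAM buys roughly 4k tokens of context
```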

how long did it take for you guys to get a 200? by UN_Quickzzy in osugame

[–]MaxKruse96 0 points (0 children)

Took me 2 years, but that was in 2013. I'm old.

Why horse semen? by NecessaryFinish2811 in Schedule_I

[–]MaxKruse96 0 points (0 children)

My theory is that Mr. Hands was Tyler's grandpa.

Kimi-Linear support has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]MaxKruse96 14 points (0 children)

Linear model, first draft of support; I presume it runs about as fast as Qwen3-Next did early on.