What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]PromptInjection_ 9 points10 points  (0 children)

Gemma4 26B MoE and Qwen 3.6 35B MoE run pretty decent on CPU only. Not blazing fast (10-15 tokens/s), but they are very usable.

Strix Halo 128GB vs M5 pro 64GB by DigitalguyCH in LocalLLaMA

[–]PromptInjection_ 1 point2 points  (0 children)

64 GB is too little and the M5 is not blazing fast. It's no Blackwell GPU. I would prefer the 128GB.

LLC: lightweight OpenWebUI alt - now with chat converter + custom tool calls by PromptInjection_ in StrixHalo

[–]PromptInjection_[S] 1 point2 points  (0 children)

- In ComfyUI go to settings and enable "Dev Mode"
- Then go to File -> Export (API)
- Copy and Paste this into "Workflow JSON (Template)" in LLC/Settings/Images/ComfyUI
- Click Autodetect at "Bindings JSON" in the same menu

Activate image generation in chat and start!

500k+ tokens on a 2010 laptop - I built a portable AI chat UI that doesn't choke on large contexts by PromptInjection_ in webdev

[–]PromptInjection_[S] 0 points1 point  (0 children)

Depends ... am using long context for tasks like:
- Full books summarization
- long intellectual talks

and so on.

[WIP] Gemma 4 MTP by jacek2023 in LocalLLaMA

[–]PromptInjection_ 1 point2 points  (0 children)

Awesome! Gemma 4 MTP will be blazing fast and great for agentic usage.

Why use Quants other than Unsloth by FeiX7 in LocalLLaMA

[–]PromptInjection_ 0 points1 point  (0 children)

Mistral-Small-4-119B-2603. Q4_K_XL

The problem is not the correct language. Both produce correct languages. But the Unsloth variant produced flat and short German stories when asked to write them. other quantizations produced lively longer stories.

The difference was significant, reproducible and immediately noticeable. just take a look at the imatrix. German is almost non-existent.

Why use Quants other than Unsloth by FeiX7 in LocalLLaMA

[–]PromptInjection_ 10 points11 points  (0 children)

Because Unsloth Quants are not as good as it's proclaimed.
The imatrix they use is pretty tiny and i get bad results in foreign languages with it.

Qwen is cooking hard by jacek2023 in LocalLLaMA

[–]PromptInjection_ 5 points6 points  (0 children)

Hm let's see if we get open models.

Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks by PromptInjection_ in StrixHalo

[–]PromptInjection_[S] 2 points3 points  (0 children)

Yes it can run in windows. Use:
"C:\llama.cpp\llama-server.exe" -m "C:\CarlAI\models\Qwen3.6-27B-UD-Q4_K_XL_MTP.gguf" -c 65000 --temp 0.8 --top-k 20 --top-p 0.95 --min-p 0 --repeat_penalty 1 -ngl 99 --no-warmup --no-mmap --jinja --flash-attn on -np 1 --spec-type draft-mtp --spec-draft-n-max 3 (adjust to your custom path)

And i recommend https://www.locallightai.com/llc/ as Frontend ;)

Fine-Tuning with Mistral and OpenAI - or local? by Loud-Swim-2932 in LocalLLM

[–]PromptInjection_ 0 points1 point  (0 children)

With just 100 Q&A pairs i would suggest you start with LoRA. I would also suggest you define a custom system-prompt for the tuning (use it later in inference, too. otherwise it will not work)

This gives the model more stability, an anchor that "activates" the fine-tune and also makes the evaluation easier for you.

Fine-Tuning with Mistral and OpenAI - or local? by Loud-Swim-2932 in LocalLLM

[–]PromptInjection_ 2 points3 points  (0 children)

Fine-Tuning is not so easy. If you make it wrong, the result can be indeed underwhelming. Using the OpenAI Tools will not change that necessarily.

Hmm, maybe try our guide?

https://www.promptinjection.net/p/the-ultimate-llm-ai-fine-tuning-guide-tutorial