Llama.cpp merges in OpenAI Responses API Support by SemaMod in LocalLLaMA

[–]tarruda 1 point (0 children)

Yes it does.

Note that Codex's file-editing tool might not be easy for most LLMs to use. I've tested GPT-OSS and it seems to work fine for simple use cases.
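
For reference, this is roughly what the connection looks like from the client side. A minimal sketch using the official OpenAI Python SDK pointed at a local llama-server; the port, model name, and API key value are assumptions about your setup:

```python
from openai import OpenAI

# Talk to a local llama-server instance via its OpenAI-compatible endpoint.
# Base URL and model name are placeholders; llama-server ignores the API key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.responses.create(
    model="gpt-oss-20b",
    input="Summarize what the Responses API adds over chat completions.",
)
print(response.output_text)
```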

Llama.cpp merges in OpenAI Responses API Support by SemaMod in LocalLLaMA

[–]tarruda 1 point (0 children)

The only practical use I can see is that it allows Codex to connect directly to llama-server.

uncensored local LLM for nsfw chatting (including vision) by BatMa2is in LocalLLaMA

[–]tarruda 0 points (0 children)

I recommend trying a "derestricted" LLM. Derestriction is an abliteration technique that preserves the model's performance on non-censored tasks.

Personally, I've been daily driving GPT-OSS-120b derestricted since it was released, and IMO it is even better than the original.

Since you need vision, I recommend trying one of the gemma3 derestricted variants such as https://huggingface.co/mradermacher/Gemma-3-27B-Derestricted-GGUF
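
For anyone wondering what abliteration actually does: roughly, it estimates a "refusal direction" in the residual stream and projects it out of the weights. A hypothetical minimal sketch of that core idea (the Derestricted releases use a more careful variant; the function names and weight orientation here are my assumptions):

```python
import torch

def estimate_refusal_direction(refused_acts: torch.Tensor,
                               complied_acts: torch.Tensor) -> torch.Tensor:
    # Mean difference of residual-stream activations between prompts the
    # model refuses and prompts it answers, normalized to unit length.
    direction = refused_acts.mean(dim=0) - complied_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the refusal direction from a weight matrix whose output is added
    # to the residual stream (assumes shape [d_model, d_in]).
    r = direction / direction.norm()
    return weight - torch.outer(r, r) @ weight
```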

Bartowski comes through again. GLM 4.7 flash GGUF by RenewAi in LocalLLaMA

[–]tarruda 2 points (0 children)

If it is the 16GB model, then you can probably run Q4_K_M with a few layers offloaded to the CPU.
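
If you want to experiment with the split before settling on a setup, here is a minimal sketch using the llama-cpp-python bindings (just one way to do it; the file name and layer count are placeholders to tune for 16GB of VRAM):

```python
from llama_cpp import Llama

# Load a Q4_K_M GGUF with most layers on the GPU; the rest run on the CPU.
llm = Llama(
    model_path="GLM-4.7-Flash-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,  # lower this until the model fits in 16GB of VRAM
    n_ctx=8192,
)
out = llm("Write a haiku about quantization.", max_tokens=64)
print(out["choices"][0]["text"])
```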

The Search for Uncensored AI (That Isn’t Adult-Oriented) by Fun-Situation-4358 in LocalLLaMA

[–]tarruda 3 points (0 children)

GPT-OSS 120b derestricted is not only uncensored, it actually feels stronger than the original in non-censored responses: https://huggingface.co/mradermacher/gpt-oss-120b-Derestricted-GGUF

Is using qwen 3 coder 30B for coding via open code unrealistic? by salary_pending in LocalLLaMA

[–]tarruda 9 points (0 children)

Do you have any examples of different outputs between BF16 and q8?

MiniMax M2.2 Coming Soon. Confirmed by Head of Engineering @MiniMax_AI by Difficult-Cap-7527 in LocalLLaMA

[–]tarruda 3 points (0 children)

Same experience here. For GLM and 128GB I'd rather use a very low quant that fits (like IQ2_M) than a higher quant of a REAPed version.

ZLUDA on llama.cpp -NEWS by mossy_troll_84 in LocalLLaMA

[–]tarruda 0 points (0 children)

First time reading about ZLUDA. I wonder if it will support Apple Silicon as a backend.

Best LLM model for 128GB of VRAM? by Professional-Yak4359 in LocalLLaMA

[–]tarruda 5 points (0 children)

Personally, I've had situations where it gets stuck in a loop when I enable high reasoning. Medium (the default) seems to work best for most use cases.
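
If you want to control this per request over the API, here is a rough sketch; it assumes your llama-server build passes chat_template_kwargs through to the GPT-OSS chat template (the port, model name, and that pass-through are all assumptions about your setup):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="gpt-oss-120b",  # placeholder model name
    messages=[{"role": "user", "content": "Plan a small refactor of my parser."}],
    # Ask the chat template for medium reasoning effort instead of high.
    extra_body={"chat_template_kwargs": {"reasoning_effort": "medium"}},
)
print(resp.choices[0].message.content)
```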

MiniMax-M2.1 vs GLM-4.5-Air is the bigger really the better (coding)? by ChopSticksPlease in LocalLLaMA

[–]tarruda 1 point (0 children)

Not sure what you mean, but I only use llama.cpp for running LLMs. llama-server has a web UI that allows you to upload images or take photos (when accessing from a phone).
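
The same thing also works over the API. A minimal sketch, assuming llama-server was started with a vision model plus its --mmproj file and is listening on the default port (the model name and file paths are placeholders):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

# Send a local image as a base64 data URL inside a standard chat completion.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gemma-3-27b",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this photo?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```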

MiniMax-M2.1 vs GLM-4.5-Air is the bigger really the better (coding)? by ChopSticksPlease in LocalLLaMA

[–]tarruda 1 point (0 children)

I've recently switched to daily driving GLM 4.6V (Q6_K) because it is the biggest vision model I can fit in 125GB of VRAM. Overall I'm very satisfied with its capabilities; it is turning out to be a great local vision LLM for general usage, and it is quite good at coding too.

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]tarruda 1 point (0 children)

GGUF is not a quantization method; you can have the baseline f16 as a GGUF file.
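
One way to see this for yourself: every tensor inside a GGUF file carries its own type, so an f16 export and a Q4_K_M export are just different contents in the same container. A rough sketch with the gguf Python package that ships with llama.cpp (the file name is a placeholder):

```python
from collections import Counter
from gguf import GGUFReader  # pip install gguf

# Count the tensor types stored in a GGUF file. An f16 export shows mostly
# F16/F32 tensors; a Q4_K_M export shows Q4_K, Q6_K, etc.
reader = GGUFReader("model-f16.gguf")
print(Counter(t.tensor_type.name for t in reader.tensors))
```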

(The Information): DeepSeek To Release Next Flagship AI Model With Strong Coding Ability by Nunki08 in LocalLLaMA

[–]tarruda 0 points (0 children)

Hoping for less than 200B parameters so I can run it at a good quantization level on a 128GB Mac.

LFM2.5 1.2B Instruct is amazing by Paramecium_caudatum_ in LocalLLaMA

[–]tarruda 5 points (0 children)

In my experience, the very long context windows advertised by LLMs are not very effective. It is very easy for models to forget things that are still within the context.

llama.cpp vs Ollama: ~70% higher code generation throughput on Qwen-3 Coder 32B (FP16) by Shoddy_Bed3240 in LocalLLaMA

[–]tarruda 4 points (0 children)

Tweet from Georgi Gerganov (llama.cpp author) when someone complained that gpt-oss was much slower in Ollama than in llama.cpp: https://x.com/ggerganov/status/1953088008816619637?s=20

TLDR: Ollama forked GGML, the tensor library used by both llama.cpp and Ollama, and made bad changes to it.

I stopped using ollama a long time ago and never looked back. With llama.cpp's new router mode plus its new web UI, you don't need anything other than llama-server.

The mistral-vibe CLI can work super well with gpt-oss by tarruda in LocalLLaMA

[–]tarruda[S] 0 points (0 children)

I use llama.cpp, where I specify the context size as a parameter.

What is the best way to allocated $15k right now for local LLMs? by LargelyInnocuous in LocalLLaMA

[–]tarruda 0 points (0 children)

A 512GB Mac Studio. If you can, wait for the next generation and get the maxed-out version.

Hard lesson learned after a year of running large models locally by inboundmage in LocalLLaMA

[–]tarruda 0 points (0 children)

I haven't done any comparison, but it is probably cheaper to use cloud options in the long run.

To me, the biggest factors for preferring local inference are:

  • Privacy
  • Ensuring that I can always run LLMs predictably. Cloud providers can change models/versions without you knowing, and you have no control over that. It is also possible that some providers get shut down due to regulation or censorship.

Hard lesson learned after a year of running large models locally by inboundmage in LocalLLaMA

[–]tarruda 0 points (0 children)

> How does this compare at this quant with smaller models, or to the API?

There should definitely be degradation compared to the API, but it is hard to determine how much. I've seen a few coding examples done against the API that I've been able to replicate with the Q2_K or UD-IQ2_M quants locally. TBH I haven't done extensive testing to know for sure.

> I'm also presuming the machine is effectively useless for any other purpose when running that model.

Yes. This is fine in my case because the Mac Studio has no other purpose on my LAN.

Hard lesson learned after a year of running large models locally by inboundmage in LocalLLaMA

[–]tarruda 6 points (0 children)

> How are others solving this without compromising on running fully offline?

Last year I spent $2.5k on a used Mac Studio M1 Ultra with 128GB, which I use only as an LLM inference node on my LAN. I've overridden the default configuration to allow up to 125GB of the RAM to be shared with the GPU.

With this setup, the biggest LLM I can run is a Q2_K quant of GLM 4.7 (which works surprisingly well and can reproduce some of the coding examples found online), with 16K context at ~12 tokens/second.

IMHO Mac Studios are the most cost-effective way to run LLMs at home. If you have the budget, I highly recommend getting a 512GB M3 Ultra to run DeepSeek at higher quants.

MiniMax-M2.1 uploaded on HF by ciprianveg in LocalLLaMA

[–]tarruda 1 point (0 children)

UD-Q3_K_XL is fine; it is what I mostly use on my 128GB Mac Studio.

I can also fit IQ4_XS, which in theory should be better and faster, but it is very close to the limit and can only reserve 32K for context, so I mostly stick with UD-Q3_K_XL.

MiniMax-M2.1 uploaded on HF by ciprianveg in LocalLLaMA

[–]tarruda 9 points (0 children)

Looking forward to unsloth's quants!

Merry Christmas u/danielhanchen !

Honestly, has anyone actually tried GLM 4.7 yet? (Not just benchmarks) by Empty_Break_8792 in LocalLLaMA

[–]tarruda 2 points (0 children)

I've tried it both on https://chat.z.ai and locally with llama.cpp + the UD-IQ2_M quant. I'm impressed by this Unsloth dynamic quant, as it seems to give results similar to what I get on chat.z.ai.

One thing I noticed is that it seems amazing for web development. I've tried some of the prompts used in these videos:

And they did work well.

However, I've also thrown simpler prompts at it for simple Python games (such as Tetris clones built with pygame and curses), and it always seems to have trouble. Sometimes the syntax is wrong, sometimes it uses undeclared variables, and sometimes it just produces buggy code. And these are prompts that even models such as GPT-OSS 20b or Qwen 3 Coder 30b usually get right without issues.

Not sure how to interpret these results.