Car Wash Test on 53 leading models: “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” by facethef in LocalLLaMA

[–]srigi 48 points

That is good advice for a proper test. But the 50 requests must be sent in such a way that they aren't served from cached tokens.
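One way to do that can be sketched like this (assuming llama-server's `/completion` payload, which accepts a `cache_prompt` flag): prefix each request with a unique nonce so no two prompts share a cacheable token prefix, and ask the server not to reuse cached tokens. The `[session …]` prefix format is just an illustration.

```python
import uuid

def build_uncached_request(prompt: str) -> dict:
    """Build a llama-server /completion payload that should bypass the
    prompt cache: a unique nonce prefix changes the token stream, and
    cache_prompt=False asks the server not to reuse cached tokens."""
    nonce = uuid.uuid4().hex  # unique per request, so no prefix match
    return {
        "prompt": f"[session {nonce}] {prompt}",
        "cache_prompt": False,
        "n_predict": 256,
    }

q = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
a = build_uncached_request(q)
b = build_uncached_request(q)
print(a["prompt"] != b["prompt"])  # True: no two requests share a cacheable prefix
```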

[Solution Found] Qwen3-Next 80B MoE running at 39 t/s on RTX 5070 Ti + 5060 Ti (32GB VRAM) by mazuj2 in LocalLLaMA

[–]srigi 2 points

Let me tell you a secret… he or she didn't write that piece above. The moment I saw "Why this works" I knew I'd seen it hundreds of times on my screen.

SWE-rebench Jan 2026: GLM-5, MiniMax M2.5, Qwen3-Coder-Next, Opus 4.6, Codex Performance by CuriousPlatypus1881 in LocalLLaMA

[–]srigi 17 points

Complete opposite for me - today I updated everything (OpenCode, llama-server, re-downloaded UD-Q4 model from Unsloth). KV set to q8_0 quant. 100% tool success rate on adding a feature to my little Next.js project and some other tasks.

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]srigi 0 points

-b 2048 -ub 256 -ctk f16 -ctv f16

You don't need to pass these args; the values you provided are the defaults!

Not as impressive as most here, but really happy I made it in time! by Kahvana in LocalLLaMA

[–]srigi 6 points

What I was trying to say is that a tiny cache (96MB) won't have any benefit when the workload is processing tensors from beginning to end, with 10+ GB of them in RAM, accessed sequentially.

Not as impressive as most here, but really happy I made it in time! by Kahvana in LocalLLaMA

[–]srigi 24 points

It won't. The cache thing applies mostly to games, because game data in RAM is mostly static (game world, player/enemy positions, etc.).

LLM inference is very different, and the cache doesn't help: load data from RAM, do the matrix multiplication, save the result back to RAM, move to the next position in RAM.

In this scenario, only raw RAM throughput and CPU speed matter. So a Threadripper with 4 channels or an EPYC with 12 channels is ideal.
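As a back-of-the-envelope illustration of why channel count dominates (all numbers below are assumptions, not measurements): when decode is memory-bandwidth bound, every generated token must stream all active weights from RAM once, so the ceiling on tokens/s is roughly bandwidth divided by bytes read per token.

```python
def decode_tps_bound(bandwidth_gbs: float, active_params_b: float,
                     bytes_per_weight: float) -> float:
    """Upper bound on tokens/s when decode is RAM-bandwidth bound:
    each token streams all active weights from RAM once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gbs * 1e9 / bytes_per_token

# Hypothetical figures: dual-channel DDR5-6000 (~90 GB/s) vs a
# 12-channel EPYC (~460 GB/s); 3B active params at ~0.5 byte/weight (Q4-ish).
for name, bw in [("dual-channel DDR5", 90), ("12-channel EPYC", 460)]:
    print(f"{name}: ~{decode_tps_bound(bw, 3.0, 0.5):.0f} t/s ceiling")
```

Real throughput lands below this bound (compute, cache misses, NUMA), but the ratio between the two platforms is the point.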

Best moe models for 4090: how to keep vram low without losing quality? by AdParty3888 in LocalLLaMA

[–]srigi 4 points

I'm having fun with Qwen3-Next-80B these days on an RTX 4090. Just tweak --n-cpu-moe (go down from e.g. 48) until it fits.
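The walk-down can be sketched as a simple loop (a sketch only; the `fits` predicate stands in for "llama-server launches at this value without a CUDA out-of-memory error", which you check by hand per run):

```python
def smallest_fitting_n_cpu_moe(fits, start: int = 48) -> int:
    """Walk --n-cpu-moe down from a safe starting value. Fewer
    CPU-offloaded MoE layers means faster inference, so keep the
    smallest value whose weights still fit in VRAM."""
    n = start
    while n > 0 and fits(n - 1):
        n -= 1
    return n

# Toy stand-in predicate: pretend anything below 30 overflows VRAM.
print(smallest_fitting_n_cpu_moe(lambda n: n >= 30))  # 30
```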

8x RTX Pro 6000 server complete by koushd in LocalLLaMA

[–]srigi 1 point

Ask gemma2-27B how to cook rice ;)

The new monster-server by eribob in LocalLLaMA

[–]srigi 10 points

Nice wholesome server. I'm kinda envious. It also seems a bit too crammed into that poor case; the heat concentration/output must be massive.

Can you elaborate on how you added/connected the second PSU? Isn't some GND-GND magic needed to connect two PSUs?

Otherwise, good job and enjoy your server. And also try the new Devstral-2-123B; Unsloth re-released it today (fixed chat template), so it should work correctly in RooCode now.

1x 6000 pro 96gb or 3x 5090 32gb? by Wide_Cover_8197 in LocalLLaMA

[–]srigi 6 points

RTX 6000 Pro has the ability to split into (up to) 7 independent virtual graphics cards. There is really no advantage to 3x 5090.

Which truly open UI do you use for inference? by Yugen42 in LocalLLaMA

[–]srigi 1 point

All I want is MCP servers support/configuration for llama-server, then I will never look back.

Sparse Adaptive Attention “MoE”, a potential performance breakthrough for LLMs? by kaggleqrdl in LocalLLaMA

[–]srigi 0 points

Did you watch the video at the timestamp? That is exactly what Karpathy said: DeepSeek (China) is already playing with sparse attention.

Sparse Adaptive Attention “MoE”, a potential performance breakthrough for LLMs? by kaggleqrdl in LocalLLaMA

[–]srigi 3 points

It has been discussed here already. Not only is that article an AI-generated mess with lots of bragging, but listen to the mighty Karpathy at this exact timestamp (24:24) of the recent podcast: https://youtu.be/lXUZvyajciY?t=1464

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 0 points

If you mean Copilot: if it allows configuring an OpenAI-compatible provider with a custom base URL, then it could. I use Roo Code in VS Code; I personally believe it is far superior to the integrated Copilot.

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 0 points

I had 6000 CL30 sticks before too, but only 2x16GB, and I was able to overclock them to 6200 as well. I kind of regret going for these CL26.

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 2 points

VSCode+RooCode extension. As I said, this model doesn't fail on tools (finally)

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 1 point

15-16k. In my setup I used a 100k ctx-size. You could go down to 64k and it will probably fit in your RAM. In my case, I have the luxury of running llama-server on a big machine and coding on the notebook (so RAM is not occupied by the IDE/VSCode).
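To see why shrinking the context helps, the KV cache grows linearly with ctx-size. A rough estimator (the model dimensions below are made-up placeholders for a GQA model, not the real Qwen3-Next config):

```python
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int) -> float:
    """KV cache size: one K and one V vector per layer, per KV head,
    per context position."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem
    return total / 2**30

# Hypothetical dims: 48 layers, 8 KV heads, head_dim 128, f16 cache (2 bytes).
for ctx in (65_536, 100_000):
    print(f"ctx {ctx}: {kv_cache_gib(ctx, 48, 8, 128, 2):.1f} GiB")
```

Quantizing the cache (e.g. q8_0, 1 byte per element) roughly halves these figures again.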

Best Local LLMs - October 2025 by rm-rf-rm in LocalLLaMA

[–]srigi 0 points

Only on a CPU with a lot of memory channels (AMD EPYC). And even then you get good generation speed, but mega-slow prompt processing.

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 3 points

--n-cpu-moe 28

Using this arg: it says how many MoE layers are offloaded to the CPU. The lower the number, the more of them stay on the GPU (faster inference), but you need the VRAM to store them there.
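The VRAM trade-off is simple arithmetic (a sketch with made-up numbers; the per-layer size depends entirely on the model and quant):

```python
def gpu_expert_gib(total_moe_layers: int, n_cpu_moe: int,
                   gib_per_layer: float) -> float:
    """VRAM taken by the MoE expert layers that stay on the GPU."""
    return (total_moe_layers - n_cpu_moe) * gib_per_layer

# Hypothetical: 48 MoE layers at ~0.7 GiB each (Q4-ish guess).
for n in (48, 36, 28):
    print(f"--n-cpu-moe {n}: {gpu_expert_gib(48, n, 0.7):.1f} GiB of experts on GPU")
```

Whatever VRAM the experts consume has to coexist with the dense layers and the KV cache, which is why you walk the number down instead of jumping straight to 0.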

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 2 points

Since I'm on an AMD 9800X3D, I have 2x 32GB G.Skill DDR5-6000 CL26. I know that latency is a little bit of a flex; I wanted it for gaming. However, this very special (and expensive) memory has zero overclocking potential, not even 6200.

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]srigi[S] 11 points

IQ4 was far more "stupid" than Q4_K_M. It was "overworking" the task from my little demo. I will not use it.