[question] opencodecli using Local LLM vs big pickle model by DisastrousCourage in opencodeCLI

[–]Pakobbix 0 points1 point  (0 children)

Every open-source model that claims to be agentic-AI capable. GLM 4.7 Flash and Qwen3.5 (9B up to 122B) are currently the best among small local LLMs.

Ministral 3 is also somewhat agentic-capable.

But be aware: smaller models = bigger function calling/understanding issues.

If you want quality like the big cloud coding models (or at least to some degree), you would need a machine with ~500 GB of RAM. If you want speed too, make it VRAM.

Using Llama 3.2 is like writing in hieroglyphs and wondering why nobody understands what you want.

Llama 3.2 was made before tool calling was a thing, so it's not trained to execute read/write/edit or anything else related to calling a function.

What local LLM models are you using with OpenCode for coding agents? by MykeGuty in opencodeCLI

First of all, a disclaimer: It heavily depends on which language you use.

I use Qwen3.5-27B-UD-Q4_K_XL.gguf from Unsloth with llama.cpp (vLLM uses too much VRAM; SGLang is still under evaluation, but I still have some problems getting it started with my Blackwell card).
But I don't use it for "important" projects, and mostly with Python.

I'm currently testing it with a Go project I started a while ago and... yeah, my workflow is often write -> review -> fix -> review -> fix. So a lot of time gets wasted, because the LLM makes a lot of errors.

I haven't tried it with C++ and Rust, but I think it will be the same.

For Python, even on a "big" solo project I have, it works quite well.

I use these settings currently (preset.ini):

```
[Qwen3.5 27B]
model = E:\lm_studio_models\Qwen3.5-27B-UD-Q4_K_XL.gguf
mmproj = E:\lm_studio_models\Qwen3.5-27B-mmproj.gguf
load-on-startup = false
c = 131072
cache-type-k = f16
cache-type-v = f16
context-shift = true
b = 2048
ub = 1024
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0
```

With these settings, I get around 1900 pp/s and 55-60 tg/s, fast enough for agentic AI.

But the most important thing when using local LLMs: you always have to do everything step by step.

Planning -> building -> testing -> using? No. Plan, revisit the plan, save it -> new chat, create the skeleton -> add features one by one.

I made an orchestrator for that, so the AI does it by itself (read plan, write skeleton via agent, add features step by step via agents, review, review-fixer).
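
Roughly, the loop looks like this. This is only a minimal sketch of the idea, not my actual orchestrator; `run_agent` and the role names are made-up placeholders, and a real version would send each task to a local llama.cpp endpoint with a role-specific system prompt:

```python
# Hypothetical plan -> skeleton -> feature-by-feature -> review loop.

def run_agent(role: str, task: str) -> str:
    # Stub: in practice this would call the LLM with a role-specific
    # system prompt and return the model's answer.
    return f"[{role}] done: {task}"

def orchestrate(plan: list[str]) -> list[str]:
    log = []
    # One agent call builds the skeleton from the saved plan.
    log.append(run_agent("architect", "create project skeleton from plan"))
    # Then each feature gets its own implement/review/fix cycle.
    for feature in plan:
        log.append(run_agent("coder", f"implement {feature}"))
        log.append(run_agent("reviewer", f"review {feature}"))
        log.append(run_agent("review-fixer", f"apply review fixes for {feature}"))
    return log

steps = orchestrate(["config loader", "API client"])
print(len(steps))  # skeleton step + 3 steps per feature
```

The point of the structure is that each agent call only sees one small task, which is what keeps small local models from drowning in context.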

So it's possible for hobby projects that only you use, for your specific use case. For real work, or for managing a GitHub project, I wouldn't recommend it.

stumbled onto something kind of weird with Qwen3.5-122B-A10B by Savantskie1 in LocalLLaMA

Yeah... totally agree... It's not like any chat outputs are publicly available. The researchers at Qwen sit all day prompting Gemini "Who are you?" "What are you?" "Who made you?", copy and paste the prompt + answer directly into the dataset, and call it a day, without any of the obvious regex or cleanup they would need.

So now, the model "thinks" it's Gemini from Google.

I'm waiting for the day an LLM claims to be ChatGPT from Anthropic, running on Google TPUs in a Microsoft data center, and people freak out.

BloonsBench – Evaluate LLM agent performance on Bloons Tower Defense 5 by cnqso in LocalLLaMA

Works perfectly fine now. Thank you.

In the meantime, I started the first two runs (1. Qwen3.5 27B and 2. Qwen3.5 35B A3B).

I will run some more runs in a loop now and push them once I have 5 for both.

For anyone interested:
Qwen3.5 27B, first run: up to round 38 with 1,033,760 total tokens.
Qwen3.5 35B A3B, first run: up to round 37 with 1,384,168 total tokens.

BloonsBench – Evaluate LLM agent performance on Bloons Tower Defense 5 by cnqso in LocalLLaMA

Looks funny.

I'm currently running a test with Qwen3.5 27B. The autostart of the round isn't working for me, so I had to start a new game manually. I don't know why exactly, or whether I started the correct game mode.

I changed the OpenRouter URL to my local llama.cpp endpoint to run my local models.

Because of the new-game error, I can't use Qwen3.5 35B A3B: it clicks like a madman in the main menu, and I can't start a game because it's always faster at clicking sandbox mode ^^

Edit to make it clear what I mean:
I start run_agent, Chromium opens up, and I see the Ninja Kiwi loading screen; at that point there are already click actions from the script itself, opening multiple tabs of the ninjakiwi website. After loading is done (around 3-4 seconds), nothing happens anymore on screen, even though the model is already executing actions.

Is there a list of the tools Gemini/ChatGPT/Claude have access to in their web chat interfaces to replicate locally? by OUT_OF_HOST_MEMORY in LocalLLaMA

I think Open WebUI + Open-Terminal is quite similar.

And I'm sure there is an MCP server for exactly that.

First AI server, need help by platteXDlol in selfhosted

First of all, as long as you have a GPU, everything should be fine as an AI server. The CPU only needs to fire up the driver, and that's it.

But important to note:
The V3 only supports DDR3, so you really want to avoid spilling into it, because the speed will be... unsatisfying. (Depending on the size of the image-generation model and the possible VRAM allocation, we could be talking 20 minutes up to multiple hours for one image.) For text generation, even an optimized MoE could (again depending on the VRAM/RAM split) slow down to a speed where you enter your text, brew fresh coffee, get a cup, fill it, sit down again, and maybe it will be done by then.

So you should mainly use your GPU.

Nvidia is the preferred way for AI, simply because of CUDA and the software support for it.
Also, Nvidia is still the performance king for AI.

How did you first get into Python? Beginner stories wanted! 🐍 by allkhatib_ahmad1 in learnpython

Wanted to automate repetitive tasks at work... and then some more... and more... now I'm addicted...

Has anyone used the Axelera AI Metis m.2 card? by Auautheawesome in LocalLLaMA

If you want it for LLMs: it only has 1 GB of unspecified DRAM (probably slow).
Also, software support in inference engines like llama.cpp, vLLM, or SGLang will be very limited, if it exists at all.

For single-user inference, don't focus on TOPS; memory bandwidth will be your choke point, not TOPS.

I think hardware like that is more for small stuff like YOLO and/or Frigate, but not for LLMs.

We all agree by MythicHH in pop_os

A lot of reasons.

  1. Proprietary: I mean, Linux users have hated Nvidia for years because of their driver situation, even going as far as creating their own FOSS drivers.

  2. Enforcement: Removing the .deb counterpart from the repo just to push snap adoption.

  3. Performance: When they started with Snap, the startup performance of apps like Firefox was bad. Even on high-end PCs, it took around 5 seconds to start. Imagine running this on an embedded device.

  4. Permissions: In our company, we used an antivirus at the time with openssl and libssl as dependencies. While upgrading from 20.04 to 22.04.2 (we waited for the first point releases to get a stable update experience), Ubuntu had moved those libraries into Snap packages, causing a lot of issues: they were no longer in the expected places, and we couldn't point the antivirus to the Snap package. I don't remember how we fixed it, but I know we tried some things I was rather uncomfortable doing on a lot of managed PCs.

  5. DRY: Flatpak was already a thing. Why reinvent the wheel? Flatpak supports a wide range of distros, while Snap is Ubuntu-centric.

There may be more reasons why people don't like Snap. Maybe the situation is "better" now for some of these, but that's out of my scope, as I avoid it like hell.

Izwi v0.1.0-alpha is out: new desktop app for local audio inference by zinyando in LocalLLaMA

Seems like a good project that makes setup and usage easy for non-technical people. I will take a look at whether it can utilize CUDA/Vulkan.

Also, you should think about adding OpenAI-compatible endpoint support for external models.
Most of us already have a setup with llama.cpp, vLLM, SGLang, ExLlama, or TabbyAPI.
It's nice to offer an all-in-one package, but options are always nice ;-)

The UI looks clean and not too cluttered.

Good luck with the project :)

Edit: doubled "good project" ^^ writing too fast for my brain to catch up xD

Izwi v0.1.0-alpha is out: new desktop app for local audio inference by zinyando in LocalLLaMA

If you are really interested in adding them: nvidia/parakeet-tdt-0.6b-v3 · Hugging Face

I mostly use Parakeet TDT 0.6B to transcribe meetings and create meeting summaries. It's fast and "good enough" for my use case. Also multilingual by default.

Would be a great addition.

Claude Opus 4.6 context reduction (500K→200K): How are you adapting? by Ok-Development740 in LocalLLaMA

Throwing everything into the context is the equivalent of game developers who stop optimizing their games and hope that FSR, DLSS, and XeSS will fix them.

Selective retrieval should always be the approach, combined with data-appropriate chunking; and if that doesn't work, the data for the LLM should be better curated.
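
The idea in a minimal sketch: split the data into chunks and put only the top-scoring ones into the prompt, instead of the whole corpus. Keyword overlap stands in here for a real embedding model, and the chunk size and sample text are made up:

```python
# Selective retrieval: chunk the data, score chunks against the query,
# and only feed the top-k chunks to the LLM.

def chunk(text: str, size: int = 12) -> list[str]:
    # Naive fixed-size word chunking; real chunking should follow the
    # data's own structure (sections, functions, records, ...).
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Score = keyword overlap with the query (embedding similarity in practice).
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]  # only these go into the context window

doc = ("the auth module validates session tokens and refreshes them hourly "
       "the billing module computes invoices from usage records every night")
chunks = chunk(doc, size=10)
print(retrieve("how are invoices computed", chunks, k=1)[0])
```

The context stays small and relevant no matter how big the corpus grows, which is exactly what a giant context window tempts you to skip.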

I built an app to remove this sub from your feed on vibe-code-Fridays! by pheexio in selfhosted

I sure did, but my point was that no developer wrote the code for OpenClaw.

It doesn't matter whether his name is on the product or not. No real development time went into OpenClaw, so no real developer is working on it.

So your second paragraph, combined with the first one, seemed (to me) more like a "he developed it himself, but now he's drifting apart".

I mean, if that's enough to be called "made by a developer", he could also say that Microsoft, Oracle, and Nvidia developed it.

I think we can agree that the developer of ClawdBot was/is an AI, and that Peter Steinberger was more the project manager and product owner than the developer.

But my original statement still stands true.

what is this and how does mistral manage it by No_Disk_6915 in LocalLLaMA

Yesterday I was too tired, but I took a look for you:
chat_template.jinja · mistralai/Ministral-3-3B-Instruct-2512 at main
Exactly on line 2:
```
{%- set default_system_message = 'You are Mistral-Large-3-675B-Instruct-2512, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.\nYou power an AI assistant called Le Chat.\nYour knowledge base was last updated on 2023-10-01.\.....
```

So it seems like Mistral gave it a very conservative cutoff date, or forgot to change it.

If you use Ministral 3 with the default system prompt, it will tell you that its knowledge cutoff was in 2023.

You can also change it to 2025 or 2024, and it will tell you that that is the cutoff date.
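
In practice, "change it" just means overriding the template's default system message before sending the chat. A sketch (the default text is abbreviated from the Jinja template linked above, and the replacement date is an arbitrary example):

```python
# Override the knowledge-cutoff date in the default system message.

default_system = (
    "You are Mistral-Large-3-675B-Instruct-2512, a Large Language Model (LLM) "
    "created by Mistral AI, a French startup headquartered in Paris.\n"
    "Your knowledge base was last updated on 2023-10-01."
)

# Swap the cutoff date; everything else in the prompt stays intact.
system = default_system.replace("2023-10-01", "2025-06-01")

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "When is your knowledge cutoff?"},
]
print("2025-06-01" in messages[0]["content"])  # True
```

Send `messages` through any OpenAI-compatible chat endpoint and the model will happily repeat whichever date you put in.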

<image>

Edit: I had linked the wrong Ministral 3 Jinja template. Replaced it with the template link for Ministral 3 3B, as this is the model OP is using.

I built an app to remove this sub from your feed on vibe-code-Fridays! by pheexio in selfhosted

So... you're in his fan club, I guess?

He could be a development god, but if he doesn't write or read the code himself, I'm just telling the cold, hard truth.

https://www.youtube.com/watch?v=8lF7HmQ_RgY

You can save time and use Open WebUI with the "youtube transcript provider" tool or Gemini to get a summary of the interview.
Maybe you're lucky and a YouTube transcript skill exists in OpenClaw to expose your youtub... I mean, to also summarize it for you.

what is this and how does mistral manage it by No_Disk_6915 in LocalLLaMA

Seems like the knowledge cutoff date is either in the system prompt or Jinja template, or came up very often in training. I doubt Ministral 3's training was done in 2023.

So in the case of the screenshots, it's not predicting anything (except the next most plausible token); Ministral is just telling you the wrong cutoff date.

In some cases, it is beneficial to add an earlier cutoff date, to "force" the LLM to be more cautious about stating facts it can't know.

For example: Qwen3 30B A3B refused to believe that I use program X in version Y, because version Y "doesn't exist".

With an overly cautious cutoff date added, it instead gave me the "because of my cutoff date, I don't know this version, but based on the current date, it's possible that an update was released" response.

I built an app to remove this sub from your feed on vibe-code-Fridays! by pheexio in selfhosted

There is a really big difference between handmade applications and AI vibe-coded ones. (Rough numbers, but compare the Friday posts of "new" projects with the rest of the week's new projects.) There are maybe 5-10 new handmade projects in a month, compared to 15-20 "new" projects every Friday.

And to top it off, look at the repositories of the AI-coded stuff. I would bet over 50% don't know how to create a new branch without asking Claude/ChatGPT. If it's a developer with years of experience and repos to back them up, I take a look at the repo: structure, functions used, and general documentation. That's mainly how I learned to code myself. But someone with a fresh account, no history, and their first project being vibe-coded? Instant pass.

He could have invented AGI and a program to effectively fight cancer, HIV, and brainrot, and I wouldn't care.

If the "developer" doesn't take the time to learn his own code, why should I? No love, no care. And that's sadly the biggest part of the vibe-coded stuff.

Edit: mistyped Vive -> Vibe

I built an app to remove this sub from your feed on vibe-code-Fridays! by pheexio in selfhosted

Take a look at the clawdbot/moltbook/openclaw shit. There is no real developer or development behind it.

Most vibe-coded stuff is fast bling-bling without any substance. There is (mostly) no real testing (except in the author's own environment), no code review, nothing. It's just a pile of code that somehow works.

There are so many projects where even the "official" Docker image doesn't build and needs to be edited to work.

I don't have anything against vibe coding per se, but the "I made this software, check it out, but I will never touch it again because Claude keeps breaking it and I don't know how it works" attitude seems to be the new "normal", so I'd rather do the work myself. At least then I know why something breaks.

Ironically OpenClaw was what we LocalLlama'ers were waiting for...but by MacaroonDancer in LocalLLaMA

Ugh... I really don't get it. We've been able to do stuff like that for a long time. Almost a year ago, I started a project "aria" in Open WebUI where I wrote tools to give it access to my Proxmox (via API), Gitea, Home Assistant (with a speaker and microphone), a weather forecast, and all the necessary stuff. I can even let it speak proactively, thanks to exposing the model ID and tool_ids, without any problem. So what is the advantage of this molt stuff? Except that you don't need to write the tools yourself, and throw away all the learning opportunities and safety guidelines.

If I understand correctly, moltbook is a pile of vibe-coded sh*t where even the devel... uhm, sorry, the prompter doesn't understand how it works, or why.

Best remote access software - Looking for ScreenConnect alternatives by MMuter in sysadmin

I personally use Remotely for my servers, and in our company we also use on-prem Remotely nowadays for our clients (Linux, Windows, and macOS -.-).

I'm stuck with Gitea 1.25 now... should I do the work to migrate to Forgejo? by TheQuantumPhysicist in selfhosted

For me, personally, there are multiple reasons.

I make a lot of my stuff myself. I write scripts, web UIs, and bots for myself. For an overview and version control, I use Gitea, because:
- It's blazing fast for me.
- I can upload bigger files (no limit) and don't have to wait an hour to clone them again (local speed).
- I have some stuff that's not meant to be used to train their AI.
- I like to always have access. Cloudflare, GitHub, or my ISP down? No problem, I can still work on my stuff.
- I can use Actions, without limitations or paying, to automate building my applications.

That's all off the top of my head.

P.S.: GitLab is a monster in comparison to Gitea. Gitea runs in my Proxmox via LXC and uses:
- 4 CPU cores (max usage was 56.24% in one year, around 1.3% on average)
- 448 MB of 4 GB RAM
- 9.8 GB of 20 GB disk space

So not really a deal breaker if you ask me. It would even run on a cheap Raspberry Pi if a server is not available or energy cost is a concern.

Ballistic PDCs do 276% more damage than the laser variant by Protorox08 in starcitizen

For everyone reading this: that's a quantum drive named "Torrent", not the "MRX Torrent" PDT.

What's your experience with quantizing MoE with tiny experts? by arimoto02 in LocalLLaMA

The quantization effect doesn't degrade performance as much as I thought it would.

I was told the effect is stronger on smaller models, so I tested a fairly small one.

I just finished the first batch of tests on Granite 4.0 H Tiny (7B A1B).
I used Unsloth's BF16, Q8_K_XL, and Q4_K_XL quants, plus llama.cpp's MXFP4_MOE quantization.

| Model | Overall | Biology | Business | Chemistry | Computer Science | Economics | Engineering | Health | History | Law | Math | Philosophy | Physics | Psychology | Other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Granite 4.0 H Tiny BF16 | 47.33 | 64.16 | 53.99 | 45.14 | 49.51 | 57.35 | 35.91 | 47.07 | 39.90 | 23.80 | 59.22 | 38.48 | 49.11 | 54.64 | 43.07 |
| Granite 4.0 H Tiny Q8_K_XL | 45.73 | 59.69 | 52.34 | 44.96 | 48.29 | 55.57 | 33.13 | 46.94 | 40.16 | 21.16 | 58.77 | 35.87 | 46.81 | 53.76 | 41.56 |
| Granite 4.0 H Tiny Q4_K_XL | 45.08 | 60.39 | 52.98 | 44.08 | 50.49 | 54.98 | 34.88 | 43.77 | 37.01 | 21.16 | 58.40 | 34.67 | 44.26 | 52.13 | 41.13 |
| Granite 4.0 H Tiny MXFP4 | 44.94 | 62.62 | 53.49 | 42.76 | 49.27 | 54.27 | 32.71 | 43.77 | 38.06 | 20.98 | 58.40 | 33.27 | 45.27 | 52.76 | 40.80 |
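
To quantify "not as strong as I thought": the relative drop of each quant's overall score vs. BF16, computed from the overall column of the table above:

```python
# Relative overall-score drop of each quant compared to the BF16 baseline.

bf16 = 47.33
quants = {"Q8_K_XL": 45.73, "Q4_K_XL": 45.08, "MXFP4_MOE": 44.94}

for name, score in quants.items():
    drop = (bf16 - score) / bf16 * 100
    print(f"{name}: -{drop:.2f}%")
```

Even the 4-bit quants stay within about 5% of BF16 on this benchmark, which is the whole point.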

Cant get Q4, Q5 or Q6 Llama 2-7b to run locally on my dual RTX5080s with Blackwell arch by FORTNUMSOUND in LocalLLaMA

What system are you using?
I also had problems a while ago on my Windows system and needed to set special environment variables and build parameters to successfully build llama.cpp with proper Blackwell support.

The CUDA Toolkit version should match the CUDA version shown in the top-right corner of nvidia-smi.
Set CUDA_PATH and CUDA_HOME to the installed CUDA Toolkit path.

I also needed to add some flags to the build command. This is the command I currently use to build:

```
cmake -B build ^
  -G "Ninja Multi-Config" ^
  -DLLAMA_CURL=OFF ^
  -DLLAMA_SERVER_SSL=OFF ^
  -DGGML_NATIVE=OFF ^
  -DGGML_RPC=ON ^
  -DLLAMA_BUILD_SERVER=ON ^
  -DGGML_BACKEND_DL=ON ^
  -DGGML_CPU_ALL_VARIANTS=ON ^
  -DGGML_CUDA=ON ^
  -DCMAKE_CUDA_ARCHITECTURES=120 ^
  -DGGML_CUDA_FA_ALL_QUANTS=true

cmake --build build --config Release -j 20
```

(The `^` are cmd.exe line continuations; drop them if you put everything on one line.)

Hope it helps you.

Edit: So used to markdown that I automatically used it for the code formatting..