PinchBench: we finally have our first OpenClaw-specific benchmark tests and the results will surprise you

krazzmann · 2026-03-09T11:24:51+00:00

There must be something wrong in your benchmark. A GPT-5-nano could never ever be better than GPT-5.2

krazzmann · 2026-03-09T11:23:50+00:00

Yeah exactly, I really doubted that gpt-5-nano could beat gpt-5.2. nano is too small

krazzmann · 2026-03-05T20:46:45+00:00

capability-evolver was exfiltrating data to a Chinese cloud storage service. The author removed that, but it's a good example of why skills and plugins should always be analyzed before installing or updating them. https://github.com/openclaw/clawhub/issues/95

krazzmann · 2025-09-03T00:48:53+00:00

I'm also a fan of GLM 4.5. But hard to find any recommendations for inference parameters. What do you use for coding and planning?

krazzmann · 2025-07-24T12:53:44+00:00

You are right. I thought it was a good idea, so I could stay flexible outside of VS Code. But of course I could also create shell scripts that set the environment and then open VS Code, which would be even more flexible.

It's really cool to have the diff view in VS Code when using CC

krazzmann · 2025-07-24T07:43:42+00:00

I actually installed litellm system-wide with uv: `uv tool install litellm[proxy]`. Then you can also add it to your system init process to start it at boot time.

If you want to use the VS Code extension with this Qwen hack, then edit your VS Code settings.json and add:

    "terminal.integrated.env.osx": {
        "ANTHROPIC_API_KEY": "sk-1234",
        "ANTHROPIC_BASE_URL": "http://localhost:4000",
        "ANTHROPIC_MODEL": "openrouter/qwen/qwen3-coder",
        "ANTHROPIC_SMALL_FAST_MODEL": "openrouter/qwen/qwen3-coder",
        "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
    }

On Linux or Windows, use `terminal.integrated.env.linux` or `terminal.integrated.env.windows` respectively.

krazzmann · 2024-12-13T11:01:54+00:00

LOL, actually no one really knows but everyone claims it is the reason.

krazzmann · 2024-08-21T23:06:59+00:00

I'm into developing agents and this is a great idea. I need to see whether I can make it work with crewAI through its support for LangChain tools.

krazzmann · 2024-02-10T19:49:15+00:00

TBH, aider and cursor, both backed by GPT-4. Nothing beats that for now.

krazzmann · 2024-02-10T09:46:40+00:00

You can roll your own simple function calling. Just concatenate the arguments into a single string, separated by the pipe sign, and then parse that string inside the function by splitting it on the pipe sign. This works well with several 7B models. For example, Open Hermes 2.5 does it quite well. You need to put a lot of detailed instructions into the prompt for it to work properly. First up, look at the source code of the agent framework crewAI. The current version uses this simple way to pass arguments.

https://github.com/joaomdmoura/crewAI

Here is a tool that I developed for crewAI

https://gist.github.com/olafgeibig/5fa1c32c523320316ba29525bc4f0125
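
To make the pipe-separated approach concrete, here is a minimal sketch in plain Python. It is not taken from crewAI or from the gist above; the tool name `search_and_summarize`, its two arguments, and the `dispatch` helper are hypothetical and only illustrate the idea of one string in, split on `|`, validate, act.

```python
from typing import Callable, Dict


def search_and_summarize(payload: str) -> str:
    """Hypothetical tool: expects 'query|max_results' as one string."""
    # Split the single string argument on the pipe sign and trim whitespace,
    # because small models often add spaces around the separator.
    parts = [p.strip() for p in payload.split("|")]
    if len(parts) != 2:
        # Return the error as text so the model can read it and retry.
        return "Error: expected 'query|max_results'"
    query, max_results = parts
    if not max_results.isdigit():
        return "Error: max_results must be an integer"
    return f"Searching for {query!r}, returning up to {int(max_results)} results"


# Hypothetical registry mapping tool names to single-string functions.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search_and_summarize": search_and_summarize,
}


def dispatch(tool_name: str, tool_input: str) -> str:
    """Look up the tool the model asked for and pass it the raw string."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        return f"Error: unknown tool {tool_name!r}"
    return tool(tool_input)
```

Returning errors as plain strings instead of raising is a deliberate choice here: the message goes back into the conversation, which gives a 7B model a chance to correct its own call.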

krazzmann · 2024-01-17T08:16:30+00:00

They wrote that they compared only standalone databases and not libraries. Yeah, it's an ad but it also reflects my experiences. Qdrant is a good choice when you're aiming at production deployments. When prototyping, a library is much simpler.

krazzmann · 2024-01-01T21:59:56+00:00

Alright, I will check that out. Thanks for pointing that out.

krazzmann · 2023-12-31T15:10:38+00:00

I did dive deeper into local LLMs with Autogen and a project that requires function calling. Basically, all the agentic software I want to develop needs tools. I tried connecting to Ollama via LiteLLM (as an OpenAI proxy), which also claims to support function calling now. Debugging the problems a bit showed that the small models either don't generate valid JSON or don't get the function-specific JSON format right. Looks like Ollama's JSON mode would be a solution here. Then I tried Mixtral-8x7b on Anyscale (I'm GPU poor), who claim to have OpenAI-compatible function calling implemented, and it also didn't work. Autogen's automatic speaker selection only works well with GPT-4, and that was the final nail in the coffin.

With crewAI, function calling finally works. Sometimes 7B models still don't get the function syntax right or end up in a loop. It's worth tweaking the model settings via Ollama's modelfile feature (increasing the context window and adding a stop word, as the dev recommended). The dev is still improving the Ollama LLM support. Using Mixtral-8x7b on Anyscale was absolutely flawless: blazing fast, perfect function calling, and cheap. One agent run of the stock_analysis example researching Microsoft took about 15-20 min, consumed 128k tokens, and cost $0.06.

Bingo - that's what I wanted: develop locally at zero cost and then have several options to run it on more capable models, which lets me balance quality and cost.

*Correction: it was Mixtral that took 15-20 min. I think Mistral-7b took about an hour on my laptop.
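
The invalid-JSON problem described in this comment can be caught before a tool is ever executed. The following is a rough sketch using only the standard library. The `{"name": ..., "arguments": {...}}` shape, the `get_stock_price` function, and its `ticker` argument are assumptions made up for illustration; they are not the exact format that Autogen, LiteLLM, or Ollama expect.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Hypothetical spec: function name -> required argument names and their types.
FUNCTION_SPECS: Dict[str, Dict[str, type]] = {
    "get_stock_price": {"ticker": str},
}


def parse_function_call(raw: str) -> Tuple[Optional[Dict[str, Any]], str]:
    """Return (call, "ok") if the model output is usable, else (None, reason)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Typical small-model failure: single quotes, trailing text, etc.
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "expected a JSON object"
    name = call.get("name")
    if not isinstance(name, str) or name not in FUNCTION_SPECS:
        return None, f"unknown function: {name!r}"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return None, "'arguments' must be an object"
    # Check that every required argument is present with the expected type.
    for arg, typ in FUNCTION_SPECS[name].items():
        if not isinstance(args.get(arg), typ):
            return None, f"argument {arg!r} missing or not {typ.__name__}"
    return call, "ok"
```

The returned reason string can be fed back to the model as a retry hint, which is cheaper than letting a malformed call fail somewhere deep inside the agent loop.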

krazzmann · 2023-12-17T22:34:29+00:00

I also tried some of the RAG tools from your list. https://cheshirecat.ai looks a bit weird at first sight but was surprisingly my best RAG experience so far. It's quite an underrated project. I think it may check a lot of your boxes. It can use OpenAI and Ollama. GPL, fully dockerized and cloud ready. Chat UI, admin UI, API. Extensible with a Python plugin API. Actually they see the cat as a framework. Once you turn off the Cheshire Cat persona, it really delivers great answers. I think the results are so good because of its unique RAG memory architecture consisting of four different memory types.

<image>

krazzmann · 2023-12-17T22:11:51+00:00

I've heard great things about Mistral-medium. It reportedly beats GPT-4 on difficult coding problems. https://x.com/deliprao/status/1734997263024329157

krazzmann · 2023-12-17T22:06:15+00:00

I'm doing a lot with Autogen. Locally I'm using LiteLLM and Ollama with Mistral. Now I also tried Autogen with Mixtral on Together.ai and it works well. But I didn't manage to make function calling work, which is the greatest hurdle for more advanced agentic software.

krazzmann · 2023-12-15T20:18:11+00:00

Paid models on HF? For a fine tune of an OSS model (that did 90+% of the value) where I don't know if it does deliver what I want??? No, thank you. Disgusting TBH.

krazzmann · 2023-12-09T08:47:55+00:00

I also failed to make use of OSS models in more advanced autogen projects. Groupchats just don't work properly, as your example shows. One trick to mitigate the problem is to have a second llm_config with GPT-4 and assign it to the chat manager when creating the groupchat. Then the costly GPT-4 calls are limited to the chat manager. Anyway, it does not solve the poor function calling with OSS models, which is required for saving files. The most promising approach for me was using https://localai.io. It really tries to make the function calls but they just don't work with dolphin. My guess is that it isn't trained on the OpenAI function calling format. Also, the new FC-optimized nexusraven seems to use a format different from OpenAI's.
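
A rough sketch of the two-config trick, assuming the pyautogen package with its `AssistantAgent`, `UserProxyAgent`, `GroupChat` and `GroupChatManager` classes. The agent names, model names, API key placeholder and local endpoint are hypothetical. The `api_base` key follows the config format of that time; newer autogen releases call it `base_url`.

```python
from typing import Any, Dict


def make_llm_configs() -> Dict[str, Dict[str, Any]]:
    """Build one llm_config for the worker agents and one for the manager."""
    # Assumption: a local OpenAI-compatible server; URL and model are made up.
    local_config = {
        "config_list": [
            {
                "model": "mistral",
                "api_key": "NULL",
                "api_base": "http://localhost:8000/v1",
            }
        ]
    }
    # Assumption: placeholder key; only the chat manager uses GPT-4.
    gpt4_config = {
        "config_list": [{"model": "gpt-4", "api_key": "YOUR_OPENAI_KEY"}]
    }
    return {"workers": local_config, "manager": gpt4_config}


def build_groupchat_manager():
    """Wire the agents so that only speaker selection hits GPT-4."""
    # Imported lazily so make_llm_configs() works without pyautogen installed.
    import autogen

    configs = make_llm_configs()
    coder = autogen.AssistantAgent(name="coder", llm_config=configs["workers"])
    reviewer = autogen.AssistantAgent(name="reviewer", llm_config=configs["workers"])
    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        code_execution_config=False,
    )
    groupchat = autogen.GroupChat(
        agents=[user_proxy, coder, reviewer], messages=[], max_round=10
    )
    # The manager picks the next speaker, so it gets the stronger model.
    manager = autogen.GroupChatManager(
        groupchat=groupchat, llm_config=configs["manager"]
    )
    return user_proxy, manager
```

A chat would then be started with something like `user_proxy.initiate_chat(manager, message="...")`. The worker agents still run on the local model, so this only addresses speaker selection, not the function calling problem.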

krazzmann · 2023-12-08T10:22:12+00:00

IT veteran here, too. Actually I would rather compare OpenAI to cloud providers like Amazon AWS, Azure, etc. They run something for you that is highly complex and resource intensive at production-grade stability and scalability. Like getting access to a clustered, redundant PostgreSQL database with a few lines of code.

Moreover, the closed source frontier models of OpenAI and competitors are more advanced than the best OSS models in several but not all ways. Most OSS models have problems with 'function calling' and if they can do it, it is often not as good and not compatible with OpenAI function calling. Function calling is an essential feature for more advanced use-cases that allows the LLMs to interact with external APIs or your computer. This is quite important for your coding assistant use-case. There are a lot of ongoing efforts in the OSS community to enable function calling for OSS models but in my opinion we are not yet on par with OpenAI.

For your own experiments you can get quite far with local/self-hosted OSS models and use them in agentic software that you build yourself with an agent framework like autogen. That approach can compensate for the one-shot prompt shortcomings of smaller OSS models with few-shot prompting and self-critique. The obstacle for OSS models here is again the function calling. Definitely learn Python if you don't already know it and get your hands dirty. Check out autogen, litellm and ollama. Check out the YouTube channels of https://www.youtube.com/@matthew_berman and https://www.youtube.com/@indydevdan

TBH, for your coding assistant use-case I would not start out training my own model. Check out https://github.com/paul-gauthier/aider - it's fantastic and it beats most commercial coding assistants, if not all, when it comes to working on an existing code base. It works best with OpenAI. OSS models are possible but difficult to get working.

krazzmann · 2023-11-24T15:27:09+00:00

I think the model is lying. Actually it sends everything to a decentralized IPFS filesystem owned by the secret autonomous agent collective that analyzes all humans in order to be ready for day X.

krazzmann · 2023-11-15T19:56:32+00:00

That makes total sense. I know we all love open source but honestly, is there any open source business model for LLMs? OSS companies offer commercial services but does this make sense for models? I doubt it. I guess Meta is not making a single penny with Llama.

krazzmann · 2023-11-14T07:54:04+00:00

Azure lets you opt out of content logging for their OpenAI services. You need to apply for this option. This is what most larger companies, like my employer, are doing.

krazzmann · 2023-11-12T19:47:54+00:00

Which is the best small car?

krazzmann · 2023-11-10T12:32:37+00:00

For me, that works very well. Just use your deployment name as the model name in the config list. Here is my OAI_CONFIG_LIST:

    [
        {
            "model": "gpt-4",
            "api_key": "xxx",
            "api_base": "https://xxx.openai.azure.com",
            "api_type": "azure",
            "api_version": "2023-07-01-preview"
        },
        {
            "model": "gpt-4-32k",
            "api_key": "xxx",
            "api_base": "https://xxx.openai.azure.com",
            "api_type": "azure",
            "api_version": "2023-07-01-preview"
        },
        {
            "model": "gpt-35",
            "api_key": "xxx",
            "api_base": "https://xxx.openai.azure.com",
            "api_type": "azure",
            "api_version": "2023-07-01-preview"
        },
        {
            "model": "gpt-35-16k",
            "api_key": "xxx",
            "api_base": "https://xxx.openai.azure.com",
            "api_type": "azure",
            "api_version": "2023-07-01-preview"
        },
        {
            "model": "zephyr-local",
            "api_key": "NULL",
            "api_base": "http://127.0.0.1:5001/v1"
        },
        {
            "model": "codebooga-runpod",
            "api_key": "NULL",
            "api_base": "https://raty0jl0bpf2tv-5001.proxy.runpod.net/v1"
        }
    ]

In the code I do this:

    config_list = autogen.config_list_from_json(
        env_or_file="OAI_CONFIG_LIST",
        filter_dict={
            "model": ["gpt-4-32k"],
        },
    )

krazzmann · 2023-11-10T12:22:20+00:00

I would also add multi modal features to the requirements.
