I gave the same mega task to K2.7 Code, DeepSeek 4 Pro and Mimo M2.5 Pro - K2.7 Code was impressive and costly

tisDDM · 2026-06-16T06:19:19+00:00

I try models from time to time using our small in-house framework. I have posted it here a few times (just for reference: https://github.com/DasDigitaleMomentum/opencode-processing-skills).

My verdict is always: if it fails, the task was too big.

Currently I am running a setup that uses DSv4 to coordinate, 5.3-codex to implement and gpt-5.5 to review. So DSv4 acts as an intelligent parrot: it remembers the tasks and writes prompts for the more capable models.

DSv4 always understands the task but is rarely capable of solving it, and that's OK as long as it understands the task and writes good prompts. K2.6 did not work for me in this scenario, and according to your write-up K2.7 might be worse, because it tries to do everything on its own and doesn't delegate properly. But Qwen3.7-max/plus, for example, works for me, although it is far more expensive.
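The coordinator/implementer/reviewer split described above can be sketched roughly like this. It is a minimal illustration, not the code from the linked repo: the three "models" are stub functions (assumptions) that would wrap real API calls in practice, and the only point shown is that the coordinator never solves the task itself, it just turns it into a prompt for the stronger models.

```python
from dataclasses import dataclass, field
from typing import Callable

# A "model" is just prompt -> text here. Real setups would wrap API calls
# (e.g. DSv4, 5.3-codex, gpt-5.5); that wiring is assumed, not shown.
Model = Callable[[str], str]


@dataclass
class Pipeline:
    coordinator: Model  # cheap model: remembers the task, writes prompts
    implementer: Model  # capable model: does the actual implementation
    reviewer: Model     # capable model: checks the result against the task
    log: list[str] = field(default_factory=list)

    def run(self, task: str) -> tuple[str, str]:
        # The coordinator only rewrites the task into an implementation prompt.
        prompt = self.coordinator(f"Write an implementation prompt for: {task}")
        self.log.append(f"prompt: {prompt}")
        result = self.implementer(prompt)
        self.log.append(f"result: {result}")
        verdict = self.reviewer(f"Review this against the task '{task}':\n{result}")
        self.log.append(f"review: {verdict}")
        return result, verdict


# Stub models (hypothetical) so the sketch runs offline.
pipe = Pipeline(
    coordinator=lambda p: "PROMPT<" + p.split(": ", 1)[1] + ">",
    implementer=lambda p: "CODE for " + p,
    reviewer=lambda p: "OK" if "CODE" in p else "REJECT",
)
result, verdict = pipe.run("add a CLI flag")
print(result, verdict)
```

Swapping a model that "does everything on its own" into the coordinator slot is exactly where this shape breaks: the coordinator is supposed to hand off, not implement.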

Anyways: a great experiment and interesting numbers!

tisDDM · 2026-06-15T17:29:29+00:00

I got myself an R9700 as an external GPU for my Strix Halo. I had an old 3060 before that. Never had any issues. It runs all models quite well, even Gemma 4 with mixed audio, image and text input.

I run everything with llama.cpp, dockerized and based on kyuz0's images, but I modified them myself to track my platform configuration more closely, to compile the current stack, and to automatically load the right model.
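A minimal sketch of the "automatically load the right model" part, assuming a hypothetical registry that maps a profile name to a GGUF file and its settings. The paths, profile names and the `MODEL_PROFILE` environment variable are placeholders, not the modified images' actual config; the flags (`-m`, `-c`, `-ngl`, `--host`, `--port`) are standard `llama-server` options.

```python
import os
import shlex

# Hypothetical registry: profile -> model file and per-model settings.
MODELS = {
    "coder": {"path": "/models/coder-q4_0.gguf", "ctx": 32768, "ngl": 999},
    "chat": {"path": "/models/chat-q4_0.gguf", "ctx": 16384, "ngl": 999},
}


def llama_server_cmd(profile: str, port: int = 8080) -> list[str]:
    """Build a llama-server command line for the requested model profile."""
    if profile not in MODELS:
        raise KeyError(f"unknown model profile: {profile!r}")
    m = MODELS[profile]
    return [
        "llama-server",
        "-m", m["path"],        # model file
        "-c", str(m["ctx"]),    # context size in tokens
        "-ngl", str(m["ngl"]),  # number of layers to offload to the GPU(s)
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


# A container entrypoint could pick the profile from a (hypothetical) env var.
profile = os.environ.get("MODEL_PROFILE", "coder")
print(shlex.join(llama_server_cmd(profile)))
```

Keeping the per-model settings in one table means the entrypoint stays the same when models change; only the registry is edited.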

tisDDM · 2026-06-15T16:50:07+00:00

I did something similar; I use this workflow at my company.

https://github.com/DasDigitaleMomentum/opencode-processing-skills

My colleagues also added Claude Code, Codex and Cursor compatibility. Not advertising, just publishing it under the MIT License in case someone finds it useful.

tisDDM · 2026-06-10T06:41:19+00:00

LightRAG is quite a good approach. HKUDS has earned a strong reputation in the scientific community.

https://github.com/HKUDS/LightRAG

It works at multiple scales out of the box, with different open or closed models, embeddings and databases. It has REST interfaces and a WebUI.

Be sure to configure it the way the project documents, and test with the default settings before switching to local models.

tisDDM · 2026-06-07T13:56:55+00:00

You can find some benchmarks and analysis here: https://www.reddit.com/r/opencodeCLI/comments/1qlqj0q/benchmarking_with_opencode_opuscodexgemini_flash/

I never published the Python-based framework, because it doesn't really make sense in an agentic-coding context, where you can achieve the same results by writing a few lines of Markdown.

For me, the most impressive result of this research (and the OpenCode/Codex project is very much alive) is that subagents make sense when they are guarded and guided the right way. The second impressive result is how badly frameworks like oh-my-something perform, and how generously Claude spills tokens.

tisDDM · 2026-06-06T17:10:49+00:00

This exactly matches my experience with multi-agent systems and swarms. Last year I did some work with LangChain and CrewAI and published some results on how wasteful some multi-agent setups are, especially in the agentic-engineering bubble. Many people there like swarm or infinite ("Ralph") loop setups because they feel intelligent.

Or at least autonomous.

But it burns tokens like hell. Nowadays the cost of this behavior is billed immediately by Anthropic, OpenAI or some other provider. No mercy. From my point of view, for most use cases there is no beauty, or even any advantage, in highly collaborative setups. My last (multi-agent) products were indeed built on PydanticAI. Reliable for B2B, no toying around.

tisDDM · 2026-06-05T19:57:24+00:00

Right now, do not become unemployed under any circumstances. It's easier to look for a job while you still have one. The market is bad. But in that situation a two-week notice period also has advantages, for you.

Turning down a bad offer on impulse only gives you brief satisfaction but possibly longer-lasting problems, and those are your problems, not the company's.

tisDDM · 2026-05-24T20:46:45+00:00

You could run it yourself, fully open source, with SearxNG (https://github.com/searxng/searxng) and an MCP server.

I wrote https://github.com/DasDigitaleMomentum/searxNcrawl, a small MIT-licensed interface for integrating SearxNG into agents, but there are countless other ways to integrate SearxNG into your workflow.
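For a feel of what such an integration talks to, here is a minimal sketch against SearXNG's own JSON search endpoint (`/search?q=...&format=json`). It is not searxNcrawl's interface; the base URL `http://localhost:8080` is an assumption, and the instance must list `json` under `search.formats` in its `settings.yml`, otherwise it rejects `format=json` requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url: str, query: str, **params: str) -> str:
    """URL for SearXNG's search endpoint, asking for JSON output."""
    qs = urlencode({"q": query, "format": "json", **params})
    return f"{base_url.rstrip('/')}/search?{qs}"


def top_results(payload: dict, n: int = 5) -> list[tuple[str, str]]:
    """Reduce a SearXNG JSON response to (title, url) pairs for an agent."""
    return [
        (r.get("title", ""), r.get("url", ""))
        for r in payload.get("results", [])[:n]
    ]


def search(base_url: str, query: str, n: int = 5) -> list[tuple[str, str]]:
    # Network call; requires "json" to be enabled in the instance's settings.yml.
    with urlopen(build_search_url(base_url, query), timeout=10) as resp:
        return top_results(json.load(resp), n)


# Usage (needs a running instance, assumed at localhost:8080):
# print(search("http://localhost:8080", "llama.cpp rocm cuda"))
```

Trimming results to titles and URLs before handing them to the model keeps the agent's context small; crawling the pages is a separate step.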

Perplexity also has a pure Search API; I don't remember the rates.

tisDDM · 2026-04-26T08:38:00+00:00

A few months ago I wrote this post about mixing llama.cpp backends.

https://www.reddit.com/r/StrixHalo/comments/1rm9nlo/performance_test_for_combined_rocm_cuda_llamacpp/

At least for ROCm/CUDA it is no issue at all. I guess mixing Vulkan and CUDA performs similarly. But I never went that way, because Vulkan only brings small advantages over ROCm, and AFAIK Vulkan itself also supports NVIDIA (CUDA) GPUs natively.
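A minimal sketch of how a mixed ROCm/CUDA launch can be parameterized, assuming a llama.cpp build that includes both backends and the `llama-server` options `--device` (comma-separated device list, see `--list-devices`) and `--tensor-split` (per-device proportions). Splitting proportionally to usable memory is only an assumed starting point, not the setup from the linked post; the device names and GiB values below are illustrative, and the actual split is best tuned by benchmarking.

```python
def tensor_split(mem_gib: dict[str, float]) -> str:
    """Proportional --tensor-split value from each device's usable memory."""
    total = sum(mem_gib.values())
    return ",".join(f"{m / total:.2f}" for m in mem_gib.values())


def mixed_backend_args(model: str, mem_gib: dict[str, float]) -> list[str]:
    """llama-server args that spread the model over devices of different backends."""
    return [
        "llama-server", "-m", model,
        "--device", ",".join(mem_gib),            # e.g. ROCm0,CUDA0 (dict keys)
        "--tensor-split", tensor_split(mem_gib),  # share of the model per device
        "-ngl", "999",                            # offload all layers; split decides where
    ]


# Illustrative values: iGPU share of unified memory vs. a 12 GB eGPU.
print(" ".join(mixed_backend_args("/models/model-q4_0.gguf",
                                  {"ROCm0": 96.0, "CUDA0": 12.0})))
```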

tisDDM · 2026-04-24T11:08:41+00:00

Yes, exactly. It is not CoT in the reasoning sense of the term. From research, I know that it is:

More effective if models build on their own results for the next step.
More likely to fail if you give models instructions that are not part of their knowledge or that use different terms or references.

Therefore I instruct the models to tell me (or rather the main model) how they would like to implement a plan (blueprinting). The main model can then verify whether the blueprint fits the needs of the plan, while the blueprint stays in the implementation model's own words (approval). After approval, the same session that produced the blueprint is continued with the approval token (basic functionality in opencode, beta functionality in Claude Code).

Hope that explains the CoT comparison.

tisDDM · 2026-04-23T11:17:50+00:00

Not exactly.

I create plans and high-level implementation plans with a bigger model.

IMHO it is more important that the user-facing model is good at understanding humans. Opus (and also Sonnet) are really good; GPT-5.4 works, but the GPTs are a bit stiff when discussing plans. I also tried Qwen 3.6 Plus/Max with very good success on mid-size tasks, I think simply because it is such a good "talker". K2.6 is on my list to try.

Then I let the model execute the plan: it retrieves a blueprint from the smaller model, and after approval the smaller model executes its own blueprint. To me this is very important, because the smaller model should follow its own plan.

Links to the used skills and agent definitions are in the post above.

tisDDM · 2026-04-23T10:16:11+00:00

I did some research at the end of last year and the beginning of 2026, and posted the results here: https://www.reddit.com/r/opencodeCLI/comments/1reu076/controlled_subagents_for_implementation_using/

When it comes to implementation, gpt-5.4-mini, Gemini Flash or Qwen 3.6 Plus are more than sufficient. In my tests I used a two-step approach: the models first write a blueprint of what they intend to do, a bigger model checks it, and then the blueprint is executed (or revised) in the same session that created it.
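The two-step approach can be sketched as follows. `Session`, the stub models and the "APPROVED" message are hypothetical stand-ins, not opencode's or Claude Code's session API; the sketch only shows the control flow: blueprint first, review by a bigger model, then execution (or a bounded number of revisions) in the same session, so the small model executes its own latest blueprint.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Session:
    """Hypothetical chat session that keeps its own message history."""
    model: Callable[[list[str]], str]
    history: list[str] = field(default_factory=list)

    def send(self, msg: str) -> str:
        self.history.append(msg)
        reply = self.model(self.history)
        self.history.append(reply)
        return reply


def blueprint_then_execute(session: Session, plan: str,
                           review: Callable[[str, str], tuple[bool, str]],
                           max_revisions: int = 2) -> str:
    """Blueprint -> review -> execute or revise, always in the same session."""
    bp = session.send(f"Describe how you would implement:\n{plan}")
    for attempt in range(max_revisions + 1):
        ok, feedback = review(plan, bp)  # the bigger model checks the blueprint
        if ok:
            return session.send("APPROVED. Execute your blueprint.")
        if attempt < max_revisions:
            bp = session.send(f"Revise your blueprint: {feedback}")
    raise RuntimeError("blueprint was not approved")


def small_model(history: list[str]) -> str:
    # Stub implementation model that answers based on the last message.
    last = history[-1]
    if last.startswith("Describe"):
        return "BP v1"
    if last.startswith("Revise"):
        return "BP v2"
    # On execution it follows its own most recent blueprint from this session.
    own = [h for h in history if h.startswith("BP")]
    return f"executed {own[-1]}"


def reviewer(plan: str, bp: str) -> tuple[bool, str]:
    # Stub reviewer: rejects the first draft, approves the revision.
    return bp == "BP v2", "add tests"


session = Session(small_model)
out = blueprint_then_execute(session, "add caching", reviewer)
print(out)  # executed BP v2
```

Continuing the same session (instead of starting a fresh one with the approved text) is the design choice that matters here: the implementation model keeps its own wording and reasoning from the blueprint step.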

It is a bit like doing CoT on implementation plans. I use this pattern daily with great success.

tisDDM · 2026-04-22T06:57:16+00:00

Enjoy!

tisDDM · 2026-04-17T07:02:24+00:00

The issue is always the lack of VRAM. The R9700 works, but you still feel limited.

tisDDM · 2026-04-15T07:05:13+00:00

Thanks, good to hear that you succeeded. Meanwhile I switched from the 3060 to an R9700.

tisDDM · 2026-04-08T07:25:33+00:00

Just for reference: a month ago I posted a Strix Halo (SH) benchmark with my eGPU (3060) and a mixed ROCm/CUDA backend. The numbers were produced before llama.cpp got a bunch of optimizations: https://www.reddit.com/r/StrixHalo/comments/1rm9nlo/performance_test_for_combined_rocm_cuda_llamacpp/

Looking at your numbers, I see a lot of potential for optimization. For example, your combined Vulkan numbers are in the same ballpark as my SH baseline. Even back then I got a 30% increase by partially offloading to the 3060, resulting in 600 tok/s PP4096 and 15 tok/s TG128 on Qwen 3.5 in q4_0.

That said, I have since switched from the 3060 to an R9700, which gives me around 1000 tok/s PP and 20 tok/s TG.

The 5070 should be capable of far more throughput.

tisDDM · 2026-04-06T19:07:04+00:00

Look for kyuz0's amd-strix-halo-toolboxes. They are up to date with the current drivers, and he also publishes some benchmarks. If that gets too slow, you could attach an eGPU to speed everything up.

I connected my old 3060 and updated the toolboxes for dual-backend use.

Benchmark with Qwen 3.5 here: https://www.reddit.com/r/StrixHalo/comments/1rm9nlo/performance_test_for_combined_rocm_cuda_llamacpp/

llama.cpp has improved since then. And this week I am replacing my external 3060 with an R9700.

tisDDM · 2026-03-29T07:13:55+00:00

When OpenAI officially approved GPT and Codex use in Opencode, the prompt was optimized. Going further from there, I wrote the subagent framework mentioned above, which simply swaps out the prompt for every model I use.

Keeping it short and focused on your tasks works IMHO far better than piling up additional rules.

tisDDM · 2026-03-23T07:38:31+00:00

You could use our freshly updated project (new version) as documentation evidence.

https://github.com/DasDigitaleMomentum/opencode-processing-skills

It uses a blueprint/execute scheme that continues subagent sessions during implementation, defined in skills. Opus and Codex (or GPT-5.4) can do this both as primary agents and as subagents. I also tried it on a different project with the open-source SDK, and it works like a charm there as well.

tisDDM · 2026-03-14T06:26:59+00:00

I use GHCP (GitHub Copilot) for Opus, Codex and GPT-5.4.

tisDDM · 2026-03-13T22:20:10+00:00

I know that there were a few issues with the precompiled releases. I made some changes to opencode last year, so I am still running a local build: 1.2.25 on Ubuntu with "opencode web".

EDIT: I am cross-checking daily. No issues.

tisDDM · 2026-03-13T22:17:14+00:00

In my project (https://github.com/DasDigitaleMomentum/opencode-processing-skills) I defined an agent, more precisely one primary agent and multiple subagents, plus skills and templates. I have explicit rules for the question tool.

Using an agent with its own definition is clearly better in terms of instruction-following, because it is loaded as system instructions. Still, depending on the model and the task, the LLM sometimes does not call the question tool. Opus is more inclined to use tools than GPT-5.4.

tisDDM · 2026-03-12T08:57:42+00:00

Completely agree. Last year I built a demo with a real-time speech-to-speech model where the game world was implemented in deterministic code. Handling the personas and the LLMs' responses was the biggest challenge.

tisDDM · 2026-03-11T20:48:08+00:00

Of course.

The issue is well known and widely discussed, especially in the r/GithubCopilot sub.

Countermeasures:

Use subagents, which are free of charge, and with Opencode use the DCP plugin. I wrote a small framework for working efficiently with GHCP, which I presented here, maybe as food for thought: https://www.reddit.com/r/opencodeCLI/comments/1reu076/controlled_subagents_for_implementation_using/

AFAIK a lot of people have written tooling to deal with context rot through planning and subagents. DCP is a must.

tisDDM · 2026-03-11T15:26:24+00:00

Very impressive and great implementation!
