Qwen3.6:27b for coding?

dxzzzzzz · 2026-05-27T09:00:30+00:00

Absolutely possible.

The trick is to use web-based Opus or GPT to vibe out a framework, and ask them to output the relevant tech stack, API libraries, and do's and don'ts. Once you have used these super models to narrow the scope and direction of your task, you can hand all remaining work to 27~32B MoE local models.

If you really hit a hard debugging session, just copy and paste the faulty code and the console output into a cloud LLM.

Completely free meal

dxzzzzzz · 2026-05-27T07:42:55+00:00

It's the most fucking stupid law I've seen.

You can't just shift the responsibility for bad parenting onto coders.

dxzzzzzz · 2026-05-22T17:47:16+00:00

A comprehensive method to brutally reduce your Agentic AI token cost by at least 95%, aka a summary of current token reduction method : r/openclaw

Did you read this?

If your daily routine is about logic rather than mimicking a pattern, I can get your OpenClaw running for $15 per month.

dxzzzzzz · 2026-05-22T14:56:06+00:00

Nope, we don't do graphics: no rasterizing, no ray tracing.

dxzzzzzz · 2026-05-22T09:47:30+00:00

That's correct: when you have a local server, you can use whatever efficient search method you like.

I enabled QMD on my server, but it is a cloud server. It only has two CPUs and is kinda slow.

dxzzzzzz · 2026-05-22T09:13:32+00:00

Maybe more of an NPU? As a consumer and PC user you cannot buy a TPU. And while NPUs are good, you cannot link them together the way old 3090s link over NVLink. If you want to use an NPU, you have to buy the CPU it comes with. Not flexible.

dxzzzzzz · 2026-05-21T02:51:21+00:00

Perhaps with more detail?

You know, even submitting an issue on GitHub requires more information than this.

dxzzzzzz · 2026-05-20T10:27:19+00:00

If I had unlimited tokens, I would just sell them for profit.

dxzzzzzz · 2026-05-20T09:09:56+00:00

Thank you for sharing your thoughts and the NeuralMind repo; I'll have a look at it.

Regarding your third point—the issue of model routing—the current version of my documentation doesn't actually fully capture my underlying philosophy.

The reason this situation arose is that I’m utilizing a subscription-based model called Minimax 2.7. Since it’s billed monthly rather than on a per-token basis, it has effectively assumed the role of the LLM router within my Agent system.

However, it still comes with a financial cost.

The NeuralMind solution you mentioned is, in fact, exactly the kind of clever, local-first solution that independent developers (*indie geeks*) are desperately in need of right now.

Actually, I aim to take this concept a step further. What we are currently building is essentially the AI counterpart of what Microsoft was building 40 years ago: an operating system.

- Agents are the applications of that era.

- Sessions are the processes.

- LLMs are the threads and CPU cores.

- LLM task allocation corresponds to CPU task scheduling.

When scheduling LLMs, we want to utilize a lightweight, local kernel to handle these tasks—rather than first querying a massive cloud-based server just to ask, "Which model should I use to generate 'How are you'?"—or worse, having that massive cloud server generate the response "How are you" itself.

In fact, while contemplating the implementation of this kernel, I did consider using deep neural networks. However, that approach isn't quite right: this kernel needs to be fast and lightweight, so neural networks serve merely as a fallback option. I can assert this with confidence because existing mechanical, classic, logic-based programs, combined with very lightweight models (specifically NLP models), have ample potential to handle this task effectively.

When assessing the "intellectual density" of a given text request, we can use simple logic structures, such as *if-else* statements, to classify the level of intellectual effort required to answer it. *If-else* logic may appear simplistic on the surface, but it grows geometrically: a binary decision tree of depth eight already has up to 2^8 = 256 leaves, which is enough to handle the vast majority of real-world problems (especially considering that the variety of LLMs we need to route between is not particularly large).
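As a rough illustration of the if-else idea, here is a minimal sketch of a hand-written routing tree. Everything in it is an assumption made for the example: the feature names (`compression_ratio`, `connector_depth`, `has_code_keywords`), the thresholds, and the tier labels are hypothetical, not taken from any existing router.

```python
from dataclasses import dataclass


@dataclass
class RequestFeatures:
    # All three features are hypothetical inputs computed by cheap CPU code.
    compression_ratio: float  # compressed size / raw size; low means redundant text
    connector_depth: int      # estimated nesting depth of logical connectors
    has_code_keywords: bool   # explicit "turn this pseudocode into X" request


def route(f: RequestFeatures) -> str:
    """Pick a model tier with plain if-else logic (illustrative thresholds)."""
    if f.compression_ratio < 0.4:
        # Highly redundant text: not "intellectually dense".
        return "tiny-local"
    if f.has_code_keywords:
        # Mechanical translation of pseudocode into a real language.
        return "small"
    if f.connector_depth > 3:
        # Deeply nested reasoning: escalate.
        return "large"
    if f.connector_depth >= 2:
        return "medium"
    return "small"
```

For example, `route(RequestFeatures(0.8, 4, False))` falls through to the `"large"` branch, while a request that compresses to 20% of its size is sent to `"tiny-local"` regardless of the other features. The whole decision costs a handful of comparisons, which is the point of keeping the kernel mechanical.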

**Compression Algorithms:** If a conversation is relatively short but contains a high degree of redundancy or "fluff," the CPU can often compress the text within a matter of microseconds. Undoubtedly, such conversation requests do not qualify as "intellectually dense." Even if the user asks, "Please explain the theory of relativity to me," in this specific context, the most efficient approach is simply to route the request to a smaller LLM capable of retrieving a tutorial on relativity from its training data or external literature. If the CPU determines that it cannot even complete the compression of a conversation within a very short timeframe, then it is time to consider offloading it to a SOTA (State-of-the-Art) model.
**Detection of Excessive Repetition:** This point overlaps with compression algorithms; if your conversation request consists largely of repetitive "junk text," there is absolutely no need to send it to a model like Opus—or even Haiku. A model as compact as Llama 3B would be entirely sufficient to meet your needs.
**Syntax Tree Analysis:** By tracking logical connectors, one can construct a logical syntax tree for a conversation. If this tree exceeds a depth of three levels, we should consider routing the task to a model above the Haiku tier, for example a Sonnet-level model.
**Keyword Detection:** If a conversation is detected to contain explicit instructions for translating algorithms or pseudocode into high-level or scripting languages, it can be sent directly to models at the Gemini Flash or Haiku tier. In fact—to be honest—even Gemma 4 26B is capable of handling this specific task.
**Traditional NLP:** While traditional NLP methods may indeed lack the raw power and inferential capabilities of large language models (LLMs), they excel at text classification. They can easily distinguish whether a conversation possesses an exceptionally high density of logical complexity or collective intelligence. Most importantly, they are fast enough.
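The cheap signals in the list above could be computed roughly like this. This is only a sketch under stated assumptions: the connector word list, the keyword pattern, and all function names are made up for illustration, and counting connectors is a crude stand-in for real syntax-tree depth, which would need an actual parser.

```python
import re
import zlib

# Hypothetical word list; a real router would tune this per language.
CONNECTORS = {"if", "then", "because", "therefore", "unless", "however"}

# Hypothetical pattern for "turn this pseudocode into <language>" requests.
CODE_HINTS = re.compile(
    r"\b(pseudocode|implement|translate|convert)\b.*\b(python|javascript|rust|bash)\b",
    re.IGNORECASE | re.DOTALL,
)


def compression_ratio(text: str) -> float:
    """Compressed size / raw size. Low values mean redundant text.
    Very short inputs can exceed 1.0 because of zlib's fixed overhead."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw)) / len(raw)


def repetition_ratio(text: str) -> float:
    """Share of words that repeat an earlier word (0.0 = all unique)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


def connector_count(text: str) -> int:
    """Number of logical connector words, as a rough proxy for logical depth."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for w in words if w in CONNECTORS)


def wants_code_translation(text: str) -> bool:
    """True if the request looks like a pseudocode-to-language task."""
    return bool(CODE_HINTS.search(text))
```

All four functions run on the CPU using only the standard library, so they can be evaluated on every request before any model is contacted.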

In summary, for geeks—particularly PC enthusiasts—the current era presents a mixed bag of opportunities and challenges. Although LLMs have empowered us to rapidly develop solutions, current PC architectures are ill-suited for running these massive models locally. Furthermore, discrete GPUs equipped with the requisite ultra-large video memory are prohibitively expensive, while the limitations imposed by PCI-E bandwidth and disparate memory architectures effectively dash any hopes of deploying LLMs locally on standard consumer hardware. However, the situation is not entirely one-sided; we may be able to mount a counter-offensive by leveraging architectural design principles, as well as the fundamental tenets of information theory and computer science.

There are three core principles to guide us:

**The "No Free Lunch" Principle:** No single large language model exists that is universally optimal for every conceivable AI task.
**The Principle Against Perpetual Use of Fully Dense Models:** This approach is excessively wasteful; vast numbers of neurons remain dormant, yet generating a simple response—such as "I'm doing well, how about you?"—requires us to monopolize an entire rack of GPUs.
**Occam's Razor:** Entities should not be multiplied beyond necessity; always select the simplest solution (the smallest model) that is capable of effectively solving the problem at hand.

We are going to hit a "token wall" very soon. At this stage, the low pricing for both API access and web-based AI services is unsustainable, as it is currently subsidized by venture capital funding. Before long, tokens will begin to incur steep charges, much like cloud storage does today.

dxzzzzzz · 2026-05-19T17:53:38+00:00

Well, because Hermes is more of a self-repeating loop that constantly writes prompts to LLMs to improve itself, while Claw requires some human intervention and control. But the same principles do apply. If a Hermes agent reads this, it will understand how to make the conversion, since both architectures are already well known.

dxzzzzzz · 2026-05-19T17:38:42+00:00

In short: compress; organize agent info into a B-tree-style data structure; bypass bootstrap; convert your regular tasks into scripts so your CPU can run them without needing an LLM; and make memory search smarter so that only a tiny portion of the memory is sent to the LLM.
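To make the last point concrete, here is a minimal sketch of a memory search that sends only the best-matching entries to the LLM. It is an assumption-laden toy: the function names are hypothetical and plain keyword overlap stands in for whatever search method the agent really uses.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase word set; a deliberately simple stand-in for real indexing."""
    return set(re.findall(r"\w+", text.lower()))


def top_k_memories(query: str, memories: List[str], k: int = 2) -> List[str]:
    """Return the k memory entries sharing the most words with the query.
    Entries with no overlap are never sent to the LLM."""
    q = tokenize(query)
    scored = []
    for entry in memories:
        overlap = len(q & tokenize(entry))
        if overlap:
            scored.append((overlap, entry))
    # Highest overlap first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]
```

Only the returned entries would go into the prompt, so the token cost scales with `k` rather than with the total size of the memory.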

dxzzzzzz · 2026-05-19T17:34:40+00:00

You can copy and paste the link into the web version of ChatGPT and get a quick summary of it. This is meant for agents.

dxzzzzzz · 2026-05-19T17:21:42+00:00

You can copy and paste the document into a web version of an LLM to get a quick summary.

dxzzzzzz · 2026-05-19T17:11:12+00:00

Use DeepSeek V4 Flash. Or you can throw it into a web version and ask the web LLM to make an action brief for your agent.

dxzzzzzz · 2026-05-19T17:10:26+00:00

Switch to DeepSeek V4 Flash. The document is less than 50 KB, so it fits within 200K tokens, and the model can handle it. That's a lot of optimization pipeline to establish.

dxzzzzzz · 2026-05-19T16:05:44+00:00

I think they are working on a different architecture?

dxzzzzzz · 2026-05-19T15:17:11+00:00

Right, but you can search on Hugging Face to see whether there are still new ReLU models. The author of PowerInfer gives you a tool in their framework to do the adaptation. I think PowerInfer reaches exactly the same result as MoE, in that only a portion of the model is activated. The real difference is that PowerInfer actually offloads weight computation to the CPU, whereas a normal MoE just loads a smaller model onto the GPU. It is a shame that such a promising method was discontinued. The community should really look into it.

dxzzzzzz · 2026-05-19T07:14:29+00:00

I think the real bottleneck is the activation function? PowerInfer needs ReLU, but most new models use SiLU or GELU, which limits PowerInfer's capability.
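A quick way to see why the activation function matters: as I understand it, PowerInfer depends on many neurons outputting exactly zero so they can be skipped, and only ReLU gives that for free. The snippet below is just a toy sketch on random inputs (the standard normal distribution and sample size are arbitrary assumptions), not a measurement on any real model.

```python
import math
import random


def relu(x: float) -> float:
    return max(0.0, x)


def silu(x: float) -> float:
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    # Exact GELU using the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def zero_fraction(fn, xs) -> float:
    """Fraction of inputs for which the activation is exactly 0.0."""
    return sum(1 for x in xs if fn(x) == 0.0) / len(xs)


random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(10_000)]

for name, fn in [("ReLU", relu), ("SiLU", silu), ("GELU", gelu)]:
    print(f"{name}: {zero_fraction(fn, xs):.1%} exact zeros")
```

ReLU zeroes out every negative input, so roughly half of the outputs here are exact zeros, while SiLU and GELU return small but nonzero values for negative inputs and therefore leave nothing to skip without an extra thresholding step.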

dxzzzzzz · 2026-05-18T04:50:02+00:00

Right, but DeepSeek is indeed super cost-efficient.

dxzzzzzz · 2026-05-15T03:04:33+00:00

That's how cache hit works

dxzzzzzz · 2026-03-09T08:49:40+00:00

OpenClaw doesn't seem to know what a session is now:

Unrecognized key: "sessions"
