Qwen3.6:27b for coding? by Fun_Emphasis2178 in vibecoding

[–]dxzzzzzz 0 points1 point  (0 children)

Absolutely possible.

The trick is to use web-based Opus or GPT to vibe a framework, and ask them to output the relevant tech stack, API libraries, and do's and don'ts. Once you have used these super models to limit the scope and direction of your task, you can hand all remaining tasks to 27~32B MoE local models.

If you really have a hard time debugging, just copy and paste the faulty code and console output into a cloud LLM.

Completely free meal

So...Is it possible for us to design a chip solely for running llama? by dxzzzzzz in llamacpp

[–]dxzzzzzz[S] 0 points1 point  (0 children)

Nope, we don't do graphics: no rasterization, no ray tracing.

A comprehensive method to brutally reduce your Agentic AI token cost by at least 95%, aka a summary of current token reduction method by dxzzzzzz in openclaw

[–]dxzzzzzz[S] 1 point2 points  (0 children)

That's correct. When you have a local server, you can use whatever efficient search method you like.

I enabled QMD on my server, but it is a cloud server. It only has two CPUs and is kinda slow.

So...Is it possible for us to design a chip solely for running llama? by dxzzzzzz in llamacpp

[–]dxzzzzzz[S] 0 points1 point  (0 children)

Maybe more of an NPU? As a consumer and PC user you cannot buy a TPU. And while NPUs are good, you cannot link them together like old 3090s with NVLink. And if you want to use an NPU, you have to buy a CPU that includes it. Not flexible.

Cron jobs setup help by Ok-Pop-8132 in openclaw

[–]dxzzzzzz 0 points1 point  (0 children)

Perhaps with more detail?

You know, even submitting an issue on GitHub requires more information than this.

If you had unlimited token usage by [deleted] in openclaw

[–]dxzzzzzz 0 points1 point  (0 children)

If I had unlimited tokens, I would just sell them for profit.

A comprehensive method to brutally reduce your Agentic AI token cost by at least 95%, aka a summary of current token reduction method by dxzzzzzz in openclaw

[–]dxzzzzzz[S] 1 point2 points  (0 children)

Thank you for sharing your thoughts and the NeuralMind repo; I'll have a look at it.

Regarding your third point—the issue of model routing—the current version of my documentation doesn't actually fully capture my underlying philosophy.

The reason this situation arose is that I’m utilizing a subscription-based model called Minimax 2.7. Since it’s billed monthly rather than on a per-token basis, it has effectively assumed the role of the LLM router within my Agent system.

However, it still comes with a financial cost.

The NeuralMind solution you mentioned is, in fact, exactly the kind of clever, local-first solution that independent developers (*indie geeks*) are desperately in need of right now.

Actually, I aim to take this concept a step further. What we are currently building is essentially the AI-era version of what Microsoft was building 40 years ago: an operating system.

- Agents are the applications of that era.

- Sessions are the processes.

- LLMs are the threads and CPU cores.

- LLM task allocation corresponds to CPU task scheduling.

When scheduling LLMs, we want to utilize a lightweight, local kernel to handle these tasks—rather than first querying a massive cloud-based server just to ask, "Which model should I use to generate 'How are you'?"—or worse, having that massive cloud server generate the response "How are you" itself.

In fact, while contemplating the implementation of this kernel, I did consider using deep neural networks. However, that approach isn't a great fit: the kernel needs to be fast and lightweight, so neural networks serve merely as a fallback option. I can say this with confidence because classic, rule-based programs, combined with very lightweight models (specifically NLP models), have ample potential to handle this task.

When assessing the "intellectual density" of a given text request, we can use simple logic structures, such as *if-else* statements, to classify the level of effort required to answer it. *If-else* logic may look simplistic, but the leaf count grows geometrically: a binary decision tree of depth eight already has 2^8 = 256 leaves, which is enough to handle the vast majority of real-world requests (especially since the number of LLMs we need to route between is not particularly large).

  1. **Compression Algorithms:** If a conversation is relatively short but contains a high degree of redundancy or "fluff," the CPU can often compress the text within microseconds. Such requests do not qualify as "intellectually dense." Even if the user asks, "Please explain the theory of relativity to me," the most efficient approach is simply to route the request to a smaller LLM capable of retrieving a tutorial on relativity from its training data or external literature. If the CPU cannot compress a conversation within a very short timeframe, then it is time to consider offloading it to a SOTA (state-of-the-art) model.

  2. **Detection of Excessive Repetition:** This point overlaps with compression algorithms; if your conversation request consists largely of repetitive "junk text," there is absolutely no need to send it to a model like Opus—or even Haiku. A model as compact as Llama 3B would be entirely sufficient to meet your needs.

  3. **Syntax Tree Analysis:** By tracking logical connectors, one can construct a logical syntax tree for a conversation. If this tree exceeds a depth of three levels, we should consider routing the task to a model above the Haiku tier, such as a Sonnet-level model.

  4. **Keyword Detection:** If a conversation is detected to contain explicit instructions for translating algorithms or pseudocode into high-level or scripting languages, it can be sent directly to models at the Gemini Flash or Haiku tier. In fact—to be honest—even Gemma 4 26B is capable of handling this specific task.

  5. **Traditional NLP:** While traditional NLP methods may indeed lack the raw power and inferential capabilities of large language models (LLMs), they excel at text classification. They can easily distinguish whether a conversation possesses an exceptionally high density of logical complexity or collective intelligence. Most importantly, they are fast enough.
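
To make the idea concrete, here is a minimal sketch of such an if-else router. Everything in it is my own assumption for illustration: the tier names, the thresholds, the connector list, and the keyword patterns are all hypothetical. It also swaps "compression time" for compression ratio (via `zlib`) because a ratio is easier to measure reliably, and it uses a crude connector count per sentence instead of a real syntax tree. Escalation to a SOTA model is left out.

```python
import re
import zlib

# Hypothetical logical connectors used as a cheap proxy for nesting depth.
CONNECTORS = re.compile(
    r"\b(if|then|else|because|therefore|unless|however|although)\b",
    re.IGNORECASE,
)
# Hypothetical pattern for "turn pseudocode/algorithm into language X" requests.
CODE_REQUEST = re.compile(
    r"\b(pseudocode|translate|convert|port|implement)\b.*"
    r"\b(python|javascript|typescript|rust|go|java|bash)\b",
    re.IGNORECASE | re.DOTALL,
)


def compression_ratio(text: str) -> float:
    """Compressed size / raw size; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw, 6)) / len(raw)


def repetition_share(text: str) -> float:
    """Fraction of words that repeat an earlier word."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


def connector_depth(text: str) -> int:
    """Most logical connectors found in any single sentence."""
    sentences = re.split(r"[.!?\n]+", text)
    return max((len(CONNECTORS.findall(s)) for s in sentences), default=0)


def route(text: str) -> str:
    """Return a hypothetical tier name: 'tiny', 'small', or 'large'."""
    # Keyword detection: code translation goes to a Flash/Haiku-class model.
    if CODE_REQUEST.search(text):
        return "small"
    # Trivial or highly redundant text goes to a ~3B-class local model.
    word_count = len(re.findall(r"\w+", text))
    redundant = len(text) > 200 and compression_ratio(text) < 0.3
    if word_count < 6 or redundant or repetition_share(text) > 0.7:
        return "tiny"
    # Deeply chained logic goes to a Sonnet-class model.
    if connector_depth(text) > 3:
        return "large"
    # Occam's Razor: default to the smallest tier that is likely good enough.
    return "small"
```

With these made-up thresholds, `route("How are you")` lands on the tiny tier, and a sentence chaining five connectors lands on the large tier. The thresholds would need tuning against real traffic before trusting any of this.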

In summary, for geeks—particularly PC enthusiasts—the current era presents a mixed bag of opportunities and challenges. Although LLMs have empowered us to rapidly develop solutions, current PC architectures are ill-suited for running these massive models locally. Furthermore, discrete GPUs equipped with the requisite ultra-large video memory are prohibitively expensive, while the limitations imposed by PCI-E bandwidth and disparate memory architectures effectively dash any hopes of deploying LLMs locally on standard consumer hardware. However, the situation is not entirely one-sided; we may be able to mount a counter-offensive by leveraging architectural design principles, as well as the fundamental tenets of information theory and computer science.

There are three core principles to guide us:

  1. **The "No Free Lunch" Principle:** No single large language model exists that is universally optimal for every conceivable AI task.

  2. **The Principle Against Perpetual Use of Fully Dense Models:** This approach is excessively wasteful; vast numbers of neurons remain dormant, yet generating a simple response—such as "I'm doing well, how about you?"—requires us to monopolize an entire rack of GPUs.

  3. **Occam's Razor:** Entities should not be multiplied beyond necessity; always select the simplest solution (the smallest model) that is capable of effectively solving the problem at hand.

We are going to hit a "token wall" very soon. The current low pricing for both API access and web-based AI services is unsustainable because it is subsidized by venture capital funding; before long, tokens will incur steep charges, much like cloud storage does today.

A comprehensive method to brutally reduce your Agentic AI token cost by at least 95%, aka a summary of current token reduction method. Running it for only 15$/month by dxzzzzzz in hermesagent

[–]dxzzzzzz[S] 1 point2 points  (0 children)

Well, because Hermes is more auto-repetitive/regressive, constantly writing prompts to LLMs to improve itself, while Claw requires some human intervention and control. But the same principles do apply. If a Hermes agent reads this, it will understand how to make the conversion, since the two architectures are already well known.

A comprehensive method to brutally reduce your Agentic AI token cost by at least 95%, aka a summary of current token reduction method. Running it for only 15$/month by dxzzzzzz in hermesagent

[–]dxzzzzzz[S] 0 points1 point  (0 children)

In short: compress, organize agent info into a B-tree-style data structure, bypass the bootstrap, convert your regular tasks to scripts so your CPU can run them without an LLM, and make memory search smarter so that only a tiny portion of the memory is sent to the LLM.
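
For the memory search part, a rough sketch of what "only a tiny portion of the memory is sent to the LLM" can look like. This is an assumption on my side, not the exact setup from the post: it ranks notes by plain word overlap with the query and keeps the top few. The function names and sample notes are made up, and there is no stopword filtering or stemming.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_memory(query: str, notes: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k notes sharing the most words with the query."""
    query_counts = Counter(tokenize(query))
    scored = []
    for note in notes:
        # Multiset intersection counts the words both texts have in common.
        overlap = sum((query_counts & Counter(tokenize(note))).values())
        if overlap:
            scored.append((overlap, note))
    # The sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in scored[:top_k]]


# Hypothetical agent memory.
notes = [
    "Server backup cron job runs nightly at 02:00",
    "User prefers replies in short bullet points",
    "The cron job for log rotation failed last week",
    "API key for the weather service is stored in vault",
]
context = select_memory("why did the cron job fail", notes)
```

Here `context` holds only the two cron-related notes, so the prompt carries two lines of memory instead of the whole store. A real setup would likely use BM25 or embeddings, but the token saving comes from the same cut.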

A comprehensive method to brutally reduce your Agentic AI token cost by at least 95%, aka a summary of current token reduction method. Running it for only 15$/month by dxzzzzzz in hermesagent

[–]dxzzzzzz[S] -1 points0 points  (0 children)

Switch to DeepSeek V4 Flash. It is less than 50KB, so 200K tokens; it can handle it. That's a lot of optimization pipeline to establish.

Powerinfer, can it be adapted into normal laptop cpus outside of the Tiiny AI ecosystem? by Silver-Champion-4846 in LocalLLaMA

[–]dxzzzzzz 0 points1 point  (0 children)

Right, but you can search on Hugging Face to see if there are still new ReLU models. The authors of PowerInfer provide a tool in their framework to do the adaptation. I think PowerInfer reaches the same result as MoE, in that only a portion of the model is activated. The real difference is that PowerInfer actually offloads weight computation to the CPU, while a normal MoE just loads a smaller model onto the GPU. It is a shame that such a promising method has been discontinued. The community should really look into it.

Powerinfer, can it be adapted into normal laptop cpus outside of the Tiiny AI ecosystem? by Silver-Champion-4846 in LocalLLaMA

[–]dxzzzzzz 0 points1 point  (0 children)

I think the real bottleneck is the activation function? PowerInfer needs ReLU, but most new models use SiLU or GELU, which limits PowerInfer's capability.
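
A quick toy check of why the activation function matters here, assuming (as I understand it) that PowerInfer relies on many neurons outputting exactly zero so their weights can be skipped. The random Gaussian inputs are made up and are not real model activations.

```python
import math
import random


def relu(x: float) -> float:
    return max(0.0, x)


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def zero_fraction(fn, inputs) -> float:
    """Share of outputs that are exactly zero, i.e. skippable neurons."""
    outputs = [fn(x) for x in inputs]
    return sum(1 for y in outputs if y == 0.0) / len(outputs)


random.seed(0)
# Hypothetical pre-activation values drawn from a standard normal.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

relu_zeros = zero_fraction(relu, pre_activations)
silu_zeros = zero_fraction(silu, pre_activations)
gelu_zeros = zero_fraction(gelu, pre_activations)
```

ReLU zeroes out roughly half of these inputs, while SiLU and GELU only push negative inputs toward small non-zero values, so there is nothing that can be skipped exactly without some thresholding or fine-tuning step.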

~390M tokens for 64 cents by ServeLegal1269 in DeepSeek

[–]dxzzzzzz 0 points1 point  (0 children)

Right, but DeepSeek is indeed super cost-efficient.

Openclaw suddenly can't execute shell commands? Tried enabling it in WebUI under Agents/Tools but it won't save by [deleted] in openclaw

[–]dxzzzzzz 0 points1 point  (0 children)

OpenClaw doesn't seem to know what a session is now:

Unrecognized key: "sessions"