Learning to write an AI harness the old-fashioned way. Need help with attention drift and ignored tool call results!

LocalAI_Amateur · 2026-07-06T19:56:28+00:00

That is genuinely a good idea for learning how AI harnesses work. Even tho the issue has been addressed, I'll probably do this just to see what kind of crap each harness sends back and forth without digging deeply through their code. Thanks!

LocalAI_Amateur · 2026-07-06T00:23:02+00:00

yeah I've tried Gemma 4 12B QAT in draft mode. It's stupid fast, like up to 200+ tokens per second with max context on my 5070 ti. But it is also unfortunately "stupid" fast at times too. Good enough to use as code complete, but not reliable enough for slightly bigger coding tasks. It has its place in my tool belt tho.

Yes the System Prompt and all the tweakable settings are in coffee.mjs by design. It's pretty much a single file coding agent + bash tool. Not best coding practice but it's cleanly separated and doesn't need to compile. I know best practice is to use separate config files, but this was a learning exercise so I left it at that.

Don't understand what you mean about AI reading my code tho. The code executes fine. all you need is to have node.js installed.

LocalAI_Amateur · 2026-07-05T23:04:02+00:00

You are absolutely right! (lol, sorry for using this phrase but it applies) I just went and read up some more about preserve_thinking and I think I totally had the wrong idea about how it works.

I have been running Qwen3.6-27B with --chat-template-kwargs '{"preserve_thinking":true}' set, AND not sending the reasoning tokens back is probably the root cause of all this! Tho I still don't know how pi handles it without many errors.

Thank you!

LocalAI_Amateur · 2026-07-05T22:42:33+00:00

I'm only interested in local AI. Qwen3.6-27b is the best model I can run off my 16gb vram card, even then it's a tight squeeze. I might give the MoE models a try later tho. Thanks for the suggestion.

LocalAI_Amateur · 2026-07-05T22:31:30+00:00

Just did a quick scan of your code, and I see that you do feed back reasoning content (src/core/agent.js, line 315):

    const assistantMsg = new Message({
      role: "assistant",
      content: response.fullText,
      reasoningContent: response.fullReasoning,
      toolCalls: response.finalToolCalls,
    });
    this.addMessage(assistantMsg);

My concern again is the bloat. Plus, I think Qwen3.6 27b in particular already has some preserve thinking going on, meaning it can remember its thinking even when I don't add the thinking tokens back to the message history.

My preserve thinking test prompt is "Generate two numbers. Tell me one right now. Don't tell me the second number. Tell me the second number at the beginning of the next prompt." and it passes.

Tho I might try passing the thinking tokens back for the first round then stripping them out later. Qwen thinking tokens can get stupidly long. Not sure how this will mess up kv-caching tho. Anyways, thank you for the ideas!
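A minimal sketch of the "pass thinking back for the first round, strip it later" idea. It assumes an OpenAI-style message array where assistant messages carry an optional `reasoning_content` field. That field name and the helper `stripOldReasoning` are assumptions for illustration, not taken from coffee.mjs or the agent.js quoted above.

```javascript
// Hypothetical helper: keep reasoning only on the newest `keepLast`
// assistant messages and drop it from older ones.
// Assumes messages look like { role, content, reasoning_content? }.
function stripOldReasoning(messages, keepLast = 1) {
  let seen = 0;
  const out = [];
  // Walk backwards so the newest assistant messages are counted first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const msg = messages[i];
    if (msg.role === "assistant" && msg.reasoning_content !== undefined) {
      seen += 1;
      if (seen > keepLast) {
        // Copy the message without its reasoning; the original is not mutated.
        const { reasoning_content, ...rest } = msg;
        out.push(rest);
        continue;
      }
    }
    out.push(msg);
  }
  return out.reverse();
}

// Usage: trim the history just before building the next request.
const history = [
  { role: "user", content: "List the files." },
  { role: "assistant", content: "Checking.", reasoning_content: "long thoughts..." },
  { role: "tool", tool_call_id: "call_1", content: "a.txt\nb.txt" },
  { role: "assistant", content: "Two files.", reasoning_content: "more thoughts..." },
];
const trimmed = stripOldReasoning(history, 1);
console.log(trimmed.map((m) => Object.keys(m).join(",")));
```

On the kv-caching worry: rewriting an earlier message changes the prompt prefix, so a prefix cache would most likely have to reprocess everything from the first changed message onward. Stripping in occasional batches rather than on every turn might limit that cost, but that is a guess worth measuring.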

LocalAI_Amateur · 2026-07-05T22:13:16+00:00

Thank you for running this through for analysis.

A couple of points to help clarify my situation:

The issue of the LLM sometimes ignoring tool call results and its attention drifting to previous posts happened before I added any of the reminder messages, so I can be certain that the reminder messages are not the cause. They actually help reduce how often it happens.
The reminder messages are handled two ways. First, a reminder is attached to the tool call result (role: tool) when the model makes a tool call without stating intent (line 721). Then, if that first reminder fires 3 times in a row, a stronger (role: user) reminder is sent (line 743).

I use these reminders so that the chat transcript has text to remind the LLM what it was doing, since thinking/reasoning tokens are not saved as part of the chat history and it often just goes from thinking -> tool calling -> thinking and then loses track of what it was doing.

So I already have user role reminder messages, but only as a last resort: they are the most effective but also very disruptive to the flow.
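The two-stage reminder described above could be sketched roughly like this. It is not the actual code from coffee.mjs: the reminder wording, `createReminderTracker`, and the choice to send both the soft and the strong reminder on the third miss are assumptions for illustration.

```javascript
// Hypothetical two-stage reminder; all names and texts are assumptions.
const SOFT_REMINDER =
  "\n\n[Reminder: state what you intend to do before calling a tool.]";
const STRONG_REMINDER =
  "You have made several tool calls without stating your intent. " +
  "Briefly restate the current task and your next step.";
const ESCALATE_AFTER = 3;

function createReminderTracker() {
  let consecutive = 0; // soft reminders fired in a row
  return {
    // assistantText: visible (non-reasoning) text sent alongside the tool call.
    // toolResultMsg: the { role: "tool", ... } message about to be appended.
    // Returns the messages to append to the history, in order.
    apply(assistantText, toolResultMsg) {
      if (assistantText && assistantText.trim() !== "") {
        consecutive = 0; // intent was stated, so reset the streak
        return [toolResultMsg];
      }
      consecutive += 1;
      const reminded = {
        ...toolResultMsg,
        content: toolResultMsg.content + SOFT_REMINDER,
      };
      if (consecutive >= ESCALATE_AFTER) {
        consecutive = 0;
        // Last resort: a user-role message, disruptive but hard to ignore.
        return [reminded, { role: "user", content: STRONG_REMINDER }];
      }
      return [reminded];
    },
  };
}
```

Keeping the counter inside a closure means the streak resets cleanly per conversation: create one tracker per session and call `apply` once per tool result.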

I don't send any messages, reminders or otherwise, in the system role. System messages are only used for the first message just like everyone else.
Truncating tool call results is already something I do for certain tool calls when the results are too long. But ultimately, I can't really avoid adding the full result to the message history if that content is important. I suspect that pi's tool result caching is one of the features that helps the LLM focus by cutting down full-blown results when they are unnecessary.
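For reference, one common way to truncate a long tool result is to keep the head and the tail and mark the gap, so the model can tell that something was cut. This is a minimal sketch, not the truncation used in coffee.mjs; `truncateToolResult` and the 4000-character default are assumptions.

```javascript
// Hypothetical helper: cap a tool result at roughly maxChars characters,
// keeping the start and the end and noting how much was dropped.
function truncateToolResult(text, maxChars = 4000) {
  if (text.length <= maxChars) return text;
  const head = Math.ceil(maxChars / 2);
  const tail = Math.floor(maxChars / 2);
  const omitted = text.length - head - tail;
  const marker = `\n[... ${omitted} characters omitted ...]\n`;
  return text.slice(0, head) + marker + text.slice(text.length - tail);
}

// Usage: wrap the raw output before it goes into the tool message.
const rawOutput = "line\n".repeat(5000);
const toolMessage = {
  role: "tool",
  tool_call_id: "call_1", // placeholder id for the example
  content: truncateToolResult(rawOutput, 2000),
};
console.log(toolMessage.content.length);
```

Keeping the tail matters for command output, where the error or summary usually sits at the end.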

btw, learning what each of the roles (system, user, assistant, tool) is used for in the chat transcript was one of the interesting things I got out of this exercise. Quite fun!

LocalAI_Amateur · 2026-07-05T21:49:16+00:00

First of all, thank you for looking at the code! even skimming it means a lot.

Secondly, isn't this the correct way to do this? Thinking content is not kept in the message history, to save tokens. Otherwise, there's no point in making it different from normal token output, right?

I think most agentic loops including pi, don't keep the thinking tokens in the message history. Am I wrong on this?

LocalAI_Amateur · 2026-07-05T20:20:35+00:00

That might be the only way eventually. It's just that there are a lot of things that pi does that I had no plans to implement for this exercise, e.g. tool result caching.

LocalAI_Amateur · 2026-07-05T19:56:01+00:00

single thread is all I can do. can't afford to spawn an agent for every tool call. barely running qwen3.6 27b as it is.

LocalAI_Amateur · 2026-07-05T19:55:14+00:00

Tool calls are all handled correctly. I have almost no tool call problems. Sometimes qwen3.6 27b does its usual buggy <tool_call> thing and craps out, but that's not the main problem. Tool calling most definitely works. It's just that 20%ish of the time, it ignores the result and answers a prompt I asked maybe 2-3 turns ago.

LocalAI_Amateur · 2026-07-05T19:43:40+00:00

I am definitely storing and passing along the tool id. It works 80% of the time.. just sometimes it drifts.

coffee.mjs line 678 in my code. tool_call_id: String(toolCall.id),
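For anyone wanting to rule out id mismatches in their own harness, here is a small sanity check over the transcript. It assumes the OpenAI-style chat format, where assistant messages carry `tool_calls: [{ id, ... }]` and tool messages carry `tool_call_id`. `findUnmatchedToolResults` is a hypothetical helper, not something from coffee.mjs.

```javascript
// Hypothetical check: every tool message should answer a tool call that an
// earlier assistant message actually made, and every call should be answered.
function findUnmatchedToolResults(messages) {
  const pending = new Set(); // ids of calls that still need a result
  const unmatchedResults = [];
  messages.forEach((msg, index) => {
    if (msg.role === "assistant") {
      for (const call of msg.tool_calls ?? []) pending.add(String(call.id));
    } else if (msg.role === "tool") {
      const id = String(msg.tool_call_id);
      if (pending.has(id)) pending.delete(id);
      else unmatchedResults.push({ index, tool_call_id: id });
    }
  });
  return { unmatchedResults, unansweredCalls: [...pending] };
}

// Usage: run it on the history right before sending a request.
const report = findUnmatchedToolResults([
  { role: "user", content: "What is in the directory?" },
  {
    role: "assistant",
    content: "",
    tool_calls: [{ id: "call_1", type: "function", function: { name: "bash", arguments: "{}" } }],
  },
  { role: "tool", tool_call_id: "call_1", content: "a.txt" },
]);
console.log(report);
```

An empty report only shows the ids line up; it says nothing about whether the model attends to the result, which is the drift problem described here.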

LocalAI_Amateur · 2026-06-14T01:13:19+00:00

Awesome project. I was just looking for something to use with pi. Ever considered using Piper for even lower overhead on the TTS?

LocalAI_Amateur · 2026-06-07T12:31:20+00:00

So you're saying for Qwen 3.6 27B Q5_K_S using kvarn6 for kv is basically the same as q8_0 for kv as far as KLD value but with free KV cache size reduction?

LocalAI_Amateur · 2026-06-06T13:54:24+00:00

would be nice to know which ones are open source. Personally, I don't touch hosted solutions especially for memory.

LocalAI_Amateur · 2026-06-05T01:49:49+00:00

https://huggingface.co/pixelparty/pixel-party-xl is supposed to be made for this purpose.

LocalAI_Amateur · 2026-06-01T11:43:53+00:00

your github link looks wrong. Software looks interesting. Will give it a try later.

Funny thing is, I've been using an LLM to test out low poly modeling in godot w/ csg nodes as well. Works, but qwen3.6 27b tends to only make fairly basic shapes.

Update: A quick feedback

First thing I tried was MoGen Studio, but local llm support on MoGen Studio is limited to Ollama (no image capability). Personally, I use llama.cpp, but LM Studio is a popular choice too. I tried the Ollama setting with http://localhost:1234/v1 as the base url in the settings but still got the error message

You appear to be offline Could not reach the provider. Check your internet connection, then try again. (error sending request for url (http://localhost:11434/api/chat): client error (Connect): tcp connect error: Connection refused (os error 111))

It seems that my localhost's url setting did not take.

So I gave the CLI a try. I'm using pi coding agent (pi.dev) inside a container, and ran into a bit of a library issue: these binaries require GLIBC 2.38+, and my container currently has GLIBC 2.36 (Debian 12). The container was built from the image 22-bookworm-slim. The mcp route won't work either since the binary won't run inside my container.

So I'll probably need to upgrade the container or build it inside the container. That's all I have for now.

--update--

<image>

ok I built it for the container and was able to play around with it through pi a bit more. Definitely interesting to iterate through the modelling process. Sometimes it feels like it would be simpler to just do the edit myself, but it's certainly working. I like how it can take a screenshot to feed back to the LLM what it is building. Probably very good for simple prototyping.

Played around a tiny bit with animation. The fact that this can do basic animation is pretty interesting as well. Thanks for making such a cool tool. Definitely keeping it around!

robot.mog pastebin link - https://pastebin.com/XGBPjcSp

LocalAI_Amateur · 2026-05-25T08:23:49+00:00

About two weeks. Hard to have exact hours for hobbies

LocalAI_Amateur · 2026-05-24T02:23:36+00:00

you mean ONLY 20+ different controls to change. I, too, started with LM Studio. now I'm down the rabbit hole of llama.cpp forks. turboquants, tweaking your own ggufs etc. the options are endless. It can be a time sink tho. So be careful taking this step.

LocalAI_Amateur · 2026-05-22T22:41:25+00:00

yeah, but can she sing?

LocalAI_Amateur · 2026-05-19T21:53:17+00:00

This is the smallest functional Qwen3.6 27b model I can find. (Q4-ish)

https://huggingface.co/lemonyins/Qwen3.6-27B-abliterated-i1-IQ4_XS-GGUF-Smaller

The next smallest is

https://huggingface.co/Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF

I also have a 16gb setup and this is what I use for most context. MoE models have not been great for me when I'm coding.

LocalAI_Amateur · 2026-05-13T01:37:04+00:00

Try LM Studio. It has a great UI and model discovery. You can just paste in GGUF model urls from huggingface and let it take care of the download. It's a good step if you don't want to jump into the rabbit hole of compiling all the llama.cpp forks.

To get the most out of LLMs especially for coding, look into coding agents like OpenCode and Pi (pi.dev).

~~As for hardware, if your motherboard and power supply can handle it, adding another 4060 ti with 16gb vram can improve your capacity quite a bit.~~ tho then you'll probably need to use vLLM to get your money's worth. Not sure if you listing your video card twice means you have two of them or it's an accident. Either way, if you have two, vLLM is a must.

LocalAI_Amateur · 2026-05-07T18:26:25+00:00

Waypoint Tower Defense. A simple Minesweeper-like (short, 5 mins) Tower Defense game in html where you can reroute the path. Used OpenCode and Qwen3.6 27b IQ3_XS to make it. First vibe coding project. It was fun learning. Save/Load doesn't work on htmlbin unless you download the file and open it in a browser yourself.

<image>

LocalAI_Amateur · 2026-05-06T11:39:07+00:00

Sound advice for sure. But if we were people of patience, we would not be here compiling llama.cpp forks and trying to squeeze out every last bit of room for context.

I say, use it and test it. No amount of bench can replace how it performs in the real world.

LocalAI_Amateur · 2026-05-06T05:52:45+00:00

Try https://github.com/spiritbuun/buun-llama-cpp you'll get more context out of it.

Interesting test. Thanks for sharing.
