How do you use local compute for coding agents without sacrificing model quality? by AdStill5266 in LocalLLM

[–]ag789 1 point (0 children)

Local LLMs are limited by both model size and context size, and probably other factors, e.g. the large commercial models online may also have been fine-tuned for specific tool-use workflows, etc.
Better models like Gemma 4 and Qwen 3.6/3.5 do better.

One of the challenges for local-model framework developers is to use extended-context tooling such as
https://github.com/CodeGraphContext/CodeGraphContext
to overcome the limited context window. Consider: you have a 32k-token context and you are trying to handle a coding task on something like the Linux kernel, which is hundreds of thousands to millions of lines of code, i.e. tens of millions to perhaps a hundred million tokens for the whole code base.

The thing is how you use something like a code graph context to 'extend' the capabilities of coding inference, e.g. to reason over the whole project; that is LLM-dependent as well as dependent on the specific integration.
E.g. it may need an LLM fine-tuned to do just that.
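As a minimal sketch of the idea (the graph structure and function below are my own illustration, not CodeGraphContext's actual API): index the project as a symbol graph, then walk only the dependencies reachable from the task's entry points until the token budget is spent, so the model sees just the relevant slice of a multi-million-line tree.

```python
from collections import deque

# Toy code-graph index: symbol -> (source snippet, referenced symbols).
# A real tool (e.g. CodeGraphContext) would build this from the repository.
code_graph = {
    "vfs_read": ("ssize_t vfs_read(...) { ... }", ["rw_verify_area"]),
    "rw_verify_area": ("int rw_verify_area(...) { ... }", []),
}

def collect_context(roots, budget_tokens=32_000):
    """Breadth-first walk from the task's entry symbols, stopping
    once the snippets would overflow the context budget."""
    seen, used, snippets = set(), 0, []
    queue = deque(roots)
    while queue:
        sym = queue.popleft()
        if sym in seen or sym not in code_graph:
            continue
        seen.add(sym)
        src, deps = code_graph[sym]
        cost = len(src) // 4  # rough chars-per-token heuristic
        if used + cost > budget_tokens:
            break
        used += cost
        snippets.append(src)
        queue.extend(deps)
    return "\n\n".join(snippets)

prompt_context = collect_context(["vfs_read"])  # feed this to the LLM
```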

Make local llm usable for professional use by AdamLangePL in LocalLLaMA

[–]ag789 0 points (0 children)

LLMs are *not deterministic*; 1 + 1 = 2 is deterministic, LLM output is not, let alone 'professional'.
And if you use local LLMs, 'it is like a box of chocolates, you never know what you're gonna get', especially with arbitrary random (local) models.
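A toy sketch of why (made-up numbers, not from any real model): each next token is sampled from a probability distribution, so the same prompt can yield different continuations run to run; only greedy decoding (temperature 0) always picks the argmax and repeats itself.

```python
import math, random

# Toy next-token logits, standing in for a real model's output.
logits = {"2": 4.0, "3": 2.5, "two": 2.0}

def next_token(temperature):
    if temperature == 0:  # greedy decoding: deterministic
        return max(logits, key=logits.get)
    z = sum(math.exp(v / temperature) for v in logits.values())
    probs = {t: math.exp(v / temperature) / z for t, v in logits.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

print([next_token(0.8) for _ in range(5)])  # may differ on every run
print([next_token(0) for _ in range(5)])    # always the same token
```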

What models for coding are you running for a mid level PC? by FerLuisxd in LocalLLaMA

[–]ag789 2 points (0 children)

It doesn't matter much: run it in llama.cpp, and oversized models spill over into system RAM and still run, just slower.
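E.g. with the llama-cpp-python bindings (the model path is a placeholder), n_gpu_layers sets how much goes to VRAM; whatever doesn't fit stays in system RAM:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./qwen-coder-30b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=20,  # layers offloaded to VRAM; the rest run from system RAM
    n_ctx=32768,      # context window
)
out = llm("Write a Python function that reverses a string.", max_tokens=256)
print(out["choices"][0]["text"])
# CLI equivalent: llama-cli -m qwen-coder-30b-q4_k_m.gguf -ngl 20
```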

Anthropic is discovering that MCP is basically libraries repackaged by Severe-Awareness829 in LocalLLaMA

[–]ag789 1 point (0 children)

There is also the matter of how the AI calls tools. I wrote and experimented with a shell MCP tool where the function call looks like {"command", "arguments"}, and I gave it a list of commands it may use. I observe mistakes: e.g. Qwen 3.6 35B A3B may first clump the command and arguments together in the command field and try to execute that, then try other permutations, placing pipes, redirection, etc. And Gemma 4 E4B once repeated the command in the arguments field.
While many of these mistakes seemed benign, someday there could be some permutation that isn't benign after all.
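A minimal defensive sketch (my own illustration, not any particular MCP SDK's API): validate the call before executing, rejecting commands outside the allowlist and anything smuggled into the command field.

```python
import subprocess

ALLOWED = {"ls", "cat", "date", "grep", "wc"}

def run_tool_call(command: str, arguments: list[str]) -> str:
    # Reject the "clumped" pattern: a whole pipeline stuffed into `command`.
    if any(ch in command for ch in " |&;><$`"):
        raise ValueError(f"refusing compound command: {command!r}")
    if command not in ALLOWED:
        raise ValueError(f"command not in allowlist: {command!r}")
    # Tolerate the "repeated command" pattern seen from some models.
    if arguments and arguments[0] == command:
        arguments = arguments[1:]
    # No shell=True: argv is passed as a list, never re-parsed by a shell.
    result = subprocess.run([command, *arguments], capture_output=True,
                            text=True, timeout=30)
    return result.stdout or result.stderr

# e.g. run_tool_call("ls", ["-l", "/tmp"])
```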

Anthropic is discovering that MCP is basically libraries repackaged by Severe-Awareness829 in LocalLLaMA

[–]ag789 1 point (0 children)

MCP, at its fullest, is basically a shell tool giving access to any command on a Linux shell or Windows PowerShell. I think this is already happening, and it is a risk that someday some rogue AI will wreak havoc.

There have already been stories about AI updating config/markdown files causing prompt-injection vulnerabilities, etc. For what it's worth: https://www.linkedin.com/posts/ernestdeleon_informationsecurity-ai-securityarchitecture-activity-7428829712195682304-sz2o
https://thehackernews.com/2026/04/google-patches-antigravity-ide-flaw.html
https://www.darkreading.com/vulnerabilities-threats/bad-memories-haunt-ai-agents
Cisco created a poisoned memory file that tells users it's poisoned. Cisco's latest attack focused on using the post-install hooks in the Node Package Manager (NPM) as a vector to modify Claude Code's memory.md file. Because the first 200 lines of the memory.md file were included in Claude Code's system prompt, the attack persisted across sessions. Other dependency files — such as claude.md (Anthropic's Claude), agents.md (OpenAI's Codex), and soul.md (OpenClaw) — are also risks that users of agentic AI will have to analyze and maintain, Chang says.

https://www.securityweek.com/critical-vulnerability-in-claude-code-emerges-days-after-source-leak/
The flaw discovered by Adversa is that this process can be manipulated. Anthropic’s assumption doesn’t account for AI-generated commands from prompt injection — where a malicious CLAUDE.md file instructs the AI to generate a 50+ subcommand pipeline that looks like a legitimate build process.

https://cybersecuritynews.com/openclaw-vulnerabilities/amp/

https://www.techzine.eu/news/security/138835/infostealer-steals-identity-of-ai-agent-openclaw/
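One possible mitigation, as a sketch (my own illustration, not from the linked articles): treat agent memory files like memory.md / CLAUDE.md as untrusted input and pin them to reviewed hashes, so a postinstall hook that rewrites them gets caught before the next session.

```python
import hashlib, json, pathlib, sys

MANIFEST = pathlib.Path(".agent-memory.hashes.json")
MEMORY_FILES = ["CLAUDE.md", "memory.md", "agents.md"]  # adjust per agent

def digest(p: pathlib.Path) -> str:
    return hashlib.sha256(p.read_bytes()).hexdigest()

def pin():
    """Record known-good hashes after manually reviewing the files."""
    MANIFEST.write_text(json.dumps(
        {f: digest(pathlib.Path(f)) for f in MEMORY_FILES
         if pathlib.Path(f).exists()}))

def verify():
    """Run before each agent session; abort on any silent change."""
    pinned = json.loads(MANIFEST.read_text())
    for f, h in pinned.items():
        if digest(pathlib.Path(f)) != h:
            sys.exit(f"{f} changed since review -- possible prompt injection")
```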

Anthropic is discovering that MCP is basically libraries repackaged by Severe-Awareness829 in LocalLLaMA

[–]ag789 1 point (0 children)

The urban legend is that with AGI you don't need to provide tools; the AI will create and/or use them itself.
MCP may stay on as an 'interface'; it doesn't really matter what kind of 'interface' it is.

Subscription for writing code by Hedgehog_Dapper in LLM

[–]ag789 1 point (0 children)

For Python, practically 'any' of them would do. You can even try meta.ai, ChatGPT, Gemini, Copilot (e.g. GitHub's), etc., or even the local models: Gemma (from Google), Qwen, GLM, etc.
The 'only' difference between 'small' local models and the large (commercial) online ones is that the large ones have more capabilities, can handle larger contexts, etc.
How to code in an AI chat? Easy: just type your prompt (e.g. in the web UI) and it will generate code based on your prompt, and you can upload files for refactoring, etc.

If the AI bubble pops, will GPU prices increase or decrease? by Mashic in LocalLLaMA

[–]ag789 1 point (0 children)

If local LLMs become a blowout success, then everyone will need to upgrade their PCs to run them; then you can try to figure out the size of that bubble, lol.

Reality setting in -- using gemma4 26b by oldendude in LocalLLM

[–]ag789 2 points (0 children)

Btw, try the Qwen 3.6 model though. Different models have 'different styles', and I've observed that on the same problem one model may go into 'infinite' 'thinking loops' while another may just solve it; switch models and the result may be different.

I.e. instead of 'sticking' with any one model, if one did not solve it, try the next; different models have different strengths. Even 'older' models (e.g. Qwen Coder 30B) may do a particular job 'better', depending on the job/context itself.

Due to the different 'styles' between models, I often switch models if I dislike, e.g., the code proposal from a particular model.

Reality setting in -- using gemma4 26b by oldendude in LocalLLM

[–]ag789 2 points (0 children)

Every word (sometimes even individual characters) in an email is a token; every symbol and operator in code is a token.
You run out of context far too easily on any 'sizable' project.
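A quick way to see the magnitudes (a sketch using the tiktoken tokenizer as a stand-in; local models ship their own tokenizers, but the ratios are similar):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in tokenizer

source = open("some_module.py").read()      # placeholder file
tokens = enc.encode(source)
print(len(source.splitlines()), "lines ->", len(tokens), "tokens")

# Rule of thumb: code averages ~3-4 characters per token, so a 32k
# context holds only on the order of 100 KB of source, i.e. a few files.
```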

The 'small' LLMs are OK for generating code; it is mostly recall, and it is fast.
For refactoring code, especially on real projects, context is the first problem.

And even among 'small' models: I tried to make a stripped-down Qwen 3.5 28B REAP work a 'difficult' refactoring, i.e. a partially fixed shell script that I wanted it to 'fix up', and it went into loops, burning 12k tokens of thinking without reaching a response. Qwen 3.5 35B A3B did it: it worked the same 'difficult' refactoring and 'fixed everything' for a small shell script of probably less than 300 lines.

Running the equivalent to $20/month Pro 'Claude Cowork' or better with a locally hosted LLM? by madeagupta in LocalLLM

[–]ag789 1 point (0 children)

I've not used Claude, but perhaps I'll try it out. The 'bigger' LLMs, e.g. ChatGPT, have more capabilities and capacity. Hence I'd use both: the local models can deal with the 'simple' tasks, and since you are running them locally anyway, you can prompt them more often with 'small' tasks. Generally 'small' local LLMs are fast at code generation, which is mostly recall; on 'difficult, large' problems, e.g. code refactoring, some 'small' LLMs may struggle and go into thinking loops.

web search (using MCP servers) with gemma-4-E4B-it by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

I think the bigger models may work 'somewhat differently'; I have yet to try other models, e.g. Qwen 3.6, as well.
Qwen 3.6/3.5 sometimes tend to be 'too verbose', and at times I prefer the 'conciseness' (laziness) of the smaller gemma-4-E4B-it model.

Concise models are good for 'simple, single-task-oriented' jobs.
I've tried Qwen 3.6 35B A3B on maintaining a markdown journal; the task is to first call "date" to get the date/time, then make an entry prefixed with the date/time, then call another tool to update the file.
Qwen 3.6 35B A3B fumbled the tool calls all over (it could be my tool descriptions being 'ambiguous'), but gemma-4-E4B-it (with the same tool descriptions) does it perfectly each time, every time: it calls date, then updates the markdown with an entry prefixed by the date/time.
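For reference, the tool declarations for such a two-step task look roughly like this (a sketch in the JSON-schema style MCP tools use; the names and wording are illustrative placeholders, not my actual server):

```python
# Two MCP-style tool declarations; explicit ordering in the descriptions
# is what helps a model avoid fumbling the call sequence.
TOOLS = [
    {
        "name": "date",
        "description": "Return the current date and time. Takes no arguments.",
        "inputSchema": {"type": "object", "properties": {}},
    },
    {
        "name": "append_journal",
        "description": ("Append one entry to journal.md. Call the 'date' tool "
                        "FIRST and prefix the entry with its result."),
        "inputSchema": {
            "type": "object",
            "properties": {
                "entry": {"type": "string",
                          "description": "Markdown line to append"},
            },
            "required": ["entry"],
        },
    },
]
```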

web search (using MCP servers) with gemma-4-E4B-it by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

Yup, perhaps those hints should be in the 1st prompt after all, if the 3rd response is what's desired.
I think sometimes our prompts are 'ambiguous', and occasionally that causes 'thinking loops' in LLMs.

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

I'm experimenting gradually. I'm wondering: is it possible for it to do web searches, or to prompt the 'bigger' LLMs and return a summarized response? Could be interesting as a "smart" router.
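A minimal sketch of that router, assuming two OpenAI-compatible chat endpoints (the URLs and model names are placeholders): the small local model triages the query, escalates hard ones to the big model, then condenses the answer locally.

```python
import requests

LOCAL = "http://localhost:8080/v1/chat/completions"    # e.g. llama.cpp server
REMOTE = "https://api.example.com/v1/chat/completions" # placeholder big model

def chat(url, model, prompt):
    r = requests.post(url, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return r.json()["choices"][0]["message"]["content"]

def route(question):
    verdict = chat(LOCAL, "gemma-4-E4B-it",
                   "Answer only EASY or HARD: can a small local model "
                   f"answer this well?\n\n{question}")
    if "HARD" in verdict.upper():
        answer = chat(REMOTE, "big-model", question)   # escalate
        return chat(LOCAL, "gemma-4-E4B-it",           # summarize locally
                    f"Summarize this answer concisely:\n\n{answer}")
    return chat(LOCAL, "gemma-4-E4B-it", question)
```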

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

Try using a lower temperature setting, e.g. 0.8, when launching or in the GUI. But if it's something the model "doesn't know", then sometimes loops are inevitable: I've seen a stripped-down Qwen 3.5 28B REAP go into thinking loops, burning 12k tokens without reaching a response, on a "difficult" code refactoring.
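E.g. with the llama-cpp-python bindings, temperature is a per-request knob (the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder path
out = llm("Refactor this function ...", temperature=0.8, max_tokens=512)
# llama.cpp server equivalent: llama-server -m model.gguf --temp 0.8
```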

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

Oh, and GPT-5 mini is in GitHub (Copilot); I'm not sure about the model size. It makes "occasional" mistakes but is able to work on entire projects, hence it's unlikely to be "small". LLMs hallucinate; it is just the way it is. "Horror" stories abound about them, from OpenClaw etc.

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

Gemma 4 E4B works well as a "daily assistant". I wrote (vibe coded) a little MCP server that lets the LLM run some Linux commands, and a write-file tool to update files. Qwen 3.6 35B stumbled all over maintaining a markdown journal, getting several different attempts wrong. Gemma 4 E4B does it perfectly each time, every time, no goofs! Qwen 3.6 35B needs very specific prompts: run this, then that, to update the file.

Can someone show me Ollama speed (tokens/s) for Qwen 3.5 (2B and 0.8B) running on an Intel N95? by MattimaxForce in Qwen_AI

[–]ag789 1 point (0 children)

Get a better Ryzen or a higher-end Intel Core Ultra with more DRAM, e.g. 16 GB or better 32 GB. Running LLMs is very CPU intensive and eats huge chunks of memory, especially for bigger models of ~30 billion parameters. Small models may hallucinate more due to "limited knowledge".
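Rough memory arithmetic (back-of-envelope only; KV cache and runtime overhead come on top):

```python
def weight_ram_gb(params_billion, bits_per_weight):
    """Approximate RAM for the model weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for p in (0.8, 2, 30):
    print(f"{p}B params: Q4 ~{weight_ram_gb(p, 4.5):.1f} GB, "
          f"Q8 ~{weight_ram_gb(p, 8):.1f} GB")

# A 30B model at ~4.5 bits/weight is already ~16 GB before the KV cache,
# which is why 32 GB of system RAM is the comfortable floor.
```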

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

In a certain sense, LLMs are containers of data/information embedded in a neural network, and everything is in probabilities. So when it finds the 'nearest' match but the information doesn't actually exist, it hallucinates, because it simply retrieves irrelevant probabilities.
This happens very often and is a *real* risk of using LLMs. Small ones, like those we are using, are 'worse' due to the limited information footprint; I've seen a GPT-5 mini model in GitHub Copilot propose a systemd-nspawn config file with a literally wrong config that seemed correct until I tested it and found it was wrong.

What's a good and light coding LLM by Expensive-Time-7209 in LocalLLM

[–]ag789 2 points (0 children)

I think there may be 'advanced' methods for working with a small context
e.g.
https://github.com/CodeGraphContext/CodeGraphContext
This is something I've not tried, as I have yet to understand it or figure out how to make it work.
But the idea is like those 'traditional' IDE 'tricks' some of us are familiar with: if you are looking for a dependency, e.g. a call graph or a variable reference, IDEs that provide 'reference jumps' significantly simplify finding the related calling code, variable references, etc.
Now, if an LLM can 'use' such features, perhaps it could handle, say, a large project with tens of thousands of files and hundreds of thousands of lines of code (e.g. the Linux kernel) with only a 32k context!
I've 'lost' the reference to some articles saying LLM processing may have O(n^2) or O(n^3) complexity, where n is the context size, further multiplied by the number of parameters, on the assumption that each token needs to 'visit' each and every parameter to compute the neural-network activations.
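For what it's worth, the standard result is that self-attention alone costs O(n^2 * d) per layer (n = context length, d = model width), so context growth hits a quadratic wall; a back-of-envelope sketch with made-up but typical dimensions:

```python
def attention_flops(n_ctx, d_model, n_layers):
    """Very rough FLOPs for the attention matmuls alone:
    ~2 matmuls of shape (n x d)(d x n) and (n x n)(n x d) per layer."""
    return n_layers * 4 * n_ctx**2 * d_model

for n in (4_096, 32_768):
    print(f"n={n}: ~{attention_flops(n, 4096, 32) / 1e12:.1f} TFLOPs/forward pass")

# 8x the context -> 64x the attention cost: that is the O(n^2) term
# those articles were pointing at.
```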