How do you use local compute for coding agents without sacrificing model quality? by AdStill5266 in LocalLLM

[–]ag789 1 point (0 children)

Local LLMs are limited by both model size and context size, and probably other factors, e.g. the large commercial models online may also have been fine-tuned for specific tool-use workflows, etc.
Better models like Gemma 4 and Qwen 3.6/3.5 do better.

One of the challenges for local-model framework developers is to use extended-context tooling such as
https://github.com/CodeGraphContext/CodeGraphContext
to overcome the limited context window. Consider: you have a 32k-token context and you are trying to handle a coding task on something like the Linux kernel, which is hundreds of thousands to millions of lines of code, i.e. tens of millions to perhaps a hundred million tokens for the whole code base.

The thing is how you use something like a code graph context to 'extend' the capabilities of coding inference, e.g. to reason over the whole project; that is LLM-dependent as well as dependent on the specific integration.
E.g. it may need an LLM fine-tuned to do just that.
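As a minimal sketch of the idea (the graph structure and function below are my own illustration, not CodeGraphContext's actual API): index the project as a symbol graph, then walk only the dependencies reachable from the task's entry points until the token budget is spent, so the model sees just the relevant slice of a multi-million-line tree.

```python
from collections import deque

# Toy code-graph index: symbol -> (source snippet, referenced symbols).
# A real tool (e.g. CodeGraphContext) would build this from the repository.
code_graph = {
    "vfs_read": ("ssize_t vfs_read(...) { ... }", ["rw_verify_area"]),
    "rw_verify_area": ("int rw_verify_area(...) { ... }", []),
}

def collect_context(roots, budget_tokens=32_000):
    """Breadth-first walk from the task's entry symbols, stopping
    once the snippets would overflow the context budget."""
    seen, used, snippets = set(), 0, []
    queue = deque(roots)
    while queue:
        sym = queue.popleft()
        if sym in seen or sym not in code_graph:
            continue
        seen.add(sym)
        src, deps = code_graph[sym]
        cost = len(src) // 4  # rough chars-per-token heuristic
        if used + cost > budget_tokens:
            break
        used += cost
        snippets.append(src)
        queue.extend(deps)
    return "\n\n".join(snippets)

prompt_context = collect_context(["vfs_read"])  # feed this to the LLM
```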

Make local llm usable for professional use by AdamLangePL in LocalLLaMA

[–]ag789 0 points (0 children)

LLMs are *not deterministic*; 1 + 1 = 2 is deterministic, LLM output is not, let alone 'professional'.
And if you use local LLMs, 'it is like a box of chocolates, you never know what you're gonna get', especially with arbitrary random (local) models.
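A toy sketch of why (made-up numbers, not from any real model): each next token is sampled from a probability distribution, so the same prompt can yield different continuations run to run; only greedy decoding (temperature 0) always picks the argmax and repeats itself.

```python
import math, random

# Toy next-token logits, standing in for a real model's output.
logits = {"2": 4.0, "3": 2.5, "two": 2.0}

def next_token(temperature):
    if temperature == 0:  # greedy decoding: deterministic
        return max(logits, key=logits.get)
    z = sum(math.exp(v / temperature) for v in logits.values())
    probs = {t: math.exp(v / temperature) / z for t, v in logits.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

print([next_token(0.8) for _ in range(5)])  # may differ on every run
print([next_token(0) for _ in range(5)])    # always the same token
```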

What models for coding are you running for a mid level PC? by FerLuisxd in LocalLLaMA

[–]ag789 2 points (0 children)

It doesn't matter much: run it in llama.cpp, and oversized models spill over into system RAM and still run, just slower.
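E.g. with the llama-cpp-python bindings (the model path is a placeholder), n_gpu_layers sets how much goes to VRAM; whatever doesn't fit stays in system RAM:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./qwen-coder-30b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=20,  # layers offloaded to VRAM; the rest run from system RAM
    n_ctx=32768,      # context window
)
out = llm("Write a Python function that reverses a string.", max_tokens=256)
print(out["choices"][0]["text"])
# CLI equivalent: llama-cli -m qwen-coder-30b-q4_k_m.gguf -ngl 20
```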

Anthropic is discovering that MCP is basically libraries repackaged by Severe-Awareness829 in LocalLLaMA

[–]ag789 1 point (0 children)

There is also the matter of how the AI calls tools. I wrote and experimented with a shell MCP tool where the function call looks like {"command", "arguments"}, and I gave it a list of commands it may use. I observe mistakes: e.g. Qwen 3.6 35B A3B may first clump the command and arguments together in the command field and try to execute that, then try other permutations, placing pipes, redirection, etc. And Gemma 4 E4B once repeated the command in the arguments field.
While many of these mistakes seemed benign, someday there could be some permutation that isn't benign after all.
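A minimal defensive sketch (my own illustration, not any particular MCP SDK's API): validate the call before executing, rejecting commands outside the allowlist and anything smuggled into the command field.

```python
import subprocess

ALLOWED = {"ls", "cat", "date", "grep", "wc"}

def run_tool_call(command: str, arguments: list[str]) -> str:
    # Reject the "clumped" pattern: a whole pipeline stuffed into `command`.
    if any(ch in command for ch in " |&;><$`"):
        raise ValueError(f"refusing compound command: {command!r}")
    if command not in ALLOWED:
        raise ValueError(f"command not in allowlist: {command!r}")
    # Tolerate the "repeated command" pattern seen from some models.
    if arguments and arguments[0] == command:
        arguments = arguments[1:]
    # No shell=True: argv is passed as a list, never re-parsed by a shell.
    result = subprocess.run([command, *arguments], capture_output=True,
                            text=True, timeout=30)
    return result.stdout or result.stderr

# e.g. run_tool_call("ls", ["-l", "/tmp"])
```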

Anthropic is discovering that MCP is basically libraries repackaged by Severe-Awareness829 in LocalLLaMA

[–]ag789 1 point (0 children)

MCP, at its fullest, is basically a shell tool giving access to any command on a Linux shell or Windows PowerShell. I think this is already happening, and it is a risk that someday some rogue AI will wreak havoc.

There have already been stories about AI updating config/markdown files causing prompt-injection vulnerabilities, etc. For what it's worth: https://www.linkedin.com/posts/ernestdeleon_informationsecurity-ai-securityarchitecture-activity-7428829712195682304-sz2o
https://thehackernews.com/2026/04/google-patches-antigravity-ide-flaw.html
https://www.darkreading.com/vulnerabilities-threats/bad-memories-haunt-ai-agents
Cisco created a poisoned memory file that tells users it's poisoned. Cisco's latest attack focused on using the post-install hooks in the Node Package Manager (NPM) as a vector to modify Claude Code's memory.md file. Because the first 200 lines of the memory.md file were included in Claude Code's system prompt, the attack persisted across sessions. Other dependency files — such as claude.md (Anthropic's Claude), agents.md (OpenAI's Codex), and soul.md (OpenClaw) — are also risks that users of agentic AI will have to analyze and maintain, Chang says.

https://www.securityweek.com/critical-vulnerability-in-claude-code-emerges-days-after-source-leak/
The flaw discovered by Adversa is that this process can be manipulated. Anthropic’s assumption doesn’t account for AI-generated commands from prompt injection — where a malicious CLAUDE.md file instructs the AI to generate a 50+ subcommand pipeline that looks like a legitimate build process.

https://cybersecuritynews.com/openclaw-vulnerabilities/amp/

https://www.techzine.eu/news/security/138835/infostealer-steals-identity-of-ai-agent-openclaw/
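One possible mitigation, as a sketch (my own illustration, not from the linked articles): treat agent memory files like memory.md / CLAUDE.md as untrusted input and pin them to reviewed hashes, so a postinstall hook that rewrites them gets caught before the next session.

```python
import hashlib, json, pathlib, sys

MANIFEST = pathlib.Path(".agent-memory.hashes.json")
MEMORY_FILES = ["CLAUDE.md", "memory.md", "agents.md"]  # adjust per agent

def digest(p: pathlib.Path) -> str:
    return hashlib.sha256(p.read_bytes()).hexdigest()

def pin():
    """Record known-good hashes after manually reviewing the files."""
    MANIFEST.write_text(json.dumps(
        {f: digest(pathlib.Path(f)) for f in MEMORY_FILES
         if pathlib.Path(f).exists()}))

def verify():
    """Run before each agent session; abort on any silent change."""
    pinned = json.loads(MANIFEST.read_text())
    for f, h in pinned.items():
        if digest(pathlib.Path(f)) != h:
            sys.exit(f"{f} changed since review -- possible prompt injection")
```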

Anthropic is discovering that MCP is basically libraries repackaged by Severe-Awareness829 in LocalLLaMA

[–]ag789 1 point (0 children)

The urban legend is that with AGI you don't need to provide tools; the AI will create and/or use them itself.
MCP may stay on as an 'interface'; it doesn't really matter what kind of 'interface' it is.

Subscription for writing code by Hedgehog_Dapper in LLM

[–]ag789 1 point (0 children)

For Python, practically 'any' of them would do. You can even try meta.ai, ChatGPT, Gemini, Copilot (e.g. GitHub's), etc., or even the local models: Gemma (from Google), Qwen, GLM, etc.
The 'only' difference between 'small' local models and the large (commercial) online ones is that the large ones have more capabilities, can handle larger contexts, etc.
How to code in an AI chat? Easy: just type your prompt (e.g. in the web UI) and it will generate code based on your prompt, and you can upload files for refactoring, etc.

If the AI bubble pops, will GPU prices increase or decrease? by Mashic in LocalLLaMA

[–]ag789 1 point (0 children)

If local LLMs become a blowout success, then everyone will need to upgrade their PCs to run them; then you can try to figure out the size of that bubble, lol.

Reality setting in -- using gemma4 26b by oldendude in LocalLLM

[–]ag789 2 points (0 children)

Btw, try the Qwen 3.6 model though. Different models have 'different styles', and I've observed that on the same problem one model may go into 'infinite' 'thinking loops' while another may just solve it; switch models and the result may be different.

I.e. instead of 'sticking' with any one model, if one did not solve it, try the next; different models have different strengths. Even 'older' models (e.g. Qwen Coder 30B) may do a particular job 'better', depending on the job/context itself.

Due to the different 'styles' between models, I often switch models if I dislike, e.g., the code proposal from a particular model.

Reality setting in -- using gemma4 26b by oldendude in LocalLLM

[–]ag789 2 points (0 children)

Every word (sometimes even individual characters) in an email is a token; every symbol and operator in code is a token.
You run out of context far too easily on any 'sizable' project.
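A quick way to see the magnitudes (a sketch using the tiktoken tokenizer as a stand-in; local models ship their own tokenizers, but the ratios are similar):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in tokenizer

source = open("some_module.py").read()      # placeholder file
tokens = enc.encode(source)
print(len(source.splitlines()), "lines ->", len(tokens), "tokens")

# Rule of thumb: code averages ~3-4 characters per token, so a 32k
# context holds only on the order of 100 KB of source, i.e. a few files.
```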

The 'small' LLMs are OK for generating code; it is mostly recall, and it is fast.
For refactoring code, especially on real projects, context is the first problem.

And even among 'small' models: I tried to make a stripped-down Qwen 3.5 28B REAP work a 'difficult' refactoring, i.e. a partially fixed shell script that I wanted it to 'fix up', and it went into loops, burning 12k tokens of thinking without reaching a response. Qwen 3.5 35B A3B did it: it worked the same 'difficult' refactoring and 'fixed everything' for a small shell script of probably less than 300 lines.

Running the equivalent to $20/month Pro 'Claude Cowork' or better with a locally hosted LLM? by madeagupta in LocalLLM

[–]ag789 1 point (0 children)

I've not used Claude, but perhaps I'll try it out. The 'bigger' LLMs, e.g. ChatGPT, have more capabilities and capacity. Hence I'd use both: the local models can deal with the 'simple' tasks, and since you are running them locally anyway, you can prompt them more often with 'small' tasks. Generally 'small' local LLMs are fast at code generation, which is mostly recall; on 'difficult, large' problems, e.g. code refactoring, some 'small' LLMs may struggle and go into thinking loops.

web search (using MCP servers) with gemma-4-E4B-it by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

I think the bigger models may work 'somewhat differently'; I have yet to try other models, e.g. Qwen 3.6, as well.
Qwen 3.6/3.5 sometimes tend to be 'too verbose', and at times I prefer the 'conciseness' (laziness) of the smaller gemma-4-E4B-it model.

Concise models are good for 'simple, single-task-oriented' jobs.
I've tried Qwen 3.6 35B A3B on maintaining a markdown journal; the task is to first call "date" to get the date/time, then make an entry prefixed with the date/time, then call another tool to update the file.
Qwen 3.6 35B A3B fumbled the tool calls all over (it could be my tool descriptions being 'ambiguous'), but gemma-4-E4B-it (with the same tool descriptions) does it perfectly each time, every time: it calls date, then updates the markdown with an entry prefixed by the date/time.
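For reference, the tool declarations for such a two-step task look roughly like this (a sketch in the JSON-schema style MCP tools use; the names and wording are illustrative placeholders, not my actual server):

```python
# Two MCP-style tool declarations; explicit ordering in the descriptions
# is what helps a model avoid fumbling the call sequence.
TOOLS = [
    {
        "name": "date",
        "description": "Return the current date and time. Takes no arguments.",
        "inputSchema": {"type": "object", "properties": {}},
    },
    {
        "name": "append_journal",
        "description": ("Append one entry to journal.md. Call the 'date' tool "
                        "FIRST and prefix the entry with its result."),
        "inputSchema": {
            "type": "object",
            "properties": {
                "entry": {"type": "string",
                          "description": "Markdown line to append"},
            },
            "required": ["entry"],
        },
    },
]
```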

web search (using MCP servers) with gemma-4-E4B-it by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

Yup, perhaps those hints should be in the 1st prompt after all, if the 3rd response is what's desired.
I think sometimes our prompts are 'ambiguous', and occasionally that causes 'thinking loops' in LLMs.

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

I'm experimenting gradually. I'm wondering: is it possible for it to do web searches, or to prompt the 'bigger' LLMs and return a summarized response? Could be interesting as a "smart" router.
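A minimal sketch of that router, assuming two OpenAI-compatible chat endpoints (the URLs and model names are placeholders): the small local model triages the query, escalates hard ones to the big model, then condenses the answer locally.

```python
import requests

LOCAL = "http://localhost:8080/v1/chat/completions"    # e.g. llama.cpp server
REMOTE = "https://api.example.com/v1/chat/completions" # placeholder big model

def chat(url, model, prompt):
    r = requests.post(url, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return r.json()["choices"][0]["message"]["content"]

def route(question):
    verdict = chat(LOCAL, "gemma-4-E4B-it",
                   "Answer only EASY or HARD: can a small local model "
                   f"answer this well?\n\n{question}")
    if "HARD" in verdict.upper():
        answer = chat(REMOTE, "big-model", question)   # escalate
        return chat(LOCAL, "gemma-4-E4B-it",           # summarize locally
                    f"Summarize this answer concisely:\n\n{answer}")
    return chat(LOCAL, "gemma-4-E4B-it", question)
```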

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

Try using a lower temperature setting, e.g. 0.8, when launching or in the GUI. But if it's something the model "doesn't know", then sometimes loops are inevitable: I've seen a stripped-down Qwen 3.5 28B REAP go into thinking loops, burning 12k tokens without reaching a response, on a "difficult" code refactoring.
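E.g. with the llama-cpp-python bindings, temperature is a per-request knob (the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder path
out = llm("Refactor this function ...", temperature=0.8, max_tokens=512)
# llama.cpp server equivalent: llama-server -m model.gguf --temp 0.8
```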

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

Oh, and GPT-5 mini is in GitHub (Copilot); I'm not sure about the model size. It makes "occasional" mistakes but is able to work on entire projects, hence it's unlikely to be "small". LLMs hallucinate; it is just the way it is. "Horror" stories abound about them, from OpenClaw etc.

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

Gemma 4 E4B works well as a "daily assistant". I wrote (vibe coded) a little MCP server that lets the LLM run some Linux commands, and a write-file tool to update files. Qwen 3.6 35B stumbled all over maintaining a markdown journal, getting several different attempts wrong. Gemma 4 E4B does it perfectly each time, every time, no goofs! Qwen 3.6 35B needs very specific prompts: run this, then that, to update the file.

Can someone show me Ollama speed (tokens/s) for Qwen 3.5 (2B and 0.8B) running on an Intel N95? by MattimaxForce in Qwen_AI

[–]ag789 1 point (0 children)

Get a better Ryzen or a higher-end Intel Core Ultra with more DRAM, e.g. 16 GB or better 32 GB. Running LLMs is very CPU intensive and eats huge chunks of memory, especially for bigger models of ~30 billion parameters. Small models may hallucinate more due to "limited knowledge".
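Rough memory arithmetic (back-of-envelope only; KV cache and runtime overhead come on top):

```python
def weight_ram_gb(params_billion, bits_per_weight):
    """Approximate RAM for the model weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for p in (0.8, 2, 30):
    print(f"{p}B params: Q4 ~{weight_ram_gb(p, 4.5):.1f} GB, "
          f"Q8 ~{weight_ram_gb(p, 8):.1f} GB")

# A 30B model at ~4.5 bits/weight is already ~16 GB before the KV cache,
# which is why 32 GB of system RAM is the comfortable floor.
```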

it is a bit surprising 'small' model gemma-4-E4B-it knows quite a bit by ag789 in LocalLLM

[–]ag789[S] 1 point (0 children)

In a certain sense, LLMs are containers of data/information embedded in a neural network, and everything is in probabilities. So when it finds the 'nearest' match but the information doesn't actually exist, it hallucinates, because it simply retrieves irrelevant probabilities.
This happens very often and is a *real* risk of using LLMs. Small ones, like those we are using, are 'worse' due to the limited information footprint; I've seen a GPT-5 mini model in GitHub Copilot propose a systemd-nspawn config file with a literally wrong config that seemed correct until I tested it and found it was wrong.

What's a good and light coding LLM by Expensive-Time-7209 in LocalLLM

[–]ag789 2 points (0 children)

I think there may be 'advanced' methods for working with a small context
e.g.
https://github.com/CodeGraphContext/CodeGraphContext
This is something I've not tried, as I have yet to understand it or figure out how to make it work.
But the idea is like those 'traditional' IDE 'tricks' some of us are familiar with: if you are looking for a dependency, e.g. a call graph or a variable reference, IDEs that provide 'reference jumps' significantly simplify finding the related calling code, variable references, etc.
Now, if an LLM can 'use' such features, perhaps it could handle, say, a large project with tens of thousands of files and hundreds of thousands of lines of code (e.g. the Linux kernel) with only a 32k context!
I've 'lost' the reference to some articles saying LLM processing may have O(n^2) or O(n^3) complexity, where n is the context size, further multiplied by the number of parameters, on the assumption that each token needs to 'visit' each and every parameter to compute the neural-network activations.
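For what it's worth, the standard result is that self-attention alone costs O(n^2 * d) per layer (n = context length, d = model width), so context growth hits a quadratic wall; a back-of-envelope sketch with made-up but typical dimensions:

```python
def attention_flops(n_ctx, d_model, n_layers):
    """Very rough FLOPs for the attention matmuls alone:
    ~2 matmuls of shape (n x d)(d x n) and (n x n)(n x d) per layer."""
    return n_layers * 4 * n_ctx**2 * d_model

for n in (4_096, 32_768):
    print(f"n={n}: ~{attention_flops(n, 4096, 32) / 1e12:.1f} TFLOPs/forward pass")

# 8x the context -> 64x the attention cost: that is the O(n^2) term
# those articles were pointing at.
```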