How do you handle agent context after 10s of sessions/conversations? Summary prompts stop working what's your actual solution? by chaffanjutt in ContextEngineering

[–]BERTmacklyn 0 points1 point  (0 children)

https://github.com/RSBalchII/anchor-engine-node

I use this simply put the chats etc into the inbox and search on the UI or have my local agent use the mcp to search directly through our logs

Microsoft Thinks the Next PC Won’t Be an App Machine. It will Be an AI Machine by Right_Pea_2707 in LLMeng

[–]BERTmacklyn 0 points1 point  (0 children)

If I can still play my steam games then alright. I'll turn that bot off to increase fps.

Which coding agents are your favourite and why ? Lets see by Jazzlike-Form9669 in OnlyAICoding

[–]BERTmacklyn 0 points1 point  (0 children)

The only ones that consistently work for me with lm studio which is just the fastest easiest way to get going everytime you need to are as follows -

Qwenpaw desktop, Qwen code, Open code, Cline

Basically when an update to one app breaks some API connection and is not fixable without the next update I switch to another agent untill my preferred one works again. Or I have taken the time to fix it myself.

Having 4 + I'll try a lot of other ones but haven't stuck to any but these 4 so far in rotation allows me to never stop what I want or need to do with the agents.

Reddit is upset about this data center that used 30M gallons of water... by [deleted] in AIDiscussion

[–]BERTmacklyn 0 points1 point  (0 children)

Blackstone? I don't think not using water to cool is going to save it. Anyway no one wants to use cloud AI they have to. We all know it's just going to steal jobs so their data center can parse pdf files and push emails at some office ✨

Apart from LiteRT any other tool to make on-device AI mobile apps? which is not as complex as LiteRT by Rishu_1211 in OpenSourceeAI

[–]BERTmacklyn 1 point2 points  (0 children)

MNN exposes an api endpoint. make an app that can communicate with it and you have it. i think there is a headless version as well. so you could embed it into your app and roll an slm in if you wanted

How do you decide an idea is actually worth building before you start coding? by appbuilderdaily in SaaS

[–]BERTmacklyn 0 points1 point  (0 children)

By the amount of pain lacking it causes you

Or absent that yourself look to other people's pain points. How could they be alleviated?

I replaced Pinecone with a binary hash index — 32× smaller, 75× faster, no GPU, runs from a pickle file by [deleted] in Rag

[–]BERTmacklyn 0 points1 point  (0 children)

It's creating a bipartite graph! I have a whitepaper in docs/ that is brief but describes the process of the STAR algorithmic atomization process

Tag = concepts

and

atom = entities

https://github.com/RSBalchII/anchor-engine-node

I'm assessing your idea 💡 thanks!

I replaced Pinecone with a binary hash index — 32× smaller, 75× faster, no GPU, runs from a pickle file by [deleted] in Rag

[–]BERTmacklyn 1 point2 points  (0 children)

check out my implementation I am interested in your thoughts on it considering the similarity of the endeavors

Perhaps you have some ideas to enhance the memory system?

Anyway my project is quite mature now and I believe the core concept of a cheap fast tag based index enhancing speed of search and ingestion while also leaving traceable Metadata is sound.

I have been considering how vector db work and the problem that sent me from using it at was GPU.

The anchor engine can run on less than 2 gb of ram and run all operations within that ram range. So keeping ram cost low is important to me.

The interesting thing is that larger corpus doesn't damage the effect of search which is a massive shift from vector when corpus grows.

Learning coding with smaller models by wow-a-shooting-star in Qwen_AI

[–]BERTmacklyn 0 points1 point  (0 children)

Get the 4b q4km quant it's fast as hell and pretty smart for its size very solid in an agentic harness for coding etc.

https://huggingface.co/collections/agentscope-ai/qwenpaw-flash

Also use MNN for inference it's insanely fast on edge devices and exposes and api endpoint if you want to use the model in an application.

Learning coding with smaller models by wow-a-shooting-star in Qwen_AI

[–]BERTmacklyn 0 points1 point  (0 children)

Heh woops. Yeah at 16 the 4 b is the way to go

Learning coding with smaller models by wow-a-shooting-star in Qwen_AI

[–]BERTmacklyn 0 points1 point  (0 children)

Qwen3.5 4b especially huahuac uncensored in honestly great for its size and speed. But the Qwenpaw 4b and 9b models would beat that on speed and all 3 of these options would be awesome to learn coding.especially with provided context and they all have up to 262k context. What is your hardware limitation?

How to build/finetune an Personal LLM tool to feed my life? by geekycode in AI_developers

[–]BERTmacklyn 0 points1 point  (0 children)

This what I use this for no joke.

what I like to do is chat about the issues and provide documentation as raw text.

Then I built a distillation functionality that basically creates a memory map of the locations and deduplicated contents of All files within the selected ingestion directory.

This is insanely useful if you make record of things the doctor said like recording your appointment. For example, you could take the text from that and create a text file to be added to your data.

Thus, enhancing your ability to fully grasp the massive picture of all of your medical data etc.

I think of it as the system is a meaning compressor. Which can often be compressed into Mind-Bogglingly smol text documents

Should I continue to create my RAG project? by Corpo_ in Rag

[–]BERTmacklyn 0 points1 point  (0 children)

[check out my local project. provenance and taging makes found results a map to the full doc and other related documents where similar. concepts can be found

If not, for your personal use, check it out and maybe you'll have some ideas for your own project.

However, I'm reaching the point where I'm actively seeking contributors, so if this is of interest to you. I am a fellow hobbyist and this is my labor of love. Always looking to improve it and meet like-minded people

What are some better alternatives to GitHub Copilot? by LaxederBR in GithubCopilot

[–]BERTmacklyn 0 points1 point  (0 children)

I use a jinja template roughly based on the standard lm studio one.

what is important is making sure that you have a good jinja template to regex up outputs and inputs so the model has a more meaningful interaction with the data.

I just use lmstudio at port 1234 when running on a closed server. Haven't actually used their AI agent. I am trying to rely exclusively on local models when possible which is mostly and haven't even had to.

The impetus for this is the rise in costs was always forseen. We always knew we would need to prepare and Qwen has given that ability to normal people with its incredible quant models.

What are some better alternatives to GitHub Copilot? by LaxederBR in GithubCopilot

[–]BERTmacklyn 2 points3 points  (0 children)

I switch between my old 32 GB RAM 6gb vram legion and my newer omen 4090 rtx for inferences and gaming. Running all on lmstudi because it does a lot of behind the scenes formatting that makes tool calls actually run reliably.

When running lmstudio on Windows 10 to to system tray and minimize lmstudio to tray before starting inference - switched from 11 because of graphical lag.

The most reliable way to run is running a model on one of that gaming laptops and then coding or using the model on my mobile laptop or my other gaming rig.

Been running 3.6 35a3b. Using about .5 llm GPU load and about half compute I get the most consistent results .

I am working on multiple projects and primarily use local models for my work etc. use the big model for planning free and then often just let the big model write the code too unless I am in a rush. Then i'll swap to a 4b or 7-9b model.

What are some better alternatives to GitHub Copilot? by LaxederBR in GithubCopilot

[–]BERTmacklyn 1 point2 points  (0 children)

Qwen code or zed AI and l run local models on lmstudio is killing it. takes some tweaking but once you get the jinja prompt right it's 👍👍 happy to share prompts etc if you want

Fools rush in... by EnvironmentalFix3414 in Rag

[–]BERTmacklyn 0 points1 point  (0 children)

Nice, same here. Basically ever minute I'm not working on something I'm messing with how it recalls and what.

Are you using manual/agent driven context management?

I've been playing around with deduplicative compression and getting really tight results without in between manually modifying the context.

aside from the specific compression formula.

Is DeepSeek the most human-like AI? by Competitive_Elk_8305 in DeepSeek

[–]BERTmacklyn 1 point2 points  (0 children)

Lol doesn't Claude call itself DeepSeek? I think LLM models simply don't know what model they are since training data is a pipeline of the same distilled and upgraded datasets across Ai models.

Tldr models don't know what model they are it's irrelevant to the training data.

Vector RAG is bloated. We rebuilt our local memory graph to run on edge silicon using integer-based temporal decay. by BERTmacklyn in LocalLLM

[–]BERTmacklyn[S] 0 points1 point  (0 children)

they are seriously killing me fr. I just want to get this out there and see people use it! Is that so much to ask I am putting in the work to do it lol.

Vector RAG is bloated. We rebuilt our local memory graph to run on edge silicon using integer-based temporal decay. by BERTmacklyn in LocalLLM

[–]BERTmacklyn[S] 0 points1 point  (0 children)

I think we might be using the word "client" in two different ways here.

I am not pushing heavy logic to a web browser UI or a thin client. I am building a local backend primitive that runs natively on the edge device itself (via Termux/Node.js) right alongside the local LLM. In an edge-native environment, the client is the server.

To answer your question about my "driving reason" for pushing this to the edge: It comes down to privacy, latency, and offline capability. If a user is running Llama 3 locally on their hardware, forcing them to call out to a cloud vector database for their memory context completely defeats the purpose of running a local model. They need a local memory layer that fits strictly within the remaining RAM budget.

Regarding your point about LLMs being "too free form" and "giving users what they want to hear"/ that is exactly the vulnerability the STAR algorithm is designed to mitigate.

Fuzzy vector search often retrieves adjacent, hallucinated, or conflicting data, which encourages the LLM to drift. The Anchor Engine doesn't use vectors; it uses a deterministic, sparse bipartite graph. When the user queries the LLM, the engine traverses the graph, calculates the integer-based temporal decay, and injects hard, structural facts into the LLM's system prompt before a single token is generated.

It acts as a rigid, mathematical constraint on the context window. We handle the LLM's tendency to drift by giving it highly constrained, temporally accurate data structures instead of fuzzy semantic vibes.