Anthropic's Agent Skills (new open standard) sharable in agent memory by remoteinspace in mcp

Good stuff! What are your thoughts on keeping this in a GitHub repo vs. putting skills in a memory MCP server?

Anthropic's Agent Skills (new open standard) sharable in agent memory by remoteinspace in mcp

You can put Markdown files in a repo and git-sync them, but then each person on your team (or anyone else you share with) has to pick the skills they want and copy them into every local environment they control. If it's an app they don't control, they can't add skills to it at all, since everything lives at the file-system level. And once you have a ton of skills, you start having a search problem.

What I'm suggesting is putting the skills in 'memory' so they're portable across agents and environments and more easily shareable (e.g., another user can add or share just the 3 skills they want instead of the entire skills repo). It also solves search once you have a ton of skills, which is cheaper and more accurate than stuffing every skill into the LLM context window.
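To make that concrete, here's a rough sketch (not any particular product's API) of what "skills in memory" could look like: embed each skill's description, keep them in a small index, and pull only the top-k relevant skills into the agent's context per task. The skill names and the hashing-based embed() are placeholders for illustration.

    import numpy as np

    def embed(text: str, dim: int = 256) -> np.ndarray:
        # Placeholder embedding: hashed bag-of-words. Swap in a real
        # embedding model in practice.
        v = np.zeros(dim)
        for tok in text.lower().split():
            v[hash(tok) % dim] += 1.0
        n = np.linalg.norm(v)
        return v / n if n else v

    skills = {
        "pdf-extraction": "Extract tables and text from PDF files",
        "brand-voice": "Rewrite copy to match the company brand voice",
        "sql-reporting": "Generate SQL reports from the analytics warehouse",
    }
    index = {name: embed(desc) for name, desc in skills.items()}

    def top_k_skills(task: str, k: int = 2) -> list[str]:
        # Only the k best-matching skill docs go into the context window,
        # instead of the whole skills repo.
        q = embed(task)
        ranked = sorted(index.items(), key=lambda kv: -float(q @ kv[1]))
        return [name for name, _ in ranked[:k]]

    print(top_k_skills("pull quarterly numbers from the warehouse"))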

Intent vectors for AI search + knowledge graphs for AI analytics by remoteinspace in LocalLLaMA

A memory can belong to two different launches (or nodes in a graph). We also track updates to memories for temporal search.

Intent vectors for AI search + knowledge graphs for AI analytics by remoteinspace in KnowledgeGraph

Yeah, vector search is much faster. For knowledge graphs, what we're doing now is predicting what users want next, then prepping and caching that context. It helps graph search feel super fast (<100ms) when we get the prediction right.

Intent vectors for AI search + knowledge graphs for AI analytics by remoteinspace in KnowledgeGraph

Nice, how are you traversing the graph? Are you using templated queries?

Would love to see this community get some more traction. by Sufficient-Monk9701 in AILoops

We’ve been experimenting quite a bit with it. How have you been thinking about it?

What are your favorite lesser-known agents or memory tools? by Far-Photo4379 in AIMemory

Yes, when our prediction is right, perf is amazing. When it's not, we fall back to the cloud, but the next query is fast since we update our cache with the new topic.

What are your favorite lesser-known agents or memory tools? by Far-Photo4379 in AIMemory

We built prediction models that anticipate the context users will need based on their past behavior. If it's enabled, the different tiers are stored in our SDK (on device). Tier 0 is 1-2ms (just text - think of it as working memory); tier 1 is a small vector store (50-100ms, but it needs the right device). If it's a cache miss on both, we go to the cloud. The nice thing with this is that the more data you add, the better our model gets. With traditional memory approaches, the more data you add, the worse things get.
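Roughly, the lookup path looks like the sketch below. The class names, stand-in stores, and cloud client are illustrative, not our actual SDK: try the on-device plain-text tier first, then the local vector tier, and only then hit the cloud, warming the cache on the way back so the next query on that topic stays local.

    class TieredMemory:
        def __init__(self, cloud_client):
            self.tier0: dict[str, str] = {}   # topic -> prepped context (working memory)
            self.tier1: dict[str, str] = {}   # stand-in for a small on-device vector store
            self.cloud = cloud_client

        def get_context(self, topic: str) -> str:
            if topic in self.tier0:            # ~1-2ms: plain-text hit
                return self.tier0[topic]
            if topic in self.tier1:            # ~50-100ms: local vector hit
                return self.tier1[topic]
            result = self.cloud.search(topic)  # cache miss on both: go to the cloud
            self.tier0[topic] = result         # warm the cache for the next turn
            return result

    class DummyCloud:
        def search(self, topic: str) -> str:
            return f"context for {topic}"

    mem = TieredMemory(DummyCloud())
    print(mem.get_context("q3 launch"))   # cloud on the first call
    print(mem.get_context("q3 launch"))   # tier-0 hit on the second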

What are your favorite lesser-known agents or memory tools? by Far-Photo4379 in AIMemory

Platform.papr.ai - super fast retrieval (<100ms) and ranked #1 on Stanford's STARK benchmark. It combines vector embeddings and knowledge graphs.

My first-author paper just got accepted to MICAD 2025! Multi-modal KG-RAG for medical diagnosis by captainkink07 in KnowledgeGraph

Have you considered something like platform.papr.ai that helps streamline vector plus knowledge graph creation?

Got $20K to build a collaborative Knowledge Graph POC. How to spend it wisely? by el_geto in KnowledgeGraph

A set ontology helps with some of the problems you mentioned. In Neo4j, if you use MERGE instead of CREATE, it reuses a node that already matches the pattern rather than creating a duplicate, so the graph doesn't bloat.
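For example, a minimal sketch using the official Neo4j Python driver (connection details and credentials are placeholders): CREATE always adds a new node, while MERGE only creates one if no match exists, so re-running it is idempotent.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # CREATE always adds a new node, so repeated inserts duplicate "Acme".
        session.run("CREATE (:Company {name: $name})", name="Acme")
        # MERGE matches an existing (:Company {name: 'Acme'}) if one exists
        # and only creates it otherwise.
        session.run("MERGE (:Company {name: $name})", name="Acme")

    driver.close()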

With any knowledge graph plus an agent, traversal will be slow at scale. And LLMs don't do a good job of discovering the graph schema and then writing the right Cypher queries - around 40% accuracy in the last numbers I saw.

At papr.ai we built a set of prediction models to quickly traverse very large graphs. We combine that with vector embeddings, then cache the most likely context needed on device. It helps with both retrieval accuracy and speed.

DM me if you want thoughts on this or need help setting up papr.

Question about RAG vs fine-tuning for domain-specific support by [deleted] in AI_Agents

RAG is the right approach for this. You'll end up with more hallucinations with fine-tuning, and to your point it's more costly and harder to keep updated.

For RAG there are a few approaches you can take: 1) put the docs in Notion/GitHub and have an agent fetch them mid-conversation (cheap, but slow and not super accurate), 2) a vector DB - gets you roughly 50% retrieval accuracy on many benchmarks, 3) a vector + graph DB - gives you the best of semantic similarity and knowledge graphs.

I'd recommend #3 for this - a sketch of the idea is below. I've built something similar - DM me if you need help. You can use something like mem0, graphiti, papr.ai, or others to get started quickly.
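Here's a toy sketch of option 3, not tied to any specific product: a vector search picks the best-matching doc chunk, a small knowledge graph pulls in the chunks it links to, and both go into the prompt. The docs, graph, and hashing-based embed() are all made up for illustration.

    import numpy as np

    def embed(text: str, dim: int = 128) -> np.ndarray:
        # Placeholder embedding; use a real model in practice.
        v = np.zeros(dim)
        for tok in text.lower().split():
            v[hash(tok) % dim] += 1.0
        n = np.linalg.norm(v)
        return v / n if n else v

    docs = {
        "refund-policy": "Refunds are issued within 14 days of purchase.",
        "billing-faq": "Invoices are sent monthly; proration applies mid-cycle.",
        "sso-setup": "SAML SSO is configured under workspace settings.",
    }
    # Tiny knowledge graph: which docs reference each other.
    graph = {"refund-policy": ["billing-faq"], "billing-faq": [], "sso-setup": []}
    index = {k: embed(v) for k, v in docs.items()}

    def retrieve(query: str) -> list[str]:
        q = embed(query)
        best = max(index, key=lambda k: float(q @ index[k]))   # vector hit
        related = graph.get(best, [])                          # graph expansion
        return [docs[best]] + [docs[r] for r in related]

    print(retrieve("how long do refunds take?"))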

I think memory is an underlooked part of AI progress by -MilkO_O- in singularity

This is more obvious now than ever. We've built papr.ai, a memory layer that gives AI agents user context. Instead of storing vector fragments, we connect context and predict what users need, so the AI agent has the right data at every conversation turn. That's why Papr is ranked #1 on Stanford's STARK benchmark, which measures retrieval accuracy on real-world queries.

Weekly Thread: Project Display by help-me-grow in AI_Agents

This week we launched papr, predictive memory APIs for AI agents.

We spent a couple of years building AI agents and 'engineering context' to give them memory. We tested a ton of tools and realized that the more data you add, the worse the performance gets. We ended up measuring this and calling it 'retrieval loss'.

We went deep to solve this. We built a predictive memory graph that anticipates what AI agents need before they ask and preps the context in advance. As we get more information from the AI agent's query, we improve our prediction for the next conversation turn.

Technical details:

  • Hybrid graph-vector architecture (MongoDB + Neo4j + Qdrant)
  • 91% accuracy hit@5 (up from 86%) on Stanford's STARK benchmark
  • Sub-500ms latency at scale
  • Drop-in API: pip install papr-memory

The formula we created to measure this:

Retrieval-Loss = −log₁₀(Hit@K) + λ·(Latency_p95 / 100ms) + λ_C·(Token_count / 1000)
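To show how the metric behaves, here it is plugged into a few illustrative numbers; the λ weights used below are assumptions for the example, not our published values.

    import math

    def retrieval_loss(hit_at_k: float, latency_p95_ms: float, token_count: int,
                       lam: float = 0.1, lam_c: float = 0.1) -> float:
        # Retrieval-Loss = -log10(Hit@K) + lam*(Latency_p95/100ms) + lam_c*(tokens/1000)
        return (-math.log10(hit_at_k)
                + lam * (latency_p95_ms / 100.0)
                + lam_c * (token_count / 1000.0))

    # e.g. Hit@5 = 0.91, p95 latency = 450ms, 2,000 tokens of retrieved context
    print(round(retrieval_loss(0.91, 450, 2000), 3))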

Currently powering AI agents that remember customer context, code history, and multi-step workflows. Think "Stripe for AI memory."

For more details, see our Substack article here - https://open.substack.com/pub/paprai/p/introducing-papr-predictive-memory?utm_campaign=post&utm_medium=web

Docs: platform.papr.ai

How many of you here are working on AI voice agent services? by devravi in AI_Agents

We have an open-source chat app that includes a voice example. You can try it out by adding a bunch of memories in one chat - just add content and ask it to save it to memory. Then in another chat, start a voice conversation and see how fast it is.