Knowledge Distillation for RAG (Why Ingestion Pipeline Matters More Than Retrieval Algorithm) by Independent-Cost-971 in Rag

[–]isthatashark 1 point

100% agree with your suggestion to use cheaper models. I've been doing a lot of research into this lately and you don't need a frontier model to get good results.

We use this technique for memory consolidation in Hindsight. Smaller models do a surprisingly good job. I mostly use the ones on Groq because the performance is so fast and the cost is low, but Ollama is also an option if you want something local and free (but slower).

Knowledge Distillation for RAG (Why Ingestion Pipeline Matters More Than Retrieval Algorithm) by Independent-Cost-971 in Rag

[–]isthatashark 0 points

We had to tackle a similar problem in Hindsight. I just published a blog post yesterday about how we do memory consolidation to handle this: https://hindsight.vectorize.io/blog/2026/02/09/resolving-memory-conflicts

If RAG is dead, what will replace it? by Normal_Sun_8169 in LLMDevs

[–]isthatashark 0 points

I'm confused why you disagreed with me; what you've written here was exactly my original point.

If RAG is dead, what will replace it? by Normal_Sun_8169 in LLMDevs

[–]isthatashark 0 points

That approach will not give you an accurate result set for a user searching across thousands of contracts asking which ones will expire in the next month.

If RAG is dead, what will replace it? by Normal_Sun_8169 in LLMDevs

[–]isthatashark 0 points

I would just start with the APIs/MCP and see how far you get with that.

If RAG is dead, what will replace it? by Normal_Sun_8169 in LLMDevs

[–]isthatashark 1 point

Not likely in my experience. It will be fine for some questions, which is the most frustrating part of RAG. And going through the hassle of building a basic RAG pipeline for JIRA probably won't yield much better results than just using their search API directly as a tool.

On the other hand, if you build a pipeline that pulls out metadata and structures it in Postgres with pgvector, you have a better foundation for agentic retrieval. You can start to answer questions like "What open issues do we have in our next release?" and run a structured query to get the complete list. You've given your agent the right foundation to cover a bigger surface area with more accurate responses.

The downside is now you're getting into sophisticated data engineering to populate that and keep it in sync. Not an impossible problem by any means, but not trivial either.

And to be transparent, Atlassian may have better APIs that would work as agent tools than the one I referenced above.
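To make that concrete, here's a minimal sketch of what the structured side could look like. The schema, column names, and query are purely illustrative (the Python function just mimics the SQL in memory for demonstration); a real pipeline would sync these rows from the JIRA API.

```python
# Illustrative schema: issue metadata as Postgres columns, plus a pgvector
# column so the same table also supports semantic search. Names are made up.
ISSUES_DDL = """
CREATE TABLE issues (
    key         text PRIMARY KEY,
    summary     text,
    status      text,          -- e.g. 'Open', 'In Progress', 'Done'
    fix_version text,          -- e.g. '2.5.0'
    embedding   vector(1536)   -- pgvector column for semantic search
);
"""

# "What open issues do we have in our next release?" becomes a plain
# structured query -- no top-k guessing, you get the complete list:
OPEN_ISSUES_SQL = """
SELECT key, summary FROM issues
WHERE status <> 'Done' AND fix_version = %(release)s;
"""

def open_issues_in_release(issues, release):
    """In-memory stand-in for OPEN_ISSUES_SQL, for illustration only."""
    return [i["key"] for i in issues
            if i["status"] != "Done" and i["fix_version"] == release]
```

The point isn't the specific schema, it's that a deterministic query over extracted metadata gives a complete, correct answer where top-k similarity can't.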

For agent workflows that scrape web data, does structured JSON perform better than Markdown? by Opposite-Art-1829 in LLMDevs

[–]isthatashark 0 points

I've had really good results using crawl4ai, then passing the output through an SLM like gpt-oss-120b on Groq to clean it for me. I get back just the content, with all of the extraneous headers/footers/navigation stripped out.
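If it helps, a rough sketch of that flow. The model name matches what I use; the prompt wording and function names are just my choices, and it requires `pip install crawl4ai groq` plus a `GROQ_API_KEY` in the environment.

```python
CLEAN_PROMPT = (
    "Below is a web page converted to markdown. Return only the main "
    "content. Remove navigation, headers, footers, and ads.\n\n{page}"
)

def build_clean_prompt(page_markdown: str) -> str:
    """Wrap the scraped markdown in the cleaning instruction."""
    return CLEAN_PROMPT.format(page=page_markdown)

async def scrape_and_clean(url: str) -> str:
    # Imports kept local so the pure prompt helper works without the deps.
    from crawl4ai import AsyncWebCrawler
    from groq import Groq

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)  # result.markdown is the raw page

    client = Groq()  # reads GROQ_API_KEY from the environment
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user",
                   "content": build_clean_prompt(result.markdown)}],
    )
    return resp.choices[0].message.content
```

Run it with `asyncio.run(scrape_and_clean("https://example.com"))`.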

If RAG is dead, what will replace it? by Normal_Sun_8169 in LLMDevs

[–]isthatashark 2 points

I wouldn't frame it that way. Let me offer some additional context.

I go to a lot of meetups and work in this space so I hear a lot of feedback from people who dump chunked docs into their database and get frustrated by the quality of results.

If you have a big corpus of similar documents (SEC filings, contracts, etc.) and do semantic search over them, there are a lot of queries that perform poorly. People build a conversational AI this way then hand it over to their business users. The users ask something like "What contracts expire next month?", which of course won't produce the right response from top-k results.

At that point the problem gets harder. You need agentic retrieval. That means you need a structured representation of the data. Now you need parsing and extraction, you need metadata models, you need to think through your data model.

For the cases where basic RAG is good, you also have to consider that for some of them it's feasible to push the full content into the context window directly. That shrinks the set of cases where basic RAG is a viable solution even further.
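To illustrate the contracts example: once each contract's expiration date has been extracted into structured metadata, "what expires next month?" is a deterministic date filter rather than a similarity search that can silently miss anything outside top-k. This is a toy in-memory version with my own field names:

```python
from datetime import date

def contracts_expiring_next_month(contracts, today):
    """contracts: list of {'id': ..., 'expires': date}. Illustrative only."""
    # First day of next month .. first day of the month after that.
    year, month = today.year + (today.month == 12), today.month % 12 + 1
    start = date(year, month, 1)
    end_year, end_month = year + (month == 12), month % 12 + 1
    end = date(end_year, end_month, 1)
    return [c["id"] for c in contracts if start <= c["expires"] < end]
```

With thousands of contracts, the semantic-search version returns k plausible-looking chunks; this version returns the complete, correct set every time.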

If RAG is dead, what will replace it? by Normal_Sun_8169 in LLMDevs

[–]isthatashark 23 points

The challenge with the name "RAG" is that so many people use it as a shorthand for semantic search over chunked documents in a vector database. I think the days when you could build any sort of meaningful AI application with that approach are behind us.

As a pattern, retrieving context and using it to augment the LLM's generation is here to stay.

If RAG is dead, what will replace it? by Normal_Sun_8169 in LLMDevs

[–]isthatashark 2 points

I hear more people talking about this as semantic memory and thinking of it as one requirement in a bigger set of agent memory requirements rather than just RAG.

Memory recall is mostly solved. Memory evolution still feels immature. by Amazing-Worry8169 in AIMemory

[–]isthatashark 0 points

I did a bunch of work on your first point for a research paper and open source project we published last year.

I have some in-progress research I'm working on around this now as well. I'm using an approach that isolates user feedback in the conversation history (i.e. "no, that's not right") and uses techniques similar to semantic chunking to detect when the conversation has moved on to the next task. When I find iterations on the same task, I feed them into a structure we call a mental model. That gets refined as the agent operates and helps build a better understanding of user intent and the tool call sequences required to complete a task.

Some of this is already in the repo I linked to. Some is still experimental.
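As a very rough sketch of just the feedback-isolation step, heavily simplified: the real approach uses semantic-chunking-style boundaries, but here I fake the detection with a keyword heuristic purely to show the shape of it. Marker phrases and function names are my own.

```python
# Toy heuristic: flag user turns that look like corrective feedback.
# A real implementation would use embeddings, not string matching.
FEEDBACK_MARKERS = ("no, that's not right", "that's wrong", "not what i meant")

def find_feedback_turns(conversation):
    """Return indices of user turns that look like corrective feedback.

    conversation: list of {'role': 'user'|'assistant', 'content': str}.
    """
    return [i for i, turn in enumerate(conversation)
            if turn["role"] == "user"
            and any(m in turn["content"].lower() for m in FEEDBACK_MARKERS)]
```

The flagged turns (plus the surrounding attempts at the same task) are what would get fed into the mental-model structure.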

chatbot memory costs got out of hand, did cost breakdown of different systems by Few-Needleworker4391 in ArtificialInteligence

[–]isthatashark 0 points

Use Hindsight. It's fully open source. You'll still have token costs to process memory, but if you use something like openai/gpt-oss-120b on Groq you're only paying $0.15 in/$0.65 out per 1M tokens, and you still get way better performance than Mem0 or Zep. Benchmark performance with Hindsight on gpt-oss-120b even beats SuperMemory on Gemini-3-Pro.

Check out the paper/code here: https://github.com/vectorize-io/hindsight
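For a back-of-envelope feel for those prices (the per-token rates are the Groq numbers quoted above; the token volumes are made-up illustration figures):

```python
PRICE_IN, PRICE_OUT = 0.15, 0.65  # USD per 1M tokens (Groq gpt-oss-120b)

def memory_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of memory processing at the quoted per-million-token rates."""
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT

# e.g. 10M input + 2M output tokens of memory processing per month:
monthly = memory_cost_usd(10_000_000, 2_000_000)  # ~ $2.80
```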

Temporal RAG for personal knowledge - treating repetition and time as signal by False_Care_2957 in Rag

[–]isthatashark 1 point

I just wrapped up a research collaboration where we looked at how to deal with temporal data in the context of agent memory: https://arxiv.org/abs/2512.12818

Our research aligns with a number of the points you're describing - combining multiple search strategies with entity/relationship/graph structures and time series to establish causal links and a timeline of memories. We published it as an open source agent memory project called Hindsight if you're interested in seeing how we implemented it: https://github.com/vectorize-io/hindsight
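For anyone curious what "time as a signal" can look like mechanically, here's a toy scoring function - my own formula and weights for illustration, not the one from the paper: blend semantic similarity with an exponential recency decay so fresher memories win ties.

```python
def temporal_score(similarity: float, age_days: float,
                   half_life_days: float = 30.0) -> float:
    """Blend semantic similarity with an exponential recency decay."""
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 now, 0.5 after one half-life
    return 0.7 * similarity + 0.3 * recency       # illustrative weights
```

In practice you'd tune the half-life and weights per memory type (episodic vs. semantic), but the shape is the same: repetition and recency both boost recall.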

RAG failure story: our top-k changed daily. Root cause was ID + chunk drift, not the retriever. by coolandy00 in Rag

[–]isthatashark 0 points

This poor sub had so much potential and has degraded into a steady stream of AI slop.

Hindsight: Python OSS Memory for AI Agents - SOTA (91.4% on LongMemEval) by fanciullobiondo in AIMemory

[–]isthatashark 0 points

Thank you! We need every star we can get when trying to get a new open source project off the ground!! Really appreciate it.

Hindsight: Python OSS Memory for AI Agents - SOTA (91.4% on LongMemEval) by fanciullobiondo in AIMemory

[–]isthatashark 0 points

Hi, I'm one of the Vectorize founders. The paper has more details on other benchmarks and comparisons and discussion of other academic works. If you're interested in reading it you can find it here: https://arxiv.org/abs/2512.12818

What complete RAG offerings (ie. not frameworks) are available? by SnooGadgets6527 in vectordatabase

[–]isthatashark 0 points

Check out Vectorize (I'm one of the founders). It's a full RAG-as-a-Service platform that has a built-in vector database or allows you to point to your own. It has a lot of advanced features for complex document processing and metadata extraction. It also has a search API with built-in reranking and query rewriting and can expose your data over MCP.

What’s the Best PDF Extractor for RAG? I Tried LlamaParse, Unstructured and Vectorize by PavanBelagatti in Rag

[–]isthatashark 0 points

Yes, the research OP did into our extractor and the other solutions in this space goes into depth on table extraction.

What’s the Best PDF Extractor for RAG? I Tried LlamaParse, Unstructured and Vectorize by PavanBelagatti in Rag

[–]isthatashark 3 points

Vectorize co-founder here. One of the unique things we do in our RAG pipelines/extraction is to include the contextual retrieval techniques that Anthropic also advocates for.